QUERYING (BIG) DATA ON NOSQL STORES

1 QUERYING (BIG) DATA ON NOSQL STORES
Genoveva Vargas Solar, French Council of Scientific Research, LIG-LAFMIA, France

2 DATA PROCESSING
Process the data to produce other data: analysis tools, business intelligence tools, ...
This means:
- Handling large volumes of data
- Managing thousands of processors
- Parallelizing and distributing treatments
- Scheduling I/O
- Managing fault tolerance
- Monitoring and controlling processes
Map-Reduce provides all of this easily!

3 NOSQL QUERY EXECUTION
1. Data organization: distributed file system, indexing
2. Query programming: Map-Reduce programming patterns (Query -> Map -> Reduce)
3. Query execution: execution model, data processing properties
Optimizing query processing -> data compression, location, R/W strategies, cache/memory organization
[Figure: master-slave replication. All updates are made to the master; changes propagate to the slaves; reads can be done from the master or the slaves.]

4 DATA ORGANIZATION: THE NOSQL CASE

5 NOSQL STORES: DATA MANAGEMENT PROPERTIES
Indexing
- Distributed hashing, as in Memcached (an open-source cache)
- In-memory indexes are scalable when distributing and replicating objects over multiple nodes
- Partitioned tables
High availability and scalability: eventual consistency
- Data fetched are not guaranteed to be up to date
- Updates are guaranteed to be propagated to all nodes eventually
Shared-nothing horizontal scaling
- Replicating and partitioning data over many servers
- Support for a large number of simple read/write operations per second (OLTP)
No ACID guarantees
- Updates are eventually propagated, but with limited guarantees on read consistency
- BASE: basically available, soft state, eventually consistent
- Multi-version concurrency control

6 PROBLEM STATEMENT: HOW MUCH TO GIVE UP?
- The three properties: fault-tolerant partitioning, availability, consistency
- CAP theorem [1]: a system can have at most two of the three properties
- NoSQL systems typically sacrifice consistency
[1] Eric Brewer, "Towards robust distributed systems." PODC /PODC- keynote.pdf

7 NOSQL STORES: AVAILABILITY AND PERFORMANCE
Replication
- Copy data across multiple servers (each bit of data can be found on multiple servers)
- Increases data availability
- Faster query evaluation
Sharding
- Distribute different data across multiple servers
- Each server acts as the single source of a data subset
Replication and sharding are orthogonal techniques

8 REPLICATION: PROS & CONS
(+) Data is more available
- Failure of a site containing E does not result in unavailability of E if replicas exist
(+) Performance
- Parallelism: queries are processed in parallel on several nodes
- Reduced data transfer for local data
(-) Increased update cost
- Synchronisation: each replica must be updated
(-) Increased complexity of concurrency control
- Concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented

9 SHARDING: WHY IS IT USEFUL?
Scaling applications by reducing the data set in any single database
- Segregating data
- Sharing application data
- Securing sensitive data by isolating it
Improve read and write performance
- A smaller amount of data in each user group implies faster querying
- Isolating data into smaller shards: accessed data is more likely to stay in cache
- More write bandwidth: writing can be done in parallel
- Smaller data sets are easier to back up, restore and manage
Massively parallel work
- Parallel work: scale out across more nodes
- Parallel backend: handling higher user loads
- Share nothing: very few bottlenecks
Improve resilience and availability
- If a box goes down, the others still operate
- But: part of the data is missing
[Figure: a load balancer in front of web servers (Web 1 to Web 3) and caches (Cache 1 to Cache 3), backed by two MySQL masters, one for the resume database and one for the site database.]

10 SHARDING AND REPLICATION
Sharding with no replication: unique copy, distributed data sets
(+) Better concurrency levels (shards are accessed independently)
(-) Cost of checking constraints and rebuilding aggregates
- Ensure that queries and updates are distributed across shards
Replication of shards
(+) Query performance (availability)
(-) Cost of updating and of checking constraints, complexity of concurrency control
Partial replication (most of the time)
- Only some shards are duplicated

11 QUERY PROGRAMMING: DATA PROCESSING USING MAP-REDUCE

12 MAP-REDUCE
- Programming model for expressing distributed computations on massive amounts of data
- Execution framework for large-scale data processing on clusters of commodity servers
- Market: any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing content must tackle large-data problems
  - Data-intensive processing is beyond the capability of any individual machine and requires clusters
  - Large-data problems are fundamentally about organizing computations on dozens, hundreds, or even thousands of machines
«Data represent the rising tide that lifts all boats: more data lead to better algorithms and systems for solving real-world problems»

13 COUNTING WORDS
(URI, document) -> (term, count)
Input documents: "see bob throw" and "see spot run"
Map: (see, 1) (bob, 1) (throw, 1) | (see, 1) (spot, 1) (run, 1)
Shuffle/Sort: bob <1>, run <1>, see <1,1>, spot <1>, throw <1>
Reduce: (bob, 1) (run, 1) (see, 2) (spot, 1) (throw, 1)
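
The word-count flow on this slide can be sketched in a few lines of plain Python. This is a single-process illustration, not Hadoop code: the function names (`map_doc`, `shuffle`, `reduce_counts`) and the document URIs are made up for the example.

```python
from collections import defaultdict

def map_doc(uri, document):
    """Map: emit a (term, 1) pair for every term in the document."""
    return [(term, 1) for term in document.split()]

def shuffle(pairs):
    """Shuffle/sort: group the values by key and return keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_counts(term, counts):
    """Reduce: sum the partial counts of one term."""
    return (term, sum(counts))

# The two input documents from the slide; the URIs are hypothetical.
documents = {"doc1": "see bob throw", "doc2": "see spot run"}

intermediate = [pair for uri, text in documents.items()
                for pair in map_doc(uri, text)]
grouped = shuffle(intermediate)
word_counts = [reduce_counts(term, counts) for term, counts in grouped]
print(word_counts)
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network; here all three steps run in one process only to show the data flow.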

14 MAP REDUCE DESIGN PATTERNS
SUMMARIZATION
- Numerical: minimum, maximum, count, average, median, standard deviation
- Inverted index. Example: Wikipedia inverted index
- Counting with counters: count number of records, a small number of unique instances, summations. Example: number of users per state
FILTERING
- Filtering: closer view of data, tracking event threads, distributed grep, data cleansing, simple random sampling, removing low-scoring data
- Bloom: remove most of the nonwatched values, prefiltering data for a set membership check. Example: hot list, HBase query
- Top ten: outlier analysis, selecting interesting data, catchy dashboards. Example: top ten users by reputation
- Distinct: deduplicating data, getting distinct values, protecting from inner join explosion. Example: distinct user ids
DATA ORGANIZATION
- Structured to hierarchical: prejoining data, preparing data for HBase or MongoDB. Example: post/comment building for StackOverflow, question/answer building
- Partitioning. Example: partitioning users by last access date
- Binning. Example: binning by Hadoop-related tags
- Total order sorting. Example: sort users by last visit
- Shuffling. Example: anonymizing StackOverflow comments
JOIN
- Reduce side join: multiple large data sets joined by foreign key. Example: user comment join
- Reduce side join with Bloom filter. Example: reputable user comment join
- Replicated join. Example: replicated user comment join
- Composite join. Example: composite user comment join
- Cartesian product. Example: comment comparison

15 ELEMENTS TO THINK ABOUT EFFICIENT EXECUTION

16 HADOOP INFRASTRUCTURE

17 HADOOP FRAMEWORK
- Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data
- Hadoop MapReduce: a software framework for distributed processing of large data sets on compute clusters
- HBase: a scalable, distributed database that supports structured data storage for large tables
- Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying
- Chukwa: a data collection system for managing large distributed systems
- Pig: a high-level data-flow language and execution framework for parallel computation
- ZooKeeper: a high-performance coordination service for distributed applications

18 DISTRIBUTED FILE SYSTEM
- Abandons the separation of computation and storage as distinct components in a cluster
  - Google File System (GFS) supports Google's proprietary implementation of MapReduce
  - In the open-source world, HDFS (Hadoop Distributed File System) is an open-source implementation of GFS that supports Hadoop
- The main idea is to divide user data into blocks and replicate those blocks across the local disks of nodes in the cluster
- Adopts a master-slave architecture
  - The master (the namenode in HDFS) maintains the file namespace (metadata, directory structure, file-to-block mapping, location of blocks, and access permissions)
  - The slaves (the datanodes in HDFS) manage the actual data blocks

19 HDFS GENERAL ARCHITECTURE
- An application client wishing to read a file (or a portion thereof) must first contact the namenode to determine where the actual data is stored
- The namenode returns the relevant block id and the location where the block is held (i.e., which datanode)
- The client then contacts the datanode to retrieve the data
- HDFS lies on top of the standard OS stack (e.g., Linux): blocks are stored on standard single-machine file systems

20 HDFS PROPERTIES
- By default, HDFS stores three separate copies of each data block to ensure reliability, availability, and performance
- In large clusters, the three replicas are spread across different physical racks, so HDFS is resilient towards two common failure scenarios: individual datanode crashes, and failures in networking equipment that bring an entire rack offline
- Replicating blocks across physical machines also increases opportunities to co-locate data and processing in the scheduling of MapReduce jobs, since multiple copies yield more opportunities to exploit locality
To create a new file and write data to HDFS:
- The application client first contacts the namenode
- The namenode updates the file namespace after checking permissions and making sure the file doesn't already exist, then allocates a new block on a suitable datanode
- The application is directed to stream data directly to that datanode
- From the initial datanode, data is further propagated to additional replicas

21 HADOOP CLUSTER ARCHITECTURE
- The HDFS namenode runs the namenode daemon
- The job submission node runs the jobtracker, which is the single point of contact for a client wishing to execute a MapReduce job
- The jobtracker
  - Monitors the progress of running MapReduce jobs
  - Is responsible for coordinating the execution of the mappers and reducers
  - Tries to take advantage of data locality in scheduling map tasks

22 MAP-REDUCE PHASES
- Initialisation
- Map: record reader, mapper, combiner, and partitioner
- Reduce: shuffle, sort, reducer, and output format
- Partition the input (key, value) pairs into chunks and run map() tasks in parallel
- After all map() tasks have completed, consolidate the values for each unique emitted key
- Partition the space of output map keys and run reduce() in parallel

23 MAP SUB-PHASES
Record reader
- Translates an input split generated by the input format into records
- Parses the data into records, but does not parse the record itself
- Passes the data to the mapper in the form of a key/value pair; usually the key is positional information and the value is the chunk of data that composes a record
Map
- User-provided code is executed on each key/value pair from the record reader to produce zero or more new key/value pairs, called the intermediate pairs
- The key is what the data will be grouped on and the value is the information pertinent to the analysis in the reducer
Combiner (an optional localized reducer)
- Can group data in the map phase
- Takes the intermediate keys from the mapper and applies a user-provided method to aggregate values in the small scope of that one mapper
Partitioner
- Takes the intermediate key/value pairs from the mapper (or combiner) and splits them up into shards, one shard per reducer
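
A minimal sketch of one map task for word counting, showing where the combiner and the partitioner sit. All names (`mapper`, `combiner`, `partitioner`, `run_map_task`) are illustrative and not the Hadoop API. The partitioner uses `zlib.crc32` as a stable hash, which is an assumption made for the example; Python's built-in `hash()` on strings changes between processes and would not be suitable.

```python
import zlib
from collections import Counter

def mapper(offset, line):
    """Map: the key is positional information, the value is one record (a line)."""
    for term in line.split():
        yield term, 1

def combiner(pairs):
    """Combiner: aggregate values locally, within the scope of this one mapper."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def partitioner(key, num_reducers):
    """Partitioner: choose the reducer (shard) responsible for a key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_map_task(records, num_reducers):
    """Run mapper, combiner and partitioner over the records of one input split."""
    intermediate = [pair for offset, line in records
                    for pair in mapper(offset, line)]
    shards = [[] for _ in range(num_reducers)]
    for key, value in combiner(intermediate):
        shards[partitioner(key, num_reducers)].append((key, value))
    return shards

# One split with two records; the record reader is simulated by (offset, line) pairs.
shards = run_map_task([(0, "see bob throw"), (14, "see spot run")], num_reducers=2)
```

Because of the combiner, the split sends a single (see, 2) pair to its reducer instead of two (see, 1) pairs, which reduces the data moved during the shuffle.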

24 REDUCE SUB-PHASES
Shuffle and sort
- Takes the output files written by all of the partitioners and downloads them to the local machine on which the reducer is running
- These individual data pieces are then sorted by key into one larger data list
- The purpose of this sort is to group equivalent keys together so that their values can be iterated over easily in the reduce task
Reduce
- Takes the grouped data as input and runs a reduce function once per key grouping
- The function is passed the key and an iterator over all of the values associated with that key
- Once the reduce function is done, it sends zero or more key/value pairs to the final step, the output format
Output format
- Translates the final key/value pairs from the reduce function and writes them out to a file by means of a record writer
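
The reduce side can be sketched in the same style. The names below (`shuffle_and_sort`, `reduce_sum`, `run_reduce_task`, `output_format`) are hypothetical, and the tab-separated output is just one possible record format chosen for the example.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(partition_outputs):
    """Merge the pieces fetched from every map task and sort them by key."""
    merged = [pair for output in partition_outputs for pair in output]
    merged.sort(key=itemgetter(0))
    return merged

def reduce_sum(key, values):
    """Reduce: called once per key with an iterator over all of its values."""
    yield key, sum(values)

def run_reduce_task(partition_outputs, reducer):
    """Group the sorted pairs by key and apply the reducer to each group."""
    results = []
    sorted_pairs = shuffle_and_sort(partition_outputs)
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        values = (value for _, value in group)
        results.extend(reducer(key, values))
    return results

def output_format(pairs):
    """Output format: serialize the final pairs, one tab-separated record per line."""
    return "\n".join(f"{key}\t{value}" for key, value in pairs)

# Pieces destined for this reducer, as produced by two different map tasks.
from_map_tasks = [[("see", 1), ("bob", 1)], [("see", 1), ("run", 1)]]
reduced = run_reduce_task(from_map_tasks, reduce_sum)
print(output_format(reduced))
```

Sorting before grouping is what lets the reducer see all values of a key in one contiguous run, so it can stream over them with an iterator instead of holding every key in memory.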

25 CASE STUDY

26 EXTENSIBLE RECORD STORES
- Basic data model is rows and columns
- Basic scalability model is splitting rows and columns over multiple nodes
- Rows are split across nodes through sharding on the primary key
  - Split by range rather than by hash function
  - Queries on ranges of values do not go to every node
- Rows are analogous to documents: variable number of attributes, attribute names must be unique
  - Grouped into collections (tables)
- Columns are distributed over multiple nodes using column groups
  - Which columns are best stored together
  - Column groups must be pre-defined with the extensible record stores

SYSTEM       ADDRESS
HBase        hbase.apache.com
HyperTable   hypertable.org
Cassandra    incubator.apache.org/cassandra

27 EXTENSIBLE RECORD DATA MODEL (HBASE EXAMPLE)
- Most basic unit: column
- Each column may have multiple versions
- Each distinct value is contained in a separate cell
- One or more columns form a row, addressed uniquely by a row key
- A number of rows form a table
[Figure: table T1 with rows R-1 to R-n; each row holds column families F-1 and F-2, whose columns (C1, C2, C3) contain cells with several versions (Version 1, Version 2).]

28 DATA ORGANIZATION

29 REFINEMENTS: LOCALITY GROUPS
- Multiple column families can be grouped into a locality group
- A separate SSTable is created for each locality group in each tablet
- Segregating column families that are not typically accessed together enables more efficient reads
- In WebTable, page metadata can be in one group and the contents of the page in another group

30 REFINEMENTS: COMPRESSION
Many opportunities for compression
- Similar values in the same row/column at different timestamps
- Similar values in different columns
- Similar values across adjacent rows
Two-pass custom compression scheme
- First pass: compress long common strings across a large window
- Second pass: look for repetitions in a small window
Speed is emphasized, but space reduction is good (10-to-1)

31 FILTER
- Given a collection of tuples, filtering simply evaluates each record separately and decides, based on some condition, whether it should stay or go
  - Example: scan through a file line by line and only output lines that match a specific pattern
- Simple random sampling: grab a subset of our larger data set in which each record has an equal probability of being selected (decreases the dataset size)
  - Instead of some filter criteria function that bears some relationship to the content of the record, a random number generator produces a value; if the value is below a threshold, keep the record, otherwise toss it out
- Bloom: keep records that are members of some predefined set of values (hot values)
  - For each record, extract a feature of that record. If that feature is a member of a set of values represented by a Bloom filter, keep it; otherwise toss it out (or the reverse)
  - Example: keep or throw away a record if the value in the user field is a member of a predetermined list of users
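
The first two filtering patterns can be sketched as map-only functions over an iterable of records. The function names and the `seed` parameter are introduced for the example only; the seed simply makes a sampling run repeatable.

```python
import random
import re

def grep_filter(lines, pattern):
    """Distributed-grep style filter: keep only the lines that match the pattern."""
    regex = re.compile(pattern)
    return (line for line in lines if regex.search(line))

def sample_filter(records, keep_probability, seed=None):
    """Simple random sampling: keep each record with the same probability."""
    rng = random.Random(seed)
    # rng.random() is uniform in [0, 1): keep the record if it falls below the threshold.
    return (record for record in records if rng.random() < keep_probability)

log_lines = ["error: disk full", "ok", "error: network down"]
errors = list(grep_filter(log_lines, r"^error"))
sample = list(sample_filter(range(1000), keep_probability=0.1, seed=7))
```

Neither filter needs a reduce phase: each record is judged independently, so every mapper can write its kept records directly to the output.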

32 BLOOM FILTER
- A Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set
- The base data structure of a Bloom filter is a bit vector. Here's a small one we'll use to demonstrate
[Figure: an empty bit vector. Each empty cell in that table represents a bit, and the number below it its index.]
- To add an element to the Bloom filter, we simply hash it a few times and set the bits in the bit vector at the index of those hashes to 1
tutorial/
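
A minimal Bloom filter sketch following the description above. The class name, the default sizes, and the way the hash functions are derived (SHA-256 with a different prefix per hash) are assumptions made for the example, not a reference implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # the bit vector, initially all zeros

    def _indexes(self, item):
        """Hash the item a few times; each hash gives one index in the bit vector."""
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set the bit at each hashed index to 1."""
        for index in self._indexes(item):
            self.bits[index] = 1

    def might_contain(self, item):
        """False means definitely not in the set; True means it may be in the set."""
        return all(self.bits[index] for index in self._indexes(item))

hot_users = BloomFilter()
for user in ["alice", "bob"]:
    hot_users.add(user)

# Bloom filtering of records: keep a record only if its user may be a hot user.
records = [{"user": "alice", "text": "hi"}, {"user": "bob", "text": "yo"}]
kept = [r for r in records if hot_users.might_contain(r["user"])]
```

A user that was never added can still pass the check when all of its bits happen to be set by other users (a false positive); a user that was added can never be rejected.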

33 A FORM OF OPTIMIZATION FOR ACCESSING HBASE: SEMI-HANDS-ON (SEE EXERCISE 4)

34 REFINEMENTS: BLOOM FILTERS
- A read operation has to read from disk when the desired SSTable isn't in memory
- The number of accesses can be reduced by specifying a Bloom filter
  - It allows us to ask whether an SSTable might contain data for a specified row/column pair
  - A small amount of memory for Bloom filters drastically reduces the number of disk seeks for read operations
  - Its use implies that most lookups for non-existent rows or columns do not need to touch disk

35 NOSQL DATA PROCESSING PROPERTIES
«[...] only an afterthought and could cause problems once you need to scale the system. And if it does offer scalability, does it imply specific steps to do so? The easiest solution would be to add one machine at a time, while sharded setups (especially those not supporting virtual shards) sometimes require for each shard to be increased simultaneously because each partition needs to be equally powerful.»
Lars George, HBase: The Definitive Guide, O'Reilly

36 NOSQL DATA PROCESSING PROPERTIES

38 SOME BOOKS
- Tom White, Hadoop: The Definitive Guide, O'Reilly, 2011
- Jimmy Lin, Chris Dyer, Data-Intensive Text Processing with MapReduce, Morgan & Claypool, 2010, pages
- Syed Ahson, Mohammad Ilyas, Cloud Computing and Software Services: Theory and Techniques, CRC Press, pages
- Bradley Holt, Writing and Querying MapReduce Views in CouchDB, O'Reilly, 2011, pages 5-29
- Pramod J. Sadalage, Martin Fowler, NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence

39 NOSQL STORES: AVAILABILITY AND PERFORMANCE

40 REPLICATION: MASTER-SLAVE
Master-slave
- Makes one node the authoritative copy/replica that handles writes, while the replicas synchronize with the master and may handle reads
- Helps with read scalability but does not help with write scalability
- Read resilience: should the master fail, slaves can still handle read requests
- Master failure eliminates the ability to handle writes until either the master is restored or a new master is appointed
- The master is a bottleneck and a point of failure
- Biggest complication is consistency
By contrast, when all replicas have the same weight
- Replicas can all accept writes
- The loss of one of them does not prevent access to the data store
- Possible write-write conflict: attempts to update the same record at the same time from two different places
[Figure: all updates are made to the master; changes propagate to the slaves; reads can be done from the master or the slaves.]

41 MASTER-SLAVE REPLICATION MANAGEMENT
Masters can be appointed
- Manually, when configuring the cluster of nodes
- Automatically: when configuring a cluster of nodes, one of them is elected as master. The cluster can appoint a new master when the master fails, reducing downtime
Read resilience
- Read and write paths have to be managed separately, so that a failure in the write path can be handled while reads can still occur
- Reads and writes are put in different database connections if the database library accepts it
Replication comes inevitably with a dark side: inconsistency
- Different clients reading different slaves will see different values if changes have not been propagated to all slaves
- In the worst case a client cannot read a write it just made
- Even if master-slave is used for hot backups, if the master fails any updates not passed on to the backup are lost
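
The inconsistency described above can be reproduced with a toy in-memory model. `MasterSlaveStore` and its methods are hypothetical names; propagation is triggered by hand here, whereas a real store replicates in the background.

```python
class MasterSlaveStore:
    """Toy master-slave store: writes go to the master, slaves lag behind."""

    def __init__(self, num_slaves=2):
        self.master = {}
        self.slaves = [{} for _ in range(num_slaves)]
        self.pending = []  # updates accepted by the master but not yet on the slaves

    def write(self, key, value):
        """All updates are made to the master."""
        self.master[key] = value
        self.pending.append((key, value))

    def read_from_master(self, key):
        return self.master.get(key)

    def read_from_slave(self, index, key):
        """Reads can be served by a slave, possibly returning stale data."""
        return self.slaves[index].get(key)

    def propagate(self):
        """Changes propagate to the slaves (explicitly, for the sake of the demo)."""
        for key, value in self.pending:
            for slave in self.slaves:
                slave[key] = value
        self.pending.clear()

store = MasterSlaveStore()
store.write("cart:42", ["book"])
stale = store.read_from_slave(0, "cart:42")   # the client cannot read its own write yet
store.propagate()
fresh = store.read_from_slave(0, "cart:42")   # visible once the change has propagated
```

If the master failed between `write` and `propagate`, the content of `pending` would be lost, which is the hot-backup caveat mentioned on the slide.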

42 REPLICATION: PEER-TO-PEER
- Allows writes to any node; the nodes coordinate to synchronize their copies
- The replicas have equal weight
- Full performance on writing to any replica
Dealing with inconsistencies
- Replicas coordinate to avoid conflicts
  - Network traffic cost for coordinating writes
  - It is unnecessary to make all replicas agree on a write, only the majority
  - Survives the loss of a minority of the replica nodes
- Policy to merge inconsistent writes
[Figure: nodes communicate their writes; all nodes read and write all data.]
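
A sketch of the majority rule for peer-to-peer writes. The helper names are hypothetical, replicas are plain dictionaries, and the merge policy shown (the highest version number wins) is only one possible policy, assumed here for illustration.

```python
def majority(num_replicas):
    """Smallest number of replicas that forms a majority."""
    return num_replicas // 2 + 1

def quorum_write(replicas, reachable, key, value, version):
    """Accept the write only if a majority of the replicas can be reached."""
    if len(reachable) < majority(len(replicas)):
        return False  # a minority partition must refuse the write
    for index in reachable:
        replicas[index][key] = (version, value)
    return True

def merge_read(replicas, reachable, key):
    """Merge policy (assumed): among the reachable copies, the highest version wins."""
    copies = [replicas[i][key] for i in reachable if key in replicas[i]]
    if not copies:
        return None
    return max(copies, key=lambda copy: copy[0])[1]

replicas = [{}, {}, {}]
ok = quorum_write(replicas, [0, 1], "k", "a", version=1)      # 2 of 3: accepted
rejected = quorum_write(replicas, [2], "k", "b", version=2)   # 1 of 3: refused
value = merge_read(replicas, [1, 2], "k")                     # replica 2 is stale, replica 1 answers
```

Since any two majorities of the same replica set overlap in at least one node, a read that contacts a majority always meets at least one copy of the last accepted write.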

43 REPLICATION: ASPECTS TO CONSIDER
Conditioning
- Performance
- Fault tolerance
- Availability
Important elements to consider
- Data to duplicate
- Copies location
- Duplication model (master-slave / P2P)
- Consistency model (global copies)
- Transparency levels
-> Find a compromise!

44 SHARDING
- Puts different data on separate nodes
  - Each user only talks to one server, so she gets rapid responses
  - The load should be balanced out nicely between servers
- Ensure that
  - data that is accessed together is clumped together on the same node
  - clumps are arranged on the nodes to provide the best data access
- Ability to distribute both data and load of simple operations over many servers, with no RAM or disk shared among servers
- A way to horizontally scale writes
- Improves read performance
- Application/data store support
[Figure: each shard reads and writes its own data.]

45 SHARDING
Database laws
- Small databases are fast
- Big databases are slow
- Keep databases small
Principle
- Start with a big monolithic database
- Break it into smaller databases
  - Across many clusters
  - Using a key value
- Instead of having one million customers' information on a single big machine, place customers on smaller and different machines

46 SHARDING CRITERIA
Partitioning
- Relational: handled by the DBMS (homogeneous DBMS)
- NoSQL: based on ranges of the key value
Federation
- Relational
  - Combine tables stored in different physical databases
  - Easier with denormalized data
- NoSQL
  - Store together data that are accessed together
  - Aggregates are the unit of distribution

47 SHARDING
Architecture
- Each application server (AS) runs a DBS client
- Each shard server runs
  - a database server
  - replication agents and query agents for supporting parallel query functionality
Process
- Pick a dimension that helps sharding easily (customers, countries, addresses)
- Pick strategies that will last a long time, as repartitioning/re-sharding of data is operationally difficult
- This is done according to two different principles
  - Partitioning: a partition is a structure that divides a space into two parts
  - Federation: a set of things that together compose a centralized unit but each individually maintains some aspect of autonomy
- Customer data is partitioned by ID in shards, using an algorithm d to determine which shard a customer ID belongs to
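
One way such an algorithm could look is sketched below: a stable hash of the customer ID modulo the number of shards. The function and class names are hypothetical, and hashing with `zlib.crc32` is an assumption; the slides do not say which function is used.

```python
import zlib

def shard_for_customer(customer_id, num_shards):
    """Map a customer ID to a shard index with a stable hash."""
    return zlib.crc32(str(customer_id).encode("utf-8")) % num_shards

class ShardedCustomerStore:
    """Toy client-side router: each shard is a dictionary standing in for a database server."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, customer_id, record):
        index = shard_for_customer(customer_id, len(self.shards))
        self.shards[index][customer_id] = record

    def get(self, customer_id):
        index = shard_for_customer(customer_id, len(self.shards))
        return self.shards[index].get(customer_id)

customers = ShardedCustomerStore(num_shards=4)
customers.put(42, {"name": "Ada"})
record = customers.get(42)
```

With this scheme, changing `num_shards` moves most customers to a different shard, which illustrates why the slide recommends picking a strategy that will last a long time.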

49 PARTITIONING
A PARTITION IS A STRUCTURE THAT DIVIDES A SPACE INTO TWO PARTS

50 BACKGROUND: DISTRIBUTED RELATIONAL DATABASES
- External schemas (views) are often subsets of relations (contacts in Europe and America)
- Access is defined on subsets of relations: 80% of the queries issued in a region have to do with contacts of that region
- Partitioning relations
  - Better concurrency level
  - Fragments are accessed independently
- Implications
  - Check integrity constraints
  - Rebuild relations

51 FRAGMENTATION
Horizontal
- Groups of tuples of the same relation
- Example: Budget < or >=
Vertical
- Groups of attributes of the same relation
- Fragments that are not disjoint are more difficult to manage
- Example: separate budget from loc and pname of the relation project
Hybrid

52 FRAGMENTATION: RULES
Vertical
- Clustering: grouping elementary fragments
  - Example: budget and location information in two relations
- Splitting: decomposing a relation according to affinity relationships among attributes
Horizontal
- Tuples of the same fragment must be statistically homogeneous
  - If t1 and t2 are tuples of the same fragment, then t1 and t2 have the same probability of being selected by a query
- Keep important conditions
  - Complete: every tuple (attribute) belongs to a fragment (without information loss)
    - If tuples where budget >= are more likely to be selected, then it is a good candidate
  - Minimum: if no application distinguishes between budget >= and budget <, then these conditions are unnecessary

53 SHARDING: HORIZONTAL PARTITIONING
- The entities of a database are split into two or more sets (by row)
- In relational systems: same schema, several physical bases/servers
  - Example: partition contacts into Europe and America shards, where their zip code indicates where they will be found
- Efficient if there exists some robust and implicit way to identify in which partition to find a particular entity
- Last resort shard
- Needs a sharding function: modulo, round robin, hash partition, range partition
[Figure: a load balancer in front of web servers (Web 1 to Web 3) and caches (Cache 1 to Cache 3), backed by two MySQL masters, one for odd IDs and one for even IDs, each replicated to MySQL slaves 1 to n.]
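
Two of the sharding functions named above, modulo and range partition, can be sketched as follows. The function names and the boundary values are made up for the example.

```python
import bisect

def modulo_shard(entity_id, num_shards):
    """Modulo partitioning: with two shards, even IDs go to shard 0 and odd IDs to shard 1."""
    return entity_id % num_shards

def range_shard(key, upper_bounds):
    """Range partitioning: upper_bounds is a sorted list of exclusive upper bounds.

    Shard i holds the keys below upper_bounds[i]; the last shard holds everything else.
    """
    return bisect.bisect_right(upper_bounds, key)

even_or_odd = [modulo_shard(entity_id, 2) for entity_id in (10, 7)]
by_range = [range_shard(key, [1000, 2000]) for key in (999, 1000, 2500)]
```

Modulo spreads consecutive IDs evenly across shards, while range partitioning keeps neighbouring keys on the same shard, so a range query only has to visit the shards whose intervals overlap the range.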

54 FEDERATION
A FEDERATION IS A SET OF THINGS THAT TOGETHER COMPOSE A CENTRALIZED UNIT BUT EACH INDIVIDUALLY MAINTAINS SOME ASPECT OF AUTONOMY

55 FEDERATION: VERTICAL SHARDING
Principle
- Partition data according to their logical affiliation
- Put together data that are commonly accessed
- The search load for the large partitioned entity can be split across multiple servers (logical and physical), and not only according to multiple indexes in the same logical server
- Different schemas, systems, and physical bases/servers
- Shards the components of a site and not only data
[Figure: a load balancer in front of web servers (Web 1 to Web 3) and caches (Cache 1 to Cache 3); an internal user; a MySQL master with one slave for the resume database and a MySQL master with slaves 1 to n for the site database.]

56 NOSQL STORES: PERSISTENCY MANAGEMENT

57 «MEMCACHED»
«memcached» is a memory management protocol based on a cache:
- Uses the key-value notion
- Information is completely stored in RAM
«memcached» protocol for:
- Creating, retrieving, updating, and deleting information from the database
- Applications with their own «memcached» manager (Google, Facebook, YouTube, FarmVille, Twitter, Wikipedia)

58 STORAGE ON DISC (1)
- For efficiency reasons, information is stored in RAM: working information is kept in RAM in order to answer requests with low latency
- Yet this is not always possible or desirable
- The process of moving data from RAM to disc is called "eviction"; this process is configured automatically for every bucket

59 STORAGE ON DISC (2)
NoSQL servers support the storage of key-value pairs on disc:
- Persistency: data can be loaded, and the server closed and reinitialized, without having to load the data from another source
- Hot backups: loaded data are stored on disc so that the server can be reinitialized in case of failure
- Storage on disc: the disc is used when the quantity of data is larger than the physical size of the RAM; frequently used information is kept in RAM and the rest is stored on disc

60 STORAGE ON DISC (3)
Strategies for ensuring:
- Each node maintains in RAM information on the key-value pairs it stores. Keys may not be found, or they can be stored in memory or on disc
- The process of moving information from RAM to disc is asynchronous:
  - The server can continue processing new requests
  - A queue manages requests to disc
- In periods with a lot of write requests, clients can be notified that the server is temporarily out of memory until information is evicted

61 NOSQL STORES: CONCURRENCY CONTROL

62 MULTI-VERSION CONCURRENCY CONTROL (MVCC)
Objective: provide concurrent access to the database (and, in programming languages, implement transactional memory)
Problem: if someone is reading from a database at the same time as someone else is writing to it, the reader could see a half-written or inconsistent piece of data
- Lock: readers wait until the writer is done
- MVCC:
  - Each user connected to the database sees a snapshot of the database at a particular instant in time
  - Any changes made by a writer will not be seen by other users until the changes have been completed (until the transaction has been committed)
  - When an MVCC database needs to update an item of data, it marks the old data as obsolete and adds the newer version elsewhere -> multiple versions are stored, but only one is the latest
  - Writes can be isolated by virtue of the old versions being maintained
  - Requires (generally) the system to periodically sweep through and delete the old, obsolete data objects
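
The MVCC behaviour described above can be sketched with a toy versioned store. `MVCCStore`, its logical clock and the `vacuum` method are illustrative assumptions; real systems add conflict detection between concurrent writers, which is left out here.

```python
class MVCCStore:
    """Toy multi-version store: every commit adds new versions instead of overwriting."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_timestamp, value), oldest first
        self.clock = 0      # logical clock, incremented at every commit

    def begin(self):
        """Start a reader: its snapshot is the state as of the current clock value."""
        return self.clock

    def commit(self, writes):
        """Commit a set of writes atomically under one new timestamp."""
        self.clock += 1
        for key, value in writes.items():
            self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def read(self, key, snapshot):
        """Return the newest version that was committed at or before the snapshot."""
        for timestamp, value in reversed(self.versions.get(key, [])):
            if timestamp <= snapshot:
                return value
        return None

    def vacuum(self, oldest_active_snapshot):
        """Sweep: drop the versions that no active snapshot can see any more."""
        for key, versions in self.versions.items():
            visible = [v for v in versions if v[0] <= oldest_active_snapshot]
            keep_from = visible[-1][0] if visible else 0
            self.versions[key] = [v for v in versions if v[0] >= keep_from]

store = MVCCStore()
store.commit({"balance": 100})
reader_snapshot = store.begin()                    # the reader starts here
store.commit({"balance": 80})                      # a writer commits later
seen_by_reader = store.read("balance", reader_snapshot)
seen_by_new_reader = store.read("balance", store.begin())
```

The reader is never blocked by the writer: it keeps seeing the version of its own snapshot until it starts a new one, and the obsolete version can be swept away once no snapshot refers to it.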


More information

Introduction to MapReduce

Introduction to MapReduce 732A54 Big Data Analytics Introduction to MapReduce Christoph Kessler IDA, Linköping University Towards Parallel Processing of Big-Data Big Data too large to be read+processed in reasonable time by 1 server

More information

CLOUD-SCALE FILE SYSTEMS

CLOUD-SCALE FILE SYSTEMS Data Management in the Cloud CLOUD-SCALE FILE SYSTEMS 92 Google File System (GFS) Designing a file system for the Cloud design assumptions design choices Architecture GFS Master GFS Chunkservers GFS Clients

More information

CSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores

CSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores CSE 444: Database Internals Lectures 26 NoSQL: Extensible Record Stores CSE 444 - Spring 2014 1 References Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol. 39, No. 4)

More information

BigTable. CSE-291 (Cloud Computing) Fall 2016

BigTable. CSE-291 (Cloud Computing) Fall 2016 BigTable CSE-291 (Cloud Computing) Fall 2016 Data Model Sparse, distributed persistent, multi-dimensional sorted map Indexed by a row key, column key, and timestamp Values are uninterpreted arrays of bytes

More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS

4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS W13.A.0.0 CS435 Introduction to Big Data W13.A.1 FAQs Programming Assignment 3 has been posted PART 2. LARGE SCALE DATA STORAGE SYSTEMS DISTRIBUTED FILE SYSTEMS Recitations Apache Spark tutorial 1 and

More information

CS November 2017

CS November 2017 Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

NoSQL systems: sharding, replication and consistency. Riccardo Torlone Università Roma Tre

NoSQL systems: sharding, replication and consistency. Riccardo Torlone Università Roma Tre NoSQL systems: sharding, replication and consistency Riccardo Torlone Università Roma Tre Data distribution NoSQL systems: data distributed over large clusters Aggregate is a natural unit to use for data

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

HADOOP FRAMEWORK FOR BIG DATA

HADOOP FRAMEWORK FOR BIG DATA HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 10: Mutable State (1/2) March 15, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April

More information

A BigData Tour HDFS, Ceph and MapReduce

A BigData Tour HDFS, Ceph and MapReduce A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!

More information

Distributed Computation Models

Distributed Computation Models Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case

More information

Bigtable. Presenter: Yijun Hou, Yixiao Peng

Bigtable. Presenter: Yijun Hou, Yixiao Peng Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 06 Presenter: Yijun Hou, Yixiao Peng

More information

Chapter 5. The MapReduce Programming Model and Implementation

Chapter 5. The MapReduce Programming Model and Implementation Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing

More information

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems

NoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems CompSci 516 Data Intensive Computing Systems Lecture 21 (optional) NoSQL systems Instructor: Sudeepa Roy Duke CS, Spring 2016 CompSci 516: Data Intensive Computing Systems 1 Key- Value Stores Duke CS,

More information

Database Architectures

Database Architectures Database Architectures CPS352: Database Systems Simon Miner Gordon College Last Revised: 11/15/12 Agenda Check-in Centralized and Client-Server Models Parallelism Distributed Databases Homework 6 Check-in

More information

References. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals

References. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals References CSE 444: Database Internals Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol 39, No 4) Lectures 26 NoSQL: Extensible Record Stores Bigtable: A Distributed

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 10: Mutable State (1/2) March 14, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These

More information

Datacenter replication solution with quasardb

Datacenter replication solution with quasardb Datacenter replication solution with quasardb Technical positioning paper April 2017 Release v1.3 www.quasardb.net Contact: sales@quasardb.net Quasardb A datacenter survival guide quasardb INTRODUCTION

More information

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao

Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI 2006 Presented by Xiang Gao 2014-11-05 Outline Motivation Data Model APIs Building Blocks Implementation Refinement

More information

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423

More information

CompSci 516 Database Systems

CompSci 516 Database Systems CompSci 516 Database Systems Lecture 20 NoSQL and Column Store Instructor: Sudeepa Roy Duke CS, Fall 2018 CompSci 516: Database Systems 1 Reading Material NOSQL: Scalable SQL and NoSQL Data Stores Rick

More information

CA485 Ray Walshe Google File System

CA485 Ray Walshe Google File System Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

More information

NoSQL Databases. Amir H. Payberah. Swedish Institute of Computer Science. April 10, 2014

NoSQL Databases. Amir H. Payberah. Swedish Institute of Computer Science. April 10, 2014 NoSQL Databases Amir H. Payberah Swedish Institute of Computer Science amir@sics.se April 10, 2014 Amir H. Payberah (SICS) NoSQL Databases April 10, 2014 1 / 67 Database and Database Management System

More information

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.

18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E. 18-hdfs-gfs.txt Thu Oct 27 10:05:07 2011 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2011 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File

More information

Programming Models MapReduce

Programming Models MapReduce Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

Data Partitioning and MapReduce

Data Partitioning and MapReduce Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data

More information

Bigtable: A Distributed Storage System for Structured Data. Andrew Hon, Phyllis Lau, Justin Ng

Bigtable: A Distributed Storage System for Structured Data. Andrew Hon, Phyllis Lau, Justin Ng Bigtable: A Distributed Storage System for Structured Data Andrew Hon, Phyllis Lau, Justin Ng What is Bigtable? - A storage system for managing structured data - Used in 60+ Google services - Motivation:

More information

MapReduce. U of Toronto, 2014

MapReduce. U of Toronto, 2014 MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in

More information

CS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs

CS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs 11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.0.0 CS435 Introduction to Big Data 11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.1 FAQs Deadline of the Programming Assignment 3

More information

A Glimpse of the Hadoop Echosystem

A Glimpse of the Hadoop Echosystem A Glimpse of the Hadoop Echosystem 1 Hadoop Echosystem A cluster is shared among several users in an organization Different services HDFS and MapReduce provide the lower layers of the infrastructures Other

More information

Big Data Analytics. Rasoul Karimi

Big Data Analytics. Rasoul Karimi Big Data Analytics Rasoul Karimi Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 1 Outline

More information

CS November 2018

CS November 2018 Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account

More information

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma

Apache Hadoop Goes Realtime at Facebook. Himanshu Sharma Apache Hadoop Goes Realtime at Facebook Guide - Dr. Sunny S. Chung Presented By- Anand K Singh Himanshu Sharma Index Problem with Current Stack Apache Hadoop and Hbase Zookeeper Applications of HBase at

More information

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data Introduction to Hadoop High Availability Scaling Advantages and Challenges Introduction to Big Data What is Big data Big Data opportunities Big Data Challenges Characteristics of Big data Introduction

More information

BigData and Map Reduce VITMAC03

BigData and Map Reduce VITMAC03 BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to

More information

Introduction Aggregate data model Distribution Models Consistency Map-Reduce Types of NoSQL Databases

Introduction Aggregate data model Distribution Models Consistency Map-Reduce Types of NoSQL Databases Introduction Aggregate data model Distribution Models Consistency Map-Reduce Types of NoSQL Databases Key-Value Document Column Family Graph John Edgar 2 Relational databases are the prevalent solution

More information

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568 FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 11. Advanced Aspects of Big Data Management Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/

More information

NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY

NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY WHAT IS NOSQL? Stands for No-SQL or Not Only SQL. Class of non-relational data storage systems E.g.

More information

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung

HDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per

More information

GFS: The Google File System. Dr. Yingwu Zhu

GFS: The Google File System. Dr. Yingwu Zhu GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung SOSP 2003 presented by Kun Suo Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in

More information

Hadoop An Overview. - Socrates CCDH

Hadoop An Overview. - Socrates CCDH Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected

More information

Ghislain Fourny. Big Data 5. Wide column stores

Ghislain Fourny. Big Data 5. Wide column stores Ghislain Fourny Big Data 5. Wide column stores Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage 2 Where we are User interfaces

More information

Introduction to NoSQL Databases

Introduction to NoSQL Databases Introduction to NoSQL Databases Roman Kern KTI, TU Graz 2017-10-16 Roman Kern (KTI, TU Graz) Dbase2 2017-10-16 1 / 31 Introduction Intro Why NoSQL? Roman Kern (KTI, TU Graz) Dbase2 2017-10-16 2 / 31 Introduction

More information

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017 Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2017 1 Google Chubby ( Apache Zookeeper) 2 Chubby Distributed lock service + simple fault-tolerant file system

More information

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2016 1 Google Chubby 2 Chubby Distributed lock service + simple fault-tolerant file system Interfaces File access

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 14 NoSQL

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 14 NoSQL CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2015 Lecture 14 NoSQL References Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol. 39, No.

More information

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation Voldemort Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/29 Outline 1 2 3 Smruti R. Sarangi Leader Election 2/29 Data

More information

Distributed Systems. GFS / HDFS / Spanner

Distributed Systems. GFS / HDFS / Spanner 15-440 Distributed Systems GFS / HDFS / Spanner Agenda Google File System (GFS) Hadoop Distributed File System (HDFS) Distributed File Systems Replication Spanner Distributed Database System Paxos Replication

More information

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing ΕΠΛ 602:Foundations of Internet Technologies Cloud Computing 1 Outline Bigtable(data component of cloud) Web search basedonch13of thewebdatabook 2 What is Cloud Computing? ACloudis an infrastructure, transparent

More information

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13 Bigtable A Distributed Storage System for Structured Data Presenter: Yunming Zhang Conglong Li References SOCC 2010 Key Note Slides Jeff Dean Google Introduction to Distributed Computing, Winter 2008 University

More information

Advanced Database Technologies NoSQL: Not only SQL

Advanced Database Technologies NoSQL: Not only SQL Advanced Database Technologies NoSQL: Not only SQL Christian Grün Database & Information Systems Group NoSQL Introduction 30, 40 years history of well-established database technology all in vain? Not at

More information

Introduction to Distributed Data Systems

Introduction to Distributed Data Systems Introduction to Distributed Data Systems Serge Abiteboul Ioana Manolescu Philippe Rigaux Marie-Christine Rousset Pierre Senellart Web Data Management and Distribution http://webdam.inria.fr/textbook January

More information

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent Tanton Jeppson CS 401R Lab 3 Cassandra, MongoDB, and HBase Introduction For my report I have chosen to take a deeper look at 3 NoSQL database systems: Cassandra, MongoDB, and HBase. I have chosen these

More information

CS /29/18. Paul Krzyzanowski 1. Question 1 (Bigtable) Distributed Systems 2018 Pre-exam 3 review Selected questions from past exams

CS /29/18. Paul Krzyzanowski 1. Question 1 (Bigtable) Distributed Systems 2018 Pre-exam 3 review Selected questions from past exams Question 1 (Bigtable) What is an SSTable in Bigtable? Distributed Systems 2018 Pre-exam 3 review Selected questions from past exams It is the internal file format used to store Bigtable data. It maps keys

More information

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017

HDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 HDFS Architecture Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 Based Upon: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoopproject-dist/hadoop-hdfs/hdfsdesign.html Assumptions At scale, hardware

More information

COSC 416 NoSQL Databases. NoSQL Databases Overview. Dr. Ramon Lawrence University of British Columbia Okanagan

COSC 416 NoSQL Databases. NoSQL Databases Overview. Dr. Ramon Lawrence University of British Columbia Okanagan COSC 416 NoSQL Databases NoSQL Databases Overview Dr. Ramon Lawrence University of British Columbia Okanagan ramon.lawrence@ubc.ca Databases Brought Back to Life!!! Image copyright: www.dragoart.com Image

More information

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment.

CS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment. Distributed Systems 15. Distributed File Systems Google ( Apache Zookeeper) Paul Krzyzanowski Rutgers University Fall 2017 1 2 Distributed lock service + simple fault-tolerant file system Deployment Client

More information

Big Table. Google s Storage Choice for Structured Data. Presented by Group E - Dawei Yang - Grace Ramamoorthy - Patrick O Sullivan - Rohan Singla

Big Table. Google s Storage Choice for Structured Data. Presented by Group E - Dawei Yang - Grace Ramamoorthy - Patrick O Sullivan - Rohan Singla Big Table Google s Storage Choice for Structured Data Presented by Group E - Dawei Yang - Grace Ramamoorthy - Patrick O Sullivan - Rohan Singla Bigtable: Introduction Resembles a database. Does not support

More information

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23 Final Exam Review 2 Kathleen Durant CS 3200 Northeastern University Lecture 23 QUERY EVALUATION PLAN Representation of a SQL Command SELECT {DISTINCT} FROM {WHERE

More information

Distributed Systems Pre-exam 3 review Selected questions from past exams. David Domingo Paul Krzyzanowski Rutgers University Fall 2018

Distributed Systems Pre-exam 3 review Selected questions from past exams. David Domingo Paul Krzyzanowski Rutgers University Fall 2018 Distributed Systems 2018 Pre-exam 3 review Selected questions from past exams David Domingo Paul Krzyzanowski Rutgers University Fall 2018 November 28, 2018 1 Question 1 (Bigtable) What is an SSTable in

More information

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

Ghislain Fourny. Big Data 5. Column stores

Ghislain Fourny. Big Data 5. Column stores Ghislain Fourny Big Data 5. Column stores 1 Introduction 2 Relational model 3 Relational model Schema 4 Issues with relational databases (RDBMS) Small scale Single machine 5 Can we fix a RDBMS? Scale up

More information

Architekturen für die Cloud

Architekturen für die Cloud Architekturen für die Cloud Eberhard Wolff Architecture & Technology Manager adesso AG 08.06.11 What is Cloud? National Institute for Standards and Technology (NIST) Definition On-demand self-service >

More information

NoSQL Databases. CPS352: Database Systems. Simon Miner Gordon College Last Revised: 4/22/15

NoSQL Databases. CPS352: Database Systems. Simon Miner Gordon College Last Revised: 4/22/15 NoSQL Databases CPS352: Database Systems Simon Miner Gordon College Last Revised: 4/22/15 Agenda Check-in NoSQL Databases Aggregate databases Key-value, document, and column family Graph databases Related

More information

DIVING IN: INSIDE THE DATA CENTER

DIVING IN: INSIDE THE DATA CENTER 1 DIVING IN: INSIDE THE DATA CENTER Anwar Alhenshiri Data centers 2 Once traffic reaches a data center it tunnels in First passes through a filter that blocks attacks Next, a router that directs it to

More information

A Fast and High Throughput SQL Query System for Big Data

A Fast and High Throughput SQL Query System for Big Data A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190

More information

Introduction to BigData, Hadoop:-

Introduction to BigData, Hadoop:- Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,

More information

Oral Questions and Answers (DBMS LAB) Questions & Answers- DBMS

Oral Questions and Answers (DBMS LAB) Questions & Answers- DBMS Questions & Answers- DBMS https://career.guru99.com/top-50-database-interview-questions/ 1) Define Database. A prearranged collection of figures known as data is called database. 2) What is DBMS? Database

More information

The Google File System

The Google File System The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system

More information