QUERYING (BIG) DATA ON NOSQL STORES
1 QUERYING (BIG) DATA ON NOSQL STORES GENOVEVA VARGAS SOLAR FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE
2 DATA PROCESSING Process the data to produce other data: analysis tools, business intelligence tools, ... This means: handle large volumes of data; manage thousands of processors; parallelize and distribute treatments; schedule I/O; manage fault tolerance; monitor/control processes. Map-Reduce makes all this easy!
3 NOSQL QUERY EXECUTION 1 Data organization: DISTRIBUTED FILE SYSTEM, INDEXING. 2 Query programming: MAP-REDUCE PROGRAMMING PATTERNS (Map, Reduce). 3 Query execution: EXECUTION MODEL, DATA PROCESSING PROPERTIES; optimizing query processing: DATA COMPRESSION, LOCATION, R/W STRATEGIES, CACHE/MEMORY ORGANIZATION.
4 DATA ORGANIZATION: THE NOSQL CASE
5 NOSQL STORES: DATA MANAGEMENT PROPERTIES Indexing: distributed hashing as in Memcached (an open-source cache); in-memory indexes are scalable when distributing and replicating objects over multiple nodes; partitioned tables. High availability and scalability through eventual consistency: data fetched are not guaranteed to be up-to-date, but updates are guaranteed to be propagated to all nodes eventually. Shared-nothing horizontal scaling: replicating and partitioning data over many servers supports a large number of simple read/write operations per second (OLTP). No ACID guarantees: updates are eventually propagated, with limited guarantees on read consistency; BASE: basically available, soft state, eventually consistent. Multi-version concurrency control.
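The "distributed hashing like Memcached" mentioned above is often realized as consistent hashing, which keeps key placement stable when nodes join or leave. Here is a minimal sketch; the class and node names are illustrative, not any particular store's API:

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Minimal consistent-hash ring: a key maps to the nearest node
    clockwise on the ring, so adding or removing a node only remaps
    the keys that fall in that node's arc."""
    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node); vnodes smooths the load
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # first ring position at or after the key's hash, wrapping around
        idx = bisect_right(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")   # always the same node for this key
```

The same routing function runs on every client, so no central directory is needed to locate a key.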
6 PROBLEM STATEMENT: HOW MUCH TO GIVE UP? Fault-tolerant partitioning, availability, consistency. CAP theorem 1 : a system can have only two of the three properties. NoSQL systems sacrifice consistency. 1 Eric Brewer, "Towards Robust Distributed Systems", PODC 2000 keynote
7 NOSQL STORES: AVAILABILITY AND PERFORMANCE Replication: copy data across multiple servers, so each bit of data can be found on several of them; increases data availability and allows faster query evaluation. Sharding: distribute different data across multiple servers, so each server acts as the single source of a data subset. The two techniques are orthogonal.
8 REPLICATION: PROS & CONS Data is more available Failure of a site containing E does not result in unavailability of E if replicas exist Performance Parallelism: queries processed in parallel on several nodes Reduce data transfer for local data Increased updates cost Synchronisation: each replica must be updated Increased complexity of concurrency control Concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented 8
9 SHARDING: WHY IS IT USEFUL? Scaling applications by reducing the data set held in any single database; segregating data; sharing application data; securing sensitive data by isolating it. Improve read and write performance: a smaller amount of data in each user group implies faster querying; isolating data into smaller shards means accessed data is more likely to stay in cache; more write bandwidth, since writing can be done in parallel. Smaller data sets are easier to back up, restore and manage. Massive work done in parallel: scale out across more nodes; a parallel backend handles higher user loads; share nothing, so very few bottlenecks. Resilience and availability: if a box goes down the others still operate, but part of the data is missing.
10 SHARDING AND REPLICATION Sharding with no replication: unique copy, distributed data sets (+) Better concurrency levels (shards are accessed independently) (-) Cost of checking constraints, rebuilding aggregates Ensure that queries and updates are distributed across shards Replication of shards (+) Query performance (availability) (-) Cost of updating, of checking constraints, complexity of concurrency control Partial replication (most of the times) Only some shards are duplicated 10
11 QUERY PROGRAMMING DATA PROCESSING USING MAP-REDUCE 11
12 MAP-REDUCE A programming model for expressing distributed computations on massive amounts of data, and an execution framework for large-scale data processing on clusters of commodity servers. Market: any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing content must tackle large-data problems; data-intensive processing is beyond the capability of any individual machine and requires clusters; large-data problems are fundamentally about organizing computations on dozens, hundreds, or even thousands of machines. «Data represent the rising tide that lifts all boats: more data lead to better algorithms and systems for solving real-world problems»
13 COUNTING WORDS (URI, document) → (term, count) Map: "see bob throw", "see spot run" → see 1, bob 1, throw 1, see 1, spot 1, run 1. Shuffle/Sort: bob <1>, run <1>, see <1,1>, spot <1>, throw <1>. Reduce: bob 1, run 1, see 2, spot 1, throw 1.
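The word-count flow on this slide can be simulated end to end in a few lines; this is a single-process sketch of the three stages, not the Hadoop API:

```python
from collections import defaultdict

def map_phase(documents):
    # (URI, document) -> stream of (term, 1) intermediate pairs
    for uri, text in documents:
        for term in text.split():
            yield (term, 1)

def shuffle_sort(pairs):
    # group values by key and sort, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    # (term, [1, 1, ...]) -> (term, count)
    return [(term, sum(counts)) for term, counts in groups]

docs = [("doc1", "see bob throw"), ("doc2", "see spot run")]
result = reduce_phase(shuffle_sort(map_phase(docs)))
# result == [('bob', 1), ('run', 1), ('see', 2), ('spot', 1), ('throw', 1)]
```

The output matches the slide: "see" appears twice, every other term once.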
14 MAP-REDUCE DESIGN PATTERNS
SUMMARIZATION: Numerical summarization (minimum, maximum, count, average, median, standard deviation); Inverted index (e.g. a Wikipedia inverted index); Counting with counters (count the number of records, a small number of unique instances, summations; e.g. number of users per state).
FILTERING: Filtering (closer view of data, tracking event threads, distributed grep, data cleansing, simple random sampling, removing low-scoring data); Bloom filtering (remove most non-watched values, prefiltering data for a set-membership check; e.g. hot list, HBase query); Top ten (outlier analysis, selecting interesting data, catchy dashboards; e.g. top ten users by reputation); Distinct (deduplicate data, getting distinct values, protecting from inner-join explosion; e.g. distinct user ids).
DATA ORGANIZATION: Structured to hierarchical (prejoining data, preparing data for HBase or MongoDB; e.g. post/comment building for StackOverflow, question/answer building); Partitioning (e.g. partitioning users by last access date); Binning (e.g. binning by Hadoop-related tags); Total order sorting (e.g. sorting users by last visit); Shuffling (e.g. anonymizing StackOverflow comments).
JOIN: Reduce-side join (multiple large data sets joined by foreign key; e.g. user-comment join); Reduce-side join with Bloom filter (e.g. reputable-user comment join); Replicated join (e.g. replicated user-comment join); Composite join (e.g. composite user-comment join); Cartesian product (e.g. comment comparison).
15 ELEMENTS TO THINK ABOUT EFFICIENT EXECUTION 15
16 HADOOP INFRASTRUCTURE 16
17 HADOOP FRAMEWORK Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters HBase: A scalable, distributed database that supports structured data storage for large tables Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying Chukwa: A data collection system for managing large distributed systems Pig: A high-level data-flow language and execution framework for parallel computation ZooKeeper: A high-performance coordination service for distributed applications 17
18 DISTRIBUTED FILE SYSTEM Abandons the separation of computation and storage as distinct components in a cluster. The Google File System (GFS) supports Google's proprietary implementation of MapReduce; in the open-source world, HDFS (Hadoop Distributed File System) is an open-source implementation of GFS that supports Hadoop. The main idea is to divide user data into blocks and replicate those blocks across the local disks of nodes in the cluster. Adopts a master-slave architecture: the master (the HDFS namenode) maintains the file namespace (metadata, directory structure, file-to-block mapping, location of blocks, and access permissions); the slaves (HDFS datanodes) manage the actual data blocks.
19 HDFS GENERAL ARCHITECTURE An application client wishing to read a file (or a portion thereof) must first contact the namenode to determine where the actual data is stored. The namenode returns the relevant block id and the location where the block is held (i.e., which datanode). The client then contacts the datanode to retrieve the data. HDFS lies on top of the standard OS stack (e.g., Linux): blocks are stored on standard single-machine file systems.
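The read flow above can be simulated in a few lines. The NameNode and DataNode classes here are toy stand-ins (made-up names and methods, not the Hadoop API), meant only to show that the namenode holds metadata while the bytes flow directly from the datanodes:

```python
class NameNode:
    """Toy namenode: holds only metadata, no file contents."""
    def __init__(self):
        self.block_map = {}   # filename -> [(block_id, datanode_name), ...]

    def add_file(self, name, blocks):
        self.block_map[name] = blocks

    def locate(self, name):
        return self.block_map[name]

class DataNode:
    """Toy datanode: holds the actual block bytes."""
    def __init__(self):
        self.blocks = {}

def read_file(namenode, datanodes, filename):
    # 1. ask the namenode where the blocks live
    # 2. fetch each block directly from the datanode that stores it
    data = b""
    for block_id, dn_name in namenode.locate(filename):
        data += datanodes[dn_name].blocks[block_id]
    return data

nn = NameNode()
dns = {"dn1": DataNode(), "dn2": DataNode()}
dns["dn1"].blocks["b1"] = b"hello "
dns["dn2"].blocks["b2"] = b"world"
nn.add_file("/logs/a.txt", [("b1", "dn1"), ("b2", "dn2")])
content = read_file(nn, dns, "/logs/a.txt")   # b"hello world"
```

Note that the namenode never touches block contents; this is why a single namenode can serve a very large cluster.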
20 HDFS PROPERTIES HDFS stores three separate copies of each data block to ensure reliability, availability, and performance. In large clusters, the three replicas are spread across different physical racks, so HDFS is resilient to two common failure scenarios: individual datanode crashes, and failures in networking equipment that bring an entire rack offline. Replicating blocks across physical machines also increases opportunities to co-locate data and processing in the scheduling of MapReduce jobs, since multiple copies yield more opportunities to exploit locality. To create a new file and write data to HDFS: the application client first contacts the namenode; the namenode updates the file namespace after checking permissions and making sure the file doesn't already exist, then allocates a new block on a suitable datanode; the application is directed to stream data directly to it; from the initial datanode, data is further propagated to additional replicas.
21 HADOOP CLUSTER ARCHITECTURE The HDFS namenode runs the namenode daemon The job submission node runs the jobtracker, which is the single point of contact for a client wishing to execute a MapReduce job The jobtracker Monitors the progress of running MapReduce jobs Is responsible for coordinating the execution of the mappers and reducers Tries to take advantage of data locality in scheduling map tasks 21
22 MAP-REDUCE PHASES Initialisation. Map: record reader, mapper, combiner, and partitioner. Reduce: shuffle, sort, reducer, and output format. Partition the input (key, value) pairs into chunks and run map() tasks in parallel. After all map() tasks have completed, consolidate the values for each unique emitted key. Partition the space of output map keys, and run reduce() in parallel.
23 MAP SUB-PHASES Record reader: translates an input split generated by the input format into records; it parses the data into records, but does not parse the records themselves. It passes the data to the mapper in the form of a key/value pair; usually the key in this context is positional information and the value is the chunk of data that composes a record. Map: user-provided code is executed on each key/value pair from the record reader to produce zero or more new key/value pairs, called the intermediate pairs; the key is what the data will be grouped on and the value is the information pertinent to the analysis in the reducer. Combiner: an optional localized reducer that can group data in the map phase; it takes the intermediate keys from the mapper and applies a user-provided method to aggregate values in the small scope of that one mapper. Partitioner: takes the intermediate key/value pairs from the mapper (or combiner) and splits them up into shards, one shard per reducer.
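The partitioner at the end of the map phase is typically just a hash of the key modulo the number of reducers, which guarantees that all values for one key reach the same reducer. A minimal sketch (using a stable CRC32 hash rather than any framework's built-in partitioner):

```python
import zlib

def default_partitioner(key, num_reducers):
    # a stable hash keeps every occurrence of a key on the same reducer
    return zlib.crc32(key.encode()) % num_reducers

pairs = [("see", 1), ("bob", 1), ("see", 1), ("spot", 1)]
num_reducers = 2
shards = {r: [] for r in range(num_reducers)}
for key, value in pairs:
    shards[default_partitioner(key, num_reducers)].append((key, value))
# both ("see", 1) pairs land in the same shard
```

Python's built-in hash() is deliberately avoided here because it is randomized between runs, which would send the same key to different reducers on different executions.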
24 REDUCE SUB-PHASES Shuffle and sort: takes the output files written by all of the partitioners and downloads them to the local machine on which the reducer is running; these individual data pieces are then sorted by key into one larger data list. The purpose of this sort is to group equivalent keys together so that their values can be iterated over easily in the reduce task. Reduce: takes the grouped data as input and runs a reduce function once per key grouping; the function is passed the key and an iterator over all of the values associated with that key. Once the reduce function is done, it sends zero or more key/value pairs to the final step, the output format. Output format: translates the final key/value pair from the reduce function and writes it out to a file by a record writer.
25 CASE STUDY 25
26 EXTENSIBLE RECORD STORES Basic data model is rows and columns; basic scalability model is splitting rows and columns over multiple nodes. Rows are split across nodes through sharding on the primary key, split by range rather than by a hash function. Rows are analogous to documents: a variable number of attributes, attribute names must be unique; rows are grouped into collections (tables); queries on ranges of values do not go to every node. Columns are distributed over multiple nodes using column groups (which columns are best stored together); column groups must be pre-defined with extensible record stores. SYSTEMS: HBase (hbase.apache.org), HyperTable (hypertable.org), Cassandra (incubator.apache.org/cassandra).
27 EXTENSIBLE RECORD DATA MODEL (HBASE EXAMPLE) Most basic unit: the column. Each column may have multiple versions, with each distinct value contained in a separate cell. One or more columns form a row, addressed uniquely by a row key, and a number of rows form a table. Example: table T1 holds rows R-1 ... R-n; row R-1 has family F-1 with columns C1 and C2, and family F-2 with column C3; each cell holds Version 1, Version 2, ...
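The table / row key / family:column / versioned-cell model above can be captured with nested dictionaries. This is a toy in-memory sketch of the data model, not the HBase client API:

```python
import time

class Table:
    """Toy extensible-record model: table -> row key -> "family:column"
    -> list of (timestamp, value) versions, newest first."""
    def __init__(self, name):
        self.name = name
        self.rows = {}

    def put(self, row_key, family, column, value, ts=None):
        cell = (self.rows.setdefault(row_key, {})
                         .setdefault(f"{family}:{column}", []))
        # newest version goes to the front of the cell's version list
        cell.insert(0, (ts if ts is not None else time.time(), value))

    def get(self, row_key, family, column, version=0):
        # version 0 is the most recent cell value
        return self.rows[row_key][f"{family}:{column}"][version][1]

t1 = Table("T1")
t1.put("R-1", "F-1", "C1", "v1", ts=1)
t1.put("R-1", "F-1", "C1", "v2", ts=2)
latest = t1.get("R-1", "F-1", "C1")        # "v2"
previous = t1.get("R-1", "F-1", "C1", 1)   # "v1"
```

Rows with different column sets coexist in the same table, which is exactly the "variable number of attributes" property of the previous slide.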
28 DATA ORGANIZATION 28
29 REFINEMENTS: LOCALITY GROUPS Multiple column families can be grouped into a locality group; a separate SSTable is created for each locality group in each tablet. Segregating column families that are not typically accessed together enables more efficient reads. In WebTable, page metadata can be in one group and the contents of the page in another group.
30 REFINEMENTS: COMPRESSION Many opportunities for compression: similar values in the same row/column at different timestamps; similar values in different columns; similar values across adjacent rows. Two-pass custom compression scheme: the first pass compresses long common strings across a large window; the second pass looks for repetitions in a small window. Speed is emphasized, but space reduction is still good (10-to-1).
31 FILTER Given a collection of tuples, filtering simply evaluates each record separately and decides, based on some condition, whether it should stay or go Scan through a file line-by-line and only output lines that match a specific pattern Simple random sampling: grab a subset of our larger data set in which each record has an equal probability of being selected (decrease the dataset size) Instead of some filter criteria function that bears some relationship to the content of the record, a random number generator will produce a value, and if the value is below a threshold, keep the record. Otherwise, toss it out Bloom: keep records that are member of some predefined set of values (hot values) For each record, extract a feature of that record. If that feature is a member of a set of values represented by a Bloom filter, keep it; otherwise toss it out (or the reverse). For example: keep or throw away this record if the value in the user field is a member of a predetermined list of users. 31
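The simple random sampling described above needs only a mapper: each record survives independently with some probability, and no reducer is required. A minimal sketch (function name and seeding are illustrative):

```python
import random

def sample_mapper(records, rate, seed=None):
    """Map-only simple random sampling: keep each record independently
    with probability `rate`, so each record has an equal chance of
    being selected."""
    rng = random.Random(seed)   # fixed seed makes the sample reproducible
    for record in records:
        if rng.random() < rate:
            yield record

sample = list(sample_mapper(range(10_000), rate=0.1, seed=42))
# roughly 1,000 of the 10,000 records survive
```

Because the decision ignores record content, the sample is unbiased; a Bloom-filter-based filter (next slide) instead keeps records whose feature belongs to a predefined hot set.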
32 BLOOM FILTER A Bloom filter is a probabilistic data structure: it tells us that an element either definitely is not in the set or may be in the set. The base data structure of a Bloom filter is a bit vector, where each cell represents a bit. To add an element to the Bloom filter, we simply hash it a few times and set the bits in the bit vector at the index of those hashes to 1.
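The bit-vector mechanics just described fit in a short class. This is a minimal sketch (sizes and the salted-SHA-256 hashing are illustrative choices, not a production design):

```python
import hashlib

class BloomFilter:
    """Bit-vector Bloom filter: k hash functions set/check k bits.
    `might_contain` can return false positives but never false negatives."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size

    def _positions(self, item):
        # derive k bit positions by salting one hash function k ways
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

hot_users = BloomFilter()
for user in ["alice", "bob"]:
    hot_users.add(user)
hot_users.might_contain("alice")    # True: members always pass
hot_users.might_contain("mallory")  # almost certainly False
```

This is exactly the structure used by the Bloom filtering pattern: a mapper keeps a record only when the user field might be in the hot set, accepting a small false-positive rate in exchange for constant memory.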
33 A FORM OF OPTIMIZATION FOR ACCESSING HBASE SEMI-HANDS ON (SEE EXERCISE 4) 33
34 REFINEMENTS: BLOOM FILTERS A read operation has to read from disk when the desired SSTable isn't in memory. The number of accesses can be reduced by specifying a Bloom filter, which allows us to ask whether an SSTable might contain data for a specified row/column pair. A small amount of memory for Bloom filters drastically reduces the number of disk seeks for read operations; their use implies that most lookups for non-existent rows or columns do not need to touch disk.
35 NOSQL DATA PROCESSING PROPERTIES «... only an afterthought and could cause problems once you need to scale the system. And if it does offer scalability, does it imply specific steps to do so? The easiest solution would be to add one machine at a time, while sharded setups (especially those not supporting virtual shards) sometimes require for each shard to be increased simultaneously because each partition needs to be equally powerful.» Lars George, HBase: The Definitive Guide, O'Reilly
36 NOSQL DATA PROCESSING PROPERTIES 36
38 SOME BOOKS
Tom White, Hadoop: The Definitive Guide, O'Reilly, 2011
Jimmy Lin, Chris Dyer, Data-Intensive Text Processing with MapReduce, Morgan & Claypool, 2010
Syed Ahson, Mohammad Ilyas, Cloud Computing and Software Services: Theory and Techniques, CRC Press
Bradley Holt, Writing and Querying MapReduce Views in CouchDB, O'Reilly, 2011, pages 5-29
Pramod J. Sadalage, Martin Fowler, NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence
39 NOSQL STORES: AVAILABILITY AND PERFORMANCE 39
40 REPLICATION: MASTER-SLAVE Makes one node the authoritative copy (the master) that handles writes, while the replicas (slaves) synchronize with the master and may handle reads: all updates are made to the master, changes propagate to the slaves, and reads can be done from the master or the slaves. Helps with read scalability but does not help with write scalability. Read resilience: should the master fail, slaves can still handle read requests. Master failure eliminates the ability to handle writes until either the master is restored or a new master is appointed; the master is a bottleneck and a point of failure. The biggest complication is consistency, e.g. a write-write conflict: an attempt to update the same record at the same time from two different places. (By contrast, in peer-to-peer replication all replicas have the same weight, replicas can all accept writes, and the loss of one of them does not prevent access to the data store.)
41 MASTER-SLAVE REPLICATION MANAGEMENT Masters can be appointed manually, when configuring the node cluster, or automatically: when configuring a node cluster, one of the nodes is elected as master, and a new master can be appointed when the current master fails, reducing downtime. Read resilience: read and write paths have to be managed separately so that reads can still occur when there is a failure in the write path; reads and writes are put on different database connections if the database library supports it. Replication comes inevitably with a dark side: inconsistency. Different clients reading different slaves will see different values if changes have not been propagated to all slaves; in the worst case, a client cannot read a write it just made. Even if master-slave replication is used for hot backups, if the master fails, any updates not yet passed on to the backup are lost.
42 REPLICATION: PEER-TO-PEER Allows writes to any node; the nodes coordinate to synchronize their copies (nodes communicate their writes; all nodes read and write all data). The replicas have equal weight, giving full performance on writing to any replica and survival to the loss of a minority of the replica nodes. Must deal with inconsistencies: replicas coordinate to avoid conflicts, at a network-traffic cost for coordinating writes; it is unnecessary to make all replicas agree on a write, only a majority; a policy is needed to merge inconsistent writes.
43 REPLICATION: ASPECTS TO CONSIDER Replication conditions both performance and fault tolerance. Important elements to consider: the data to duplicate; the location of the copies; the duplication model (master-slave / P2P); the consistency model (global copies); transparency levels; availability. Find a compromise!
44 SHARDING Puts different data on separate nodes; each shard reads and writes its own data. Each user only talks to one server, so she gets rapid responses, and the load should be balanced out nicely between servers. Ensure that data that is accessed together is clumped together on the same node, and that clumps are arranged on the nodes to provide the best data access. The ability to distribute both data and the load of simple operations over many servers, with no RAM or disk shared among servers. A way to horizontally scale writes and improve read performance. Requires application/data store support.
45 SHARDING Database laws: small databases are fast, big databases are slow, so keep databases small. Principle: start with a big monolithic database and break it into smaller databases across many clusters, using a key value. Instead of having one million customers' information on a single big machine, spread the customers over smaller, different machines.
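The "break one big database into smaller ones using a key value" principle can be sketched with the simplest possible sharding function, modulo on the customer ID (shard count and key are illustrative):

```python
def shard_for_customer(customer_id, num_shards):
    # modulo sharding: the key value alone determines the target machine
    return customer_id % num_shards

num_shards = 4
shards = {s: [] for s in range(num_shards)}
for customer_id in range(1_000_000):
    shards[shard_for_customer(customer_id, num_shards)].append(customer_id)
# one million customers end up as 250,000 per shard instead of
# all of them on a single big machine
```

Modulo is trivial to compute but remaps almost every key when num_shards changes, which is why the later slide warns that re-sharding is operationally difficult.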
46 SHARDING CRITERIA Partitioning. Relational: handled by the DBMS (homogeneous DBMS). NoSQL: based on ranges of the key value. Federation. Relational: combine tables stored in different physical databases; easier with denormalized data. NoSQL: store together data that are accessed together; aggregates are the unit of distribution.
47 SHARDING Architecture: each application server (AS) runs a DBS client; each shard server runs a database server, plus replication agents and query agents for supporting parallel query functionality. Process: pick a dimension that makes sharding easy (customers, countries, addresses); pick strategies that will last a long time, as repartitioning/re-sharding of data is operationally difficult. This is done according to two different principles. Partitioning: a partition is a structure that divides a space into two parts. Federation: a set of things that together compose a centralized unit but each individually maintains some aspect of autonomy. Example: customer data is partitioned by ID into shards, using an algorithm to determine which shard a customer ID belongs to.
49 PARTITIONING A PARTITION IS A STRUCTURE THAT DIVIDES A SPACE INTO TWO PARTS
50 BACKGROUND: DISTRIBUTED RELATIONAL DATABASES External schemas (views) are often subsets of relations (e.g. contacts in Europe and contacts in America); access is defined on subsets of relations: 80% of the queries issued in a region have to do with contacts of that region. Partitioning relations gives a better concurrency level, since fragments are accessed independently. Implications: integrity constraints must be checked and relations rebuilt.
51 FRAGMENTATION Horizontal: groups tuples of the same relation (e.g. budget < or >= some threshold). Vertical: groups attributes of the same relation (e.g. separate budget from loc and pname of the relation project); non-disjoint fragments are more difficult to manage. Hybrid: a combination of both.
52 FRAGMENTATION: RULES Vertical: clustering (grouping elementary fragments, e.g. budget and location information in two relations) and splitting (decomposing a relation according to affinity relationships among attributes). Horizontal: tuples of the same fragment must be statistically homogeneous (if t1 and t2 are tuples of the same fragment, then t1 and t2 have the same probability of being selected by a query); keep the important conditions. Complete: every tuple (attribute) belongs to a fragment, without information loss; if tuples where budget >= the threshold are more likely to be selected, then that condition is a good candidate. Minimum: if no application distinguishes between budget >= and budget < the threshold, then these conditions are unnecessary.
53 SHARDING: HORIZONTAL PARTITIONING The entities of a database are split into two or more sets (by row); in relational terms, the same schema lives on several physical bases/servers. Example: partition contacts into Europe and America shards, where the zip code indicates where they will be found (e.g. odd IDs on one master, even IDs on another). Efficient if there exists some robust and implicit way to identify in which partition to find a particular entity; otherwise a last-resort shard is needed. Needs a sharding function: modulo, round robin, hash partition, or range partition.
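The Europe/America contacts example above is a range-style partition on an implicit attribute. A minimal sketch, where the country codes and shard names are made up for illustration:

```python
def shard_by_region(contact):
    # the region attribute of the row implicitly names the shard that
    # holds it, so no lookup table is needed to route a query
    european = {"FR", "DE", "ES", "IT", "UK"}
    return "shard_europe" if contact["country"] in european else "shard_america"

contacts = [
    {"name": "A. Martin", "country": "FR"},
    {"name": "J. Doe", "country": "US"},
]
placements = {c["name"]: shard_by_region(c) for c in contacts}
# {'A. Martin': 'shard_europe', 'J. Doe': 'shard_america'}
```

Because 80% of queries in a region touch only that region's contacts (previous slide), most queries hit a single shard.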
54 FEDERATION A FEDERATION IS A SET OF THINGS THAT TOGETHER COMPOSE A CENTRALIZED UNIT BUT EACH INDIVIDUALLY MAINTAINS SOME ASPECT OF AUTONOMY 54
55 FEDERATION: VERTICAL SHARDING Principle: partition data according to their logical affiliation, putting together data that are commonly accessed. The search load for the large partitioned entity can be split across multiple servers (logical and physical), not only across multiple indexes in the same logical server. Different schemas, systems, and physical bases/servers, e.g. a resume database and a site database on separate masters. Shards the components of a site and not only the data.
56 NOSQL STORES: PERSISTENCY MANAGEMENT 56
57 «MEMCACHED» «memcached» is a memory-management protocol based on a cache: it uses the key-value notion, and information is completely stored in RAM. The «memcached» protocol supports creating, retrieving, updating, and deleting information from the database. Several applications run their own «memcached» manager (Google, Facebook, YouTube, FarmVille, Twitter, Wikipedia).
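The create/retrieve/update/delete operations of a memcached-style store reduce to a few methods over an in-RAM dictionary. This is a toy single-process sketch, not the actual memcached wire protocol:

```python
class InMemoryCache:
    """Toy memcached-style key-value store: all data lives in RAM,
    exposed through the basic create/retrieve/update/delete operations."""
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # create a new entry or update an existing one
        self._data[key] = value

    def get(self, key):
        # retrieve; returns None on a cache miss
        return self._data.get(key)

    def delete(self, key):
        # returns True when something was actually removed
        return self._data.pop(key, None) is not None

cache = InMemoryCache()
cache.set("user:1", {"name": "alice"})
cache.get("user:1")      # {'name': 'alice'}
cache.delete("user:1")   # True
cache.get("user:1")      # None
```

The real protocol distributes keys over many such nodes (typically by hashing the key to pick a server) and evicts entries under memory pressure, which the next slides discuss.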
58 STORAGE ON DISC (1) For efficiency reasons, information is stored using the RAM: working information is kept in RAM in order to answer low-latency requests. Yet this is not always possible or desirable. The process of moving data from RAM to disc is called "eviction"; this process is configured automatically for every bucket.
59 STORAGE ON DISC (2) NoSQL servers support the storage of key-value pairs on disc: persistency can be executed by loading data, closing and reinitializing it without having to load data from another source. Hot backups: loaded data are stored on disc so that the store can be reinitialized in case of failure. Storage on disc: the disc is used when the quantity of data is higher than the physical size of the RAM; frequently used information is maintained in RAM and the rest is stored on disc.
60 STORAGE ON DISC (3) Strategies for ensuring persistency: each node maintains in RAM information on the key-value pairs it stores; keys may not be found, or they can be stored in memory or on disc. The process of moving information from RAM to disc is asynchronous: the server can continue processing new requests while a queue manages the requests to disc. In periods with many write requests, clients can be notified that the server is temporarily out of memory until information is evicted.
61 NOSQL STORES: CONCURRENCY CONTROL 61
62 MULTI-VERSION CONCURRENCY CONTROL (MVCC) Objective: provide concurrent access to the database (and, in programming languages, to implement transactional memory). Problem: if someone is reading from a database at the same time as someone else is writing to it, the reader could see a half-written or inconsistent piece of data. With locks, readers wait until the writer is done. With MVCC, each user connected to the database sees a snapshot of the database at a particular instant in time: any changes made by a writer will not be seen by other users until the changes have been completed (until the transaction has been committed). When an MVCC database needs to update an item of data, it marks the old data as obsolete and adds the newer version elsewhere: multiple versions are stored, but only one is the latest. Writes can be isolated by virtue of the old versions being maintained. MVCC generally requires the system to periodically sweep through and delete the old, obsolete data objects.
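The snapshot-per-reader behaviour described above can be demonstrated with a toy versioned store; the class is a sketch of the idea, not any database's implementation:

```python
import itertools

class MVCCStore:
    """Toy multi-version store: writers append new versions instead of
    overwriting in place; readers see the snapshot as of their timestamp."""
    def __init__(self):
        self._clock = itertools.count(1)
        self._versions = {}   # key -> list of (commit_ts, value), oldest first

    def write(self, key, value):
        ts = next(self._clock)              # commit timestamp
        self._versions.setdefault(key, []).append((ts, value))
        return ts

    def read(self, key, snapshot_ts):
        # newest version committed at or before the reader's snapshot
        visible = [v for ts, v in self._versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

    def vacuum(self, key):
        # the periodic sweep: delete obsolete versions, keep the latest
        if self._versions.get(key):
            self._versions[key] = self._versions[key][-1:]

store = MVCCStore()
t1 = store.write("x", "v1")
t2 = store.write("x", "v2")
store.read("x", snapshot_ts=t1)   # 'v1' (the older snapshot still sees v1)
store.read("x", snapshot_ts=t2)   # 'v2'
```

A reader holding snapshot t1 never observes the half-finished effect of the later write, which is exactly the isolation MVCC buys without making readers wait on a lock.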
More informationClustering Lecture 8: MapReduce
Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data
More informationDept. Of Computer Science, Colorado State University
CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,
More information4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS
W13.A.0.0 CS435 Introduction to Big Data W13.A.1 FAQs Programming Assignment 3 has been posted PART 2. LARGE SCALE DATA STORAGE SYSTEMS DISTRIBUTED FILE SYSTEMS Recitations Apache Spark tutorial 1 and
More informationCS November 2017
Bigtable Highly available distributed storage Distributed Systems 18. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account
More informationNoSQL systems: sharding, replication and consistency. Riccardo Torlone Università Roma Tre
NoSQL systems: sharding, replication and consistency Riccardo Torlone Università Roma Tre Data distribution NoSQL systems: data distributed over large clusters Aggregate is a natural unit to use for data
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationHADOOP FRAMEWORK FOR BIG DATA
HADOOP FRAMEWORK FOR BIG DATA Mr K. Srinivas Babu 1,Dr K. Rameshwaraiah 2 1 Research Scholar S V University, Tirupathi 2 Professor and Head NNRESGI, Hyderabad Abstract - Data has to be stored for further
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 10: Mutable State (1/2) March 15, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These
More informationDatabase Applications (15-415)
Database Applications (15-415) Hadoop Lecture 24, April 23, 2014 Mohammad Hammoud Today Last Session: NoSQL databases Today s Session: Hadoop = HDFS + MapReduce Announcements: Final Exam is on Sunday April
More informationA BigData Tour HDFS, Ceph and MapReduce
A BigData Tour HDFS, Ceph and MapReduce These slides are possible thanks to these sources Jonathan Drusi - SCInet Toronto Hadoop Tutorial, Amir Payberah - Course in Data Intensive Computing SICS; Yahoo!
More informationDistributed Computation Models
Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case
More informationBigtable. Presenter: Yijun Hou, Yixiao Peng
Bigtable Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber Google, Inc. OSDI 06 Presenter: Yijun Hou, Yixiao Peng
More informationChapter 5. The MapReduce Programming Model and Implementation
Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing
More informationNoSQL systems. Lecture 21 (optional) Instructor: Sudeepa Roy. CompSci 516 Data Intensive Computing Systems
CompSci 516 Data Intensive Computing Systems Lecture 21 (optional) NoSQL systems Instructor: Sudeepa Roy Duke CS, Spring 2016 CompSci 516: Data Intensive Computing Systems 1 Key- Value Stores Duke CS,
More informationDatabase Architectures
Database Architectures CPS352: Database Systems Simon Miner Gordon College Last Revised: 11/15/12 Agenda Check-in Centralized and Client-Server Models Parallelism Distributed Databases Homework 6 Check-in
More informationReferences. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals
References CSE 444: Database Internals Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol 39, No 4) Lectures 26 NoSQL: Extensible Record Stores Bigtable: A Distributed
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 10: Mutable State (1/2) March 14, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These
More informationDatacenter replication solution with quasardb
Datacenter replication solution with quasardb Technical positioning paper April 2017 Release v1.3 www.quasardb.net Contact: sales@quasardb.net Quasardb A datacenter survival guide quasardb INTRODUCTION
More informationBigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI Presented by Xiang Gao
Bigtable: A Distributed Storage System for Structured Data By Fay Chang, et al. OSDI 2006 Presented by Xiang Gao 2014-11-05 Outline Motivation Data Model APIs Building Blocks Implementation Refinement
More informationWe are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info
We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423
More informationCompSci 516 Database Systems
CompSci 516 Database Systems Lecture 20 NoSQL and Column Store Instructor: Sudeepa Roy Duke CS, Fall 2018 CompSci 516: Database Systems 1 Reading Material NOSQL: Scalable SQL and NoSQL Data Stores Rick
More informationCA485 Ray Walshe Google File System
Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage
More informationNoSQL Databases. Amir H. Payberah. Swedish Institute of Computer Science. April 10, 2014
NoSQL Databases Amir H. Payberah Swedish Institute of Computer Science amir@sics.se April 10, 2014 Amir H. Payberah (SICS) NoSQL Databases April 10, 2014 1 / 67 Database and Database Management System
More information18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.
18-hdfs-gfs.txt Thu Oct 27 10:05:07 2011 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2011 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File
More informationProgramming Models MapReduce
Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model
More informationData Partitioning and MapReduce
Data Partitioning and MapReduce Krzysztof Dembczyński Intelligent Decision Support Systems Laboratory (IDSS) Poznań University of Technology, Poland Intelligent Decision Support Systems Master studies,
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data
More informationBigtable: A Distributed Storage System for Structured Data. Andrew Hon, Phyllis Lau, Justin Ng
Bigtable: A Distributed Storage System for Structured Data Andrew Hon, Phyllis Lau, Justin Ng What is Bigtable? - A storage system for managing structured data - Used in 60+ Google services - Motivation:
More informationMapReduce. U of Toronto, 2014
MapReduce U of Toronto, 2014 http://www.google.org/flutrends/ca/ (2012) Average Searches Per Day: 5,134,000,000 2 Motivation Process lots of data Google processed about 24 petabytes of data per day in
More informationCS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs
11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.0.0 CS435 Introduction to Big Data 11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.1 FAQs Deadline of the Programming Assignment 3
More informationA Glimpse of the Hadoop Echosystem
A Glimpse of the Hadoop Echosystem 1 Hadoop Echosystem A cluster is shared among several users in an organization Different services HDFS and MapReduce provide the lower layers of the infrastructures Other
More informationBig Data Analytics. Rasoul Karimi
Big Data Analytics Rasoul Karimi Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 1 Outline
More informationCS November 2018
Bigtable Highly available distributed storage Distributed Systems 19. Bigtable Built with semi-structured data in mind URLs: content, metadata, links, anchors, page rank User data: preferences, account
More informationApache Hadoop Goes Realtime at Facebook. Himanshu Sharma
Apache Hadoop Goes Realtime at Facebook Guide - Dr. Sunny S. Chung Presented By- Anand K Singh Himanshu Sharma Index Problem with Current Stack Apache Hadoop and Hbase Zookeeper Applications of HBase at
More informationIntroduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data
Introduction to Hadoop High Availability Scaling Advantages and Challenges Introduction to Big Data What is Big data Big Data opportunities Big Data Challenges Characteristics of Big data Introduction
More informationBigData and Map Reduce VITMAC03
BigData and Map Reduce VITMAC03 1 Motivation Process lots of data Google processed about 24 petabytes of data per day in 2009. A single machine cannot serve all the data You need a distributed system to
More informationIntroduction Aggregate data model Distribution Models Consistency Map-Reduce Types of NoSQL Databases
Introduction Aggregate data model Distribution Models Consistency Map-Reduce Types of NoSQL Databases Key-Value Document Column Family Graph John Edgar 2 Relational databases are the prevalent solution
More informationFLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568
FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 11. Advanced Aspects of Big Data Management Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/
More informationNOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY
NOSQL EGCO321 DATABASE SYSTEMS KANAT POOLSAWASD DEPARTMENT OF COMPUTER ENGINEERING MAHIDOL UNIVERSITY WHAT IS NOSQL? Stands for No-SQL or Not Only SQL. Class of non-relational data storage systems E.g.
More informationHDFS: Hadoop Distributed File System. CIS 612 Sunnie Chung
HDFS: Hadoop Distributed File System CIS 612 Sunnie Chung What is Big Data?? Bulk Amount Unstructured Introduction Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per
More informationGFS: The Google File System. Dr. Yingwu Zhu
GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung SOSP 2003 presented by Kun Suo Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in
More informationHadoop An Overview. - Socrates CCDH
Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected
More informationGhislain Fourny. Big Data 5. Wide column stores
Ghislain Fourny Big Data 5. Wide column stores Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage 2 Where we are User interfaces
More informationIntroduction to NoSQL Databases
Introduction to NoSQL Databases Roman Kern KTI, TU Graz 2017-10-16 Roman Kern (KTI, TU Graz) Dbase2 2017-10-16 1 / 31 Introduction Intro Why NoSQL? Roman Kern (KTI, TU Graz) Dbase2 2017-10-16 2 / 31 Introduction
More informationDistributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2017
Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2017 1 Google Chubby ( Apache Zookeeper) 2 Chubby Distributed lock service + simple fault-tolerant file system
More informationDistributed Systems. 15. Distributed File Systems. Paul Krzyzanowski. Rutgers University. Fall 2016
Distributed Systems 15. Distributed File Systems Paul Krzyzanowski Rutgers University Fall 2016 1 Google Chubby 2 Chubby Distributed lock service + simple fault-tolerant file system Interfaces File access
More informationCSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2015 Lecture 14 NoSQL
CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2015 Lecture 14 NoSQL References Scalable SQL and NoSQL Data Stores, Rick Cattell, SIGMOD Record, December 2010 (Vol. 39, No.
More informationVoldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation
Voldemort Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/29 Outline 1 2 3 Smruti R. Sarangi Leader Election 2/29 Data
More informationDistributed Systems. GFS / HDFS / Spanner
15-440 Distributed Systems GFS / HDFS / Spanner Agenda Google File System (GFS) Hadoop Distributed File System (HDFS) Distributed File Systems Replication Spanner Distributed Database System Paxos Replication
More informationΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing
ΕΠΛ 602:Foundations of Internet Technologies Cloud Computing 1 Outline Bigtable(data component of cloud) Web search basedonch13of thewebdatabook 2 What is Cloud Computing? ACloudis an infrastructure, transparent
More informationBigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13
Bigtable A Distributed Storage System for Structured Data Presenter: Yunming Zhang Conglong Li References SOCC 2010 Key Note Slides Jeff Dean Google Introduction to Distributed Computing, Winter 2008 University
More informationAdvanced Database Technologies NoSQL: Not only SQL
Advanced Database Technologies NoSQL: Not only SQL Christian Grün Database & Information Systems Group NoSQL Introduction 30, 40 years history of well-established database technology all in vain? Not at
More informationIntroduction to Distributed Data Systems
Introduction to Distributed Data Systems Serge Abiteboul Ioana Manolescu Philippe Rigaux Marie-Christine Rousset Pierre Senellart Web Data Management and Distribution http://webdam.inria.fr/textbook January
More informationCassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent
Tanton Jeppson CS 401R Lab 3 Cassandra, MongoDB, and HBase Introduction For my report I have chosen to take a deeper look at 3 NoSQL database systems: Cassandra, MongoDB, and HBase. I have chosen these
More informationCS /29/18. Paul Krzyzanowski 1. Question 1 (Bigtable) Distributed Systems 2018 Pre-exam 3 review Selected questions from past exams
Question 1 (Bigtable) What is an SSTable in Bigtable? Distributed Systems 2018 Pre-exam 3 review Selected questions from past exams It is the internal file format used to store Bigtable data. It maps keys
More informationHDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017
HDFS Architecture Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 Based Upon: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoopproject-dist/hadoop-hdfs/hdfsdesign.html Assumptions At scale, hardware
More informationCOSC 416 NoSQL Databases. NoSQL Databases Overview. Dr. Ramon Lawrence University of British Columbia Okanagan
COSC 416 NoSQL Databases NoSQL Databases Overview Dr. Ramon Lawrence University of British Columbia Okanagan ramon.lawrence@ubc.ca Databases Brought Back to Life!!! Image copyright: www.dragoart.com Image
More informationCS /30/17. Paul Krzyzanowski 1. Google Chubby ( Apache Zookeeper) Distributed Systems. Chubby. Chubby Deployment.
Distributed Systems 15. Distributed File Systems Google ( Apache Zookeeper) Paul Krzyzanowski Rutgers University Fall 2017 1 2 Distributed lock service + simple fault-tolerant file system Deployment Client
More informationBig Table. Google s Storage Choice for Structured Data. Presented by Group E - Dawei Yang - Grace Ramamoorthy - Patrick O Sullivan - Rohan Singla
Big Table Google s Storage Choice for Structured Data Presented by Group E - Dawei Yang - Grace Ramamoorthy - Patrick O Sullivan - Rohan Singla Bigtable: Introduction Resembles a database. Does not support
More informationFinal Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23
Final Exam Review 2 Kathleen Durant CS 3200 Northeastern University Lecture 23 QUERY EVALUATION PLAN Representation of a SQL Command SELECT {DISTINCT} FROM {WHERE
More informationDistributed Systems Pre-exam 3 review Selected questions from past exams. David Domingo Paul Krzyzanowski Rutgers University Fall 2018
Distributed Systems 2018 Pre-exam 3 review Selected questions from past exams David Domingo Paul Krzyzanowski Rutgers University Fall 2018 November 28, 2018 1 Question 1 (Bigtable) What is an SSTable in
More informationBig Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing
Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela
More informationIntroduction to Hadoop. Owen O Malley Yahoo!, Grid Team
Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since
More informationGhislain Fourny. Big Data 5. Column stores
Ghislain Fourny Big Data 5. Column stores 1 Introduction 2 Relational model 3 Relational model Schema 4 Issues with relational databases (RDBMS) Small scale Single machine 5 Can we fix a RDBMS? Scale up
More informationArchitekturen für die Cloud
Architekturen für die Cloud Eberhard Wolff Architecture & Technology Manager adesso AG 08.06.11 What is Cloud? National Institute for Standards and Technology (NIST) Definition On-demand self-service >
More informationNoSQL Databases. CPS352: Database Systems. Simon Miner Gordon College Last Revised: 4/22/15
NoSQL Databases CPS352: Database Systems Simon Miner Gordon College Last Revised: 4/22/15 Agenda Check-in NoSQL Databases Aggregate databases Key-value, document, and column family Graph databases Related
More informationDIVING IN: INSIDE THE DATA CENTER
1 DIVING IN: INSIDE THE DATA CENTER Anwar Alhenshiri Data centers 2 Once traffic reaches a data center it tunnels in First passes through a filter that blocks attacks Next, a router that directs it to
More informationA Fast and High Throughput SQL Query System for Big Data
A Fast and High Throughput SQL Query System for Big Data Feng Zhu, Jie Liu, and Lijie Xu Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing, China 100190
More informationIntroduction to BigData, Hadoop:-
Introduction to BigData, Hadoop:- Big Data Introduction: Hadoop Introduction What is Hadoop? Why Hadoop? Hadoop History. Different types of Components in Hadoop? HDFS, MapReduce, PIG, Hive, SQOOP, HBASE,
More informationOral Questions and Answers (DBMS LAB) Questions & Answers- DBMS
Questions & Answers- DBMS https://career.guru99.com/top-50-database-interview-questions/ 1) Define Database. A prearranged collection of figures known as data is called database. 2) What is DBMS? Database
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff and Shun Tak Leung Google* Shivesh Kumar Sharma fl4164@wayne.edu Fall 2015 004395771 Overview Google file system is a scalable distributed file system
More information