QUERYING (BIG) DATA ON NOSQL STORES



1 QUERYING (BIG) DATA ON NOSQL STORES
GENOVEVA VARGAS SOLAR
FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE

2 DATA PROCESSING
Process the data to produce other data: analysis tools, business intelligence tools, ...
This means:
  Handle large volumes of data
  Manage thousands of processors
  Parallelize and distribute treatments
  Schedule I/O
  Manage fault tolerance
  Monitor and control processes
Map-Reduce provides all of this, easily!

3 NOSQL QUERY EXECUTION
1 Data organization: distributed file system, indexing
2 Query programming → Map, Reduce: Map-Reduce programming patterns
3 Query execution: execution model, data processing properties
Optimizing query processing → data compression, location, R/W strategies, cache/memory organization
[Figure: master-slave replication. All updates are made to the master; changes propagate to slaves; reads can be done from master or slaves]

4 DATA ORGANIZATION: THE NOSQL CASE

5 NOSQL STORES: DATA MANAGEMENT PROPERTIES
Indexing
  Distributed hashing, as in Memcached (an open source cache)
  In-memory indexes are scalable when distributing and replicating objects over multiple nodes
  Partitioned tables
High availability and scalability: eventual consistency
  Data fetched are not guaranteed to be up to date
  Updates are guaranteed to be propagated to all nodes eventually
Shared-nothing horizontal scaling
  Replicating and partitioning data over many servers
  Support for a large number of simple read/write operations per second (OLTP)
No ACID guarantees
  Updates are eventually propagated, but with limited guarantees on read consistency
  BASE: basically available, soft state, eventually consistent
  Multi-version concurrency control


6 PROBLEM STATEMENT: HOW MUCH TO GIVE UP?
CAP theorem [1]: a system can have two of the three properties
  Consistency
  Availability
  Fault-tolerant partitioning
NoSQL systems sacrifice consistency
[1] Eric Brewer, "Towards robust distributed systems." PODC /PODC- keynote.pdf


7 NOSQL STORES: AVAILABILITY AND PERFORMANCE
Replication
  Copy data across multiple servers (each bit of data can be found on multiple servers)
  Increase data availability
  Faster query evaluation
Sharding
  Distribute different data across multiple servers
  Each server acts as the single source of a data subset
Replication and sharding are orthogonal techniques

8 REPLICATION: PROS & CONS
Data is more available
  Failure of a site containing E does not result in unavailability of E if replicas exist
Performance
  Parallelism: queries are processed in parallel on several nodes
  Reduced data transfer for local data
Increased update cost
  Synchronisation: each replica must be updated
Increased complexity of concurrency control
  Concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented

9 SHARDING: WHY IS IT USEFUL?
Scaling applications by reducing the data sets in any single database
  Segregating data
  Sharing application data
  Securing sensitive data by isolating it
Improve read and write performance
  A smaller amount of data in each user group implies faster querying
  Isolating data into smaller shards: accessed data is more likely to stay in cache
  More write bandwidth: writing can be done in parallel
  Smaller data sets are easier to back up, restore, and manage
Massively parallel work
  Parallel work: scale out across more nodes
  Parallel backend: handling higher user loads
  Share nothing: very few bottlenecks
Decrease resilience, improve availability
  If a box goes down, the others still operate
  But: part of the data is missing
[Figure: load balancer; Web 1 to Web 3; Cache 1 to Cache 3; one MySQL master for the resume database and one for the site database]

10 SHARDING AND REPLICATION
Sharding with no replication: unique copy, distributed data sets
  (+) Better concurrency levels (shards are accessed independently)
  (-) Cost of checking constraints, rebuilding aggregates
  Ensure that queries and updates are distributed across shards
Replication of shards
  (+) Query performance (availability)
  (-) Cost of updating, of checking constraints, complexity of concurrency control
Partial replication (most of the time)
  Only some shards are duplicated

11 QUERY PROGRAMMING: DATA PROCESSING USING MAP-REDUCE

12 MAP-REDUCE
Programming model for expressing distributed computations on massive amounts of data
Execution framework for large-scale data processing on clusters of commodity servers
Market: any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing content must tackle large-data problems
  Data-intensive processing is beyond the capability of any individual machine and requires clusters
  Large-data problems are fundamentally about organizing computations on dozens, hundreds, or even thousands of machines
«Data represent the rising tide that lifts all boats: more data lead to better algorithms and systems for solving real-world problems»

13 COUNTING WORDS
(URI, document) → (term, count)
Input:        "see bob throw", "see spot run"
Map:          see 1, bob 1, throw 1 | see 1, spot 1, run 1
Shuffle/Sort: bob <1>, run <1>, see <1,1>, spot <1>, throw <1>
Reduce:       bob 1, run 1, see 2, spot 1, throw 1
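The word-count flow on this slide can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop code; the function names (map_phase, shuffle_sort, reduce_phase) and the document ids doc1 and doc2 are made up for the example.

```python
from collections import defaultdict

def map_phase(uri, document):
    """Map: emit one (term, 1) pair per word occurrence in the document."""
    return [(term, 1) for term in document.split()]

def shuffle_sort(pairs):
    """Shuffle/Sort: group the values of each key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(term, counts):
    """Reduce: sum the counts collected for one term."""
    return (term, sum(counts))

# Hypothetical input: two small documents keyed by a made-up URI.
documents = {"doc1": "see bob throw", "doc2": "see spot run"}

intermediate = [pair
                for uri, text in documents.items()
                for pair in map_phase(uri, text)]
grouped = shuffle_sort(intermediate)
word_counts = [reduce_phase(term, counts) for term, counts in grouped]
```

With the two documents above, grouped holds see <1,1> and word_counts ends with see 2, as on the slide.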

14 MAP REDUCE DESIGN PATTERNS
SUMMARIZATION
  Numerical: minimum, maximum, count, average, median, standard deviation
  Inverted index. Example: Wikipedia inverted index
  Counting with counters: count number of records, a small number of unique instances, summations. Example: number of users per state
FILTERING
  Filtering: closer view of data, tracking event threads, distributed grep, data cleansing, simple random sampling, removing low-scoring data
  Bloom: remove most of the nonwatched values, prefiltering data for a set membership check. Example: hot list, HBase query
  Top ten: outlier analysis, selecting interesting data, catchy dashboards. Example: top ten users by reputation
  Distinct: deduplicating data, getting distinct values, protecting from inner join explosion. Example: distinct user ids
DATA ORGANIZATION
  Structured to hierarchical: prejoining data, preparing data for HBase or MongoDB. Example: post/comment building for StackOverflow, question/answer building
  Partitioning. Example: partitioning users by last access date
  Binning. Example: binning by Hadoop-related tags
  Total order sorting. Example: sort users by last visit
  Shuffling. Example: anonymizing StackOverflow comments
JOIN
  Reduce side join: multiple large data sets joined by foreign key. Example: user comment join
  Reduce side join with Bloom filter. Example: reputable user comment join
  Replicated join. Example: replicated user comment join
  Composite join. Example: composite user comment join
  Cartesian product. Example: comment comparison

15 ELEMENTS TO THINK ABOUT FOR EFFICIENT EXECUTION

16 HADOOP INFRASTRUCTURE

17 HADOOP FRAMEWORK
Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data
Hadoop MapReduce: a software framework for distributed processing of large data sets on compute clusters
HBase: a scalable, distributed database that supports structured data storage for large tables
Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying
Chukwa: a data collection system for managing large distributed systems
Pig: a high-level data-flow language and execution framework for parallel computation
ZooKeeper: a high-performance coordination service for distributed applications

18 DISTRIBUTED FILE SYSTEM
Abandons the separation of computation and storage as distinct components in a cluster
  Google File System (GFS) supports Google's proprietary implementation of MapReduce
  In the open-source world, HDFS (Hadoop Distributed File System) is an open-source implementation of GFS that supports Hadoop
The main idea is to divide user data into blocks and replicate those blocks across the local disks of nodes in the cluster
Adopts a master-slave architecture
  Master (the namenode in HDFS) maintains the file namespace (metadata, directory structure, file-to-block mapping, location of blocks, and access permissions)
  Slaves (the datanodes in HDFS) manage the actual data blocks


19 HDFS GENERAL ARCHITECTURE
An application client wishing to read a file (or a portion thereof) must first contact the namenode to determine where the actual data is stored
  The namenode returns the relevant block id and the location where the block is held (i.e., which datanode)
  The client then contacts the datanode to retrieve the data
HDFS lies on top of the standard OS stack (e.g., Linux): blocks are stored on standard single-machine file systems

20 HDFS PROPERTIES
HDFS stores three separate copies of each data block to ensure reliability, availability, and performance
  In large clusters, the three replicas are spread across different physical racks
  HDFS is therefore resilient to two common failure scenarios: individual datanode crashes, and failures in networking equipment that bring an entire rack offline
  Replicating blocks across physical machines also increases opportunities to co-locate data and processing in the scheduling of MapReduce jobs, since multiple copies yield more opportunities to exploit locality
To create a new file and write data to HDFS
  The application client first contacts the namenode
  The namenode updates the file namespace after checking permissions and making sure the file doesn't already exist, then allocates a new block on a suitable datanode
  The application is directed to stream data directly to that datanode
  From the initial datanode, data is further propagated to additional replicas


21 HADOOP CLUSTER ARCHITECTURE
The HDFS namenode runs the namenode daemon
The job submission node runs the jobtracker, which is the single point of contact for a client wishing to execute a MapReduce job
The jobtracker
  Monitors the progress of running MapReduce jobs
  Is responsible for coordinating the execution of the mappers and reducers
  Tries to take advantage of data locality in scheduling map tasks


22 MAP-REDUCE PHASES
Initialisation
Map: record reader, mapper, combiner, and partitioner
Reduce: shuffle, sort, reducer, and output format
  Partition the input (key, value) pairs into chunks and run map() tasks in parallel
  After all map() tasks have completed, consolidate the values for each unique emitted key
  Partition the space of output map keys, and run reduce() in parallel

23 MAP SUB-PHASES
Record reader: translates an input split generated by the input format into records
  It parses the data into records, but does not parse the records themselves
  It passes the data to the mapper in the form of key/value pairs. Usually the key in this context is positional information and the value is the chunk of data that composes a record
Map: user-provided code is executed on each key/value pair from the record reader to produce zero or more new key/value pairs, called the intermediate pairs
  The key is what the data will be grouped on, and the value is the information pertinent to the analysis in the reducer
Combiner: an optional localized reducer
  Can group data in the map phase
  It takes the intermediate keys from the mapper and applies a user-provided method to aggregate values in the small scope of that one mapper
Partitioner: takes the intermediate key/value pairs from the mapper (or combiner) and splits them up into shards, one shard per reducer
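A minimal sketch of the combiner and partitioner steps, assuming word-count style (term, 1) pairs. The function names and the choice of CRC32 as the hash are assumptions made for the illustration; they are not taken from Hadoop, which has its own partitioner implementation.

```python
import zlib
from collections import Counter

def combine(intermediate_pairs):
    """Combiner: a local reducer that sums counts per key for ONE mapper."""
    totals = Counter()
    for key, value in intermediate_pairs:
        totals[key] += value
    return sorted(totals.items())

def partition(key, num_reducers):
    """Partitioner: deterministic hash, so a given key always reaches the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Output of one hypothetical mapper, before combining.
mapper_output = [("see", 1), ("bob", 1), ("see", 1)]
combined = combine(mapper_output)

# Split the combined pairs into one shard per reducer (two reducers assumed).
num_reducers = 2
shards = {r: [] for r in range(num_reducers)}
for key, value in combined:
    shards[partition(key, num_reducers)].append((key, value))
```

The combiner shrinks three intermediate pairs to two before anything crosses the network, which is the point of running it on the mapper side.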

24 REDUCE SUB-PHASES
Shuffle and sort: takes the output files written by all of the partitioners and downloads them to the local machine on which the reducer is running
  These individual data pieces are then sorted by key into one larger data list
  The purpose of this sort is to group equivalent keys together so that their values can be iterated over easily in the reduce task
Reduce: takes the grouped data as input and runs a reduce function once per key grouping
  The function is passed the key and an iterator over all of the values associated with that key
  Once the reduce function is done, it sends zero or more key/value pairs to the final step, the output format
Output format: translates the final key/value pairs from the reduce function and writes them out to a file through a record writer

25 CASE STUDY

26 EXTENSIBLE RECORD STORES
Basic data model is rows and columns
Basic scalability model is splitting rows and columns over multiple nodes
  Rows are split across nodes through sharding on the primary key
    Split by range rather than by hash function
    Rows are analogous to documents: variable number of attributes, attribute names must be unique
    Grouped into collections (tables)
    Queries on ranges of values do not go to every node
  Columns are distributed over multiple nodes using column groups
    Which columns are best stored together
    Column groups must be pre-defined with the extensible record stores
SYSTEM      ADDRESS
HBase       hbase.apache.com
HyperTable  hypertable.org
Cassandra   incubator.apache.org/cassandra

27 EXTENSIBLE RECORD DATA MODEL (HBASE EXAMPLE)
Most basic unit: column
  Each column may have multiple versions
  Each distinct value is contained in a separate cell
One or more columns form a row, addressed uniquely by a row key
A number of rows form a table
[Figure: table T1 with rows R-1 to R-n, column families F-1 and F-2, columns C1, C2, C3; each cell holds Version 1 and Version 2]

28 DATA ORGANIZATION

29 REFINEMENTS: LOCALITY GROUPS
Can group multiple column families into a locality group
  A separate SSTable is created for each locality group in each tablet
Segregating column families that are not typically accessed together enables more efficient reads
  In WebTable, page metadata can be in one group and the contents of the page in another group

30 REFINEMENTS: COMPRESSION
Many opportunities for compression
  Similar values in the same row/column at different timestamps
  Similar values in different columns
  Similar values across adjacent rows
Two-pass custom compression scheme
  First pass: compress long common strings across a large window
  Second pass: look for repetitions in a small window
Speed is emphasized, but with good space reduction (10-to-1)

31 FILTER
Given a collection of tuples, filtering simply evaluates each record separately and decides, based on some condition, whether it should stay or go
  Example: scan through a file line by line and only output lines that match a specific pattern
Simple random sampling: grab a subset of our larger data set in which each record has an equal probability of being selected (decreases the data set size)
  Instead of a filter criterion that bears some relationship to the content of the record, a random number generator produces a value; if the value is below a threshold, keep the record, otherwise toss it out
Bloom: keep records that are members of some predefined set of values (hot values)
  For each record, extract a feature of that record. If that feature is a member of the set of values represented by a Bloom filter, keep it; otherwise toss it out (or the reverse)
  For example: keep or throw away a record depending on whether the value in the user field is a member of a predetermined list of users
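The first two filters can be sketched as map-only functions that return either the record or nothing. This is a plain Python illustration; grep_mapper, sampling_mapper and the keep_fraction parameter are names invented for the example.

```python
import random
import re

def grep_mapper(line, pattern):
    """Distributed grep: emit the line only if it matches the pattern."""
    return [line] if re.search(pattern, line) else []

def sampling_mapper(record, keep_fraction, rng):
    """Simple random sampling: keep the record if the drawn value is below the threshold."""
    return [record] if rng.random() < keep_fraction else []

# Hypothetical log lines used as input records.
lines = ["error: disk full", "ok", "error: timeout", "ok"]
errors = [out for line in lines for out in grep_mapper(line, r"^error")]

rng = random.Random(42)  # seeded only to make the run repeatable
sample = [out for line in lines for out in sampling_mapper(line, 0.5, rng)]
```

No reducer is needed for either filter: each record is judged on its own, which is why filtering parallelizes so easily.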

32 BLOOM FILTER
A Bloom filter is a probabilistic data structure: it tells us that an element either definitely is not in the set or may be in the set
The base data structure of a Bloom filter is a bit vector. Here's a small one we'll use to demonstrate
  [Figure: an empty bit vector; each empty cell in the table represents a bit, and the number below it is its index]
To add an element to the Bloom filter, we simply hash it a few times and set the bits in the bit vector at the indexes given by those hashes to 1
tutorial/
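A minimal Bloom filter sketch following the description above. The sizes (64 bits, 3 hashes) and the way the hash functions are derived from SHA-256 with a seed prefix are assumptions chosen for the illustration, not what HBase or any particular library does.

```python
import hashlib

class BloomFilter:
    """Bit vector plus a few hash functions; answers 'definitely not' or 'maybe'."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _indexes(self, item):
        # Derive num_hashes different indexes by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set the bit at every index produced by the hashes."""
        for i in self._indexes(item):
            self.bits[i] = 1

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[i] for i in self._indexes(item))

# Hypothetical hot list of users, then a Bloom-style record filter.
hot_users = BloomFilter()
for user in ["alice", "bob"]:
    hot_users.add(user)

records = [("alice", "post 1"), ("bob", "post 2")]
kept = [r for r in records if hot_users.might_contain(r[0])]
```

A user that was never added may still pass the check (a false positive), but a user that was added is never rejected, so the filter is safe as a prefilter before an exact membership test.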

33 A FORM OF OPTIMIZATION FOR ACCESSING HBASE
SEMI-HANDS ON (SEE EXERCISE 4)

34 REFINEMENTS: BLOOM FILTERS
A read operation has to read from disk when the desired SSTable isn't in memory
Reduce the number of accesses by specifying a Bloom filter
  Allows us to ask whether an SSTable might contain data for a specified row/column pair
  A small amount of memory for Bloom filters drastically reduces the number of disk seeks for read operations
  Their use implies that most lookups for non-existent rows or columns do not need to touch disk


35 NOSQL DATA PROCESSING PROPERTIES
"[...] only an afterthought and could cause problems once you need to scale the system. And if it does offer scalability, does it imply specific steps to do so? The easiest solution would be to add one machine at a time, while sharded setups (especially those not supporting virtual shards) sometimes require for each shard to be increased simultaneously because each partition needs to be equally powerful."
Lars George, HBase: The Definitive Guide, O'Reilly

36 NOSQL DATA PROCESSING PROPERTIES


38 SOME BOOKS
Hadoop: The Definitive Guide, Tom White, O'Reilly, 2011
Data-Intensive Text Processing with MapReduce, Jimmy Lin, Chris Dyer, Morgan & Claypool, 2010, pages 37-65
Cloud Computing and Software Services: Theory and Techniques, Syed Ahson, Mohammad Ilyas, CRC Press
Writing and Querying MapReduce Views in CouchDB, Bradley Holt, O'Reilly, 2011, pages 5-29
NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, Pramod J. Sadalage, Martin Fowler

39 NOSQL STORES: AVAILABILITY AND PERFORMANCE

40 REPLICATION: MASTER-SLAVE
Makes one node the authoritative copy (replica) that handles writes, while the other replicas synchronize with the master and may handle reads
  Helps with read scalability but does not help with write scalability
  Read resilience: should the master fail, slaves can still handle read requests
  Master failure eliminates the ability to handle writes until either the master is restored or a new master is appointed
  The master is a bottleneck and a point of failure
  Biggest complication is consistency
When all replicas have the same weight
  Replicas can all accept writes
  The loss of one of them does not prevent access to the data store
  Possible write-write conflict: attempts to update the same record at the same time from two different places
[Figure: master and slaves. All updates are made to the master; changes propagate to slaves; reads can be done from master or slaves]

41 MASTER-SLAVE REPLICATION MANAGEMENT
Masters can be appointed
  Manually, when configuring the cluster of nodes
  Automatically: when configuring a cluster of nodes, one of them is elected as master. A new master can be appointed when the master fails, reducing downtime
Read resilience
  Read and write paths have to be managed separately, so that reads can still occur when the write path fails
  Reads and writes are put on different database connections, if the database library accepts it
Replication inevitably comes with a dark side: inconsistency
  Different clients reading from different slaves will see different values if changes have not been propagated to all slaves
  In the worst case, a client cannot read a write it has just made
  Even if master-slave replication is used only for hot backups, if the master fails any updates not passed on to the backup are lost

42 REPLICATION: PEER-TO-PEER
Allows writes to any node; the nodes coordinate to synchronize their copies
  The replicas have equal weight
Deals with inconsistencies
  Replicas coordinate to avoid conflicts
    Network traffic cost for coordinating writes
    It is unnecessary to make all replicas agree to a write, only the majority
    Survives the loss of a minority of the replica nodes
  Policy to merge inconsistent writes
    Full performance on writing to any replica
[Figure: peer nodes. Nodes communicate their writes; all nodes read and write all data]
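The majority rule on this slide comes down to simple arithmetic, sketched below. The helper names are hypothetical and the sketch ignores how acknowledgements are actually collected over the network.

```python
def majority(num_replicas):
    """Smallest number of replicas that forms a strict majority."""
    return num_replicas // 2 + 1

def write_succeeds(acks, num_replicas):
    """A write is accepted once a majority of replicas has acknowledged it."""
    return acks >= majority(num_replicas)

def tolerated_failures(num_replicas):
    """Replicas that can be lost while a majority can still be reached."""
    return num_replicas - majority(num_replicas)

# With 5 replicas, 3 acknowledgements are enough and 2 nodes may be down.
quorum_of_five = majority(5)
survivable = tolerated_failures(5)
```

Note that going from 3 to 4 replicas does not increase the number of failures survived (1 in both cases), which is one reason odd cluster sizes are common.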

43 REPLICATION: ASPECTS TO CONSIDER
Conditioning
  Performance
  Fault tolerance
  Transparency levels
  Availability
Important elements to consider
  Data to duplicate
  Copies location
  Duplication model (master-slave / P2P)
  Consistency model (global copies)
→ Find a compromise!

44 SHARDING
Puts different data on separate nodes
  Each user only talks to one server, so she gets rapid responses
  The load should be balanced out nicely between servers
Ensure that
  data that is accessed together is clumped together on the same node
  clumps are arranged on the nodes to provide the best data access
Ability to distribute both data and the load of simple operations over many servers, with no RAM or disk shared among servers
A way to horizontally scale writes
Improves read performance
Application/data store support
[Figure: each shard reads and writes its own data]

45 SHARDING
Database laws
  Small databases are fast
  Big databases are slow
  Keep databases small
Principle
  Start with a big monolithic database
  Break it into smaller databases
    Across many clusters
    Using a key value
Instead of having one million customers' information on a single big machine, customers are spread over smaller, different machines

46 SHARDING CRITERIA
Partitioning
  Relational: handled by the DBMS (homogeneous DBMS)
  NoSQL: based on ranges of the key value
Federation
  Relational
    Combine tables stored in different physical databases
    Easier with denormalized data
  NoSQL
    Store together data that are accessed together
    Aggregates are the unit of distribution


47 SHARDING
Architecture
  Each application server (AS) runs a DBS client
  Each shard server runs
    a database server
    replication agents and query agents for supporting parallel query functionality
Process
  Pick a dimension that makes sharding easy (customers, countries, addresses)
  Pick strategies that will last a long time, as repartitioning/re-sharding of data is operationally difficult
  This is done according to two different principles
    Partitioning: a partition is a structure that divides a space into two parts
    Federation: a set of things that together compose a centralized unit but each individually maintains some aspect of autonomy
Customer data is partitioned by ID into shards, using an algorithm to determine which shard a customer ID belongs to


49 PARTITIONING
A PARTITION IS A STRUCTURE THAT DIVIDES A SPACE INTO TWO PARTS

50 BACKGROUND: DISTRIBUTED RELATIONAL DATABASES
External schemas (views) are often subsets of relations (contacts in Europe and America)
Access is defined on subsets of relations: 80% of the queries issued in a region have to do with contacts of that region
Partitioning relations
  Better concurrency level
  Fragments are accessed independently
Implications
  Check integrity constraints
  Rebuild relations

51 FRAGMENTATION
Horizontal
  Groups of tuples of the same relation
  Budget < or >=
Vertical
  Groups of attributes of the same relation
  Separate budget from loc and pname of the relation project
  Fragments that are not disjoint are more difficult to manage
Hybrid

52 FRAGMENTATION: RULES
Vertical
  Clustering: grouping elementary fragments
    Budget and location information in two relations
  Splitting: decomposing a relation according to affinity relationships among attributes
Horizontal
  Tuples of the same fragment must be statistically homogeneous
    If t1 and t2 are tuples of the same fragment, then t1 and t2 have the same probability of being selected by a query
  Keep important conditions
    Complete: every tuple (attribute) belongs to a fragment (without information loss)
      If tuples where budget >= are more likely to be selected, then it is a good candidate
    Minimum: if no application distinguishes between budget >= and budget <, then these conditions are unnecessary

53 SHARDING: HORIZONTAL PARTITIONING
The entities of a database are split into two or more sets (by row)
In relational: same schema, several physical bases/servers
  Partition contacts into Europe and America shards, where their zip code indicates where they will be found
  Efficient if there exists some robust and implicit way to identify in which partition to find a particular entity
Last resort shard
  Needs a sharding function: modulo, round robin, hash partition, range partition
[Figure: load balancer; Web 1 to Web 3; Cache 1 to Cache 3; two MySQL masters, one for odd IDs and one for even IDs, each with MySQL slaves 1 to n]
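Three of the sharding functions named above can be sketched as follows. This is an illustration under assumptions: integer customer IDs for modulo and range partitioning, string keys for hash partitioning, CRC32 as the hash, and made-up function names.

```python
import bisect
import zlib

def modulo_shard(customer_id, num_shards):
    """Modulo: with 2 shards this is the odd IDs / even IDs split."""
    return customer_id % num_shards

def hash_shard(key, num_shards):
    """Hash partition: spreads string keys evenly, but loses key order."""
    return zlib.crc32(key.encode("utf-8")) % num_shards

def range_shard(key, upper_bounds):
    """Range partition: upper_bounds are sorted, exclusive upper limits.

    Keys at or above the last bound fall into one extra, final shard.
    """
    return bisect.bisect_right(upper_bounds, key)

# Hypothetical layout: IDs below 1000, 1000 to 1999, and 2000 or more.
bounds = [1000, 2000]
placements = [range_shard(cid, bounds) for cid in (999, 1000, 2500)]
```

Range partitioning keeps neighbouring keys on the same shard, so range queries touch few nodes; modulo and hash partitioning balance load better but scatter ranges across all shards.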

54 FEDERATION
A FEDERATION IS A SET OF THINGS THAT TOGETHER COMPOSE A CENTRALIZED UNIT BUT EACH INDIVIDUALLY MAINTAINS SOME ASPECT OF AUTONOMY


55 FEDERATION: VERTICAL SHARDING
Principle
  Partition data according to their logical affiliation
  Put together data that are commonly accessed
The search load for the large partitioned entity can be split across multiple servers (logical and physical), and not only according to multiple indexes in the same logical server
Different schemas, systems, and physical bases/servers
Shards the components of a site and not only the data
[Figure: load balancer; Web 1 to Web 3; Cache 1 to Cache 3; internal user; a MySQL master with slaves for the resume database and a MySQL master with slaves 1 to n for the site database]

56 NOSQL STORES: PERSISTENCY MANAGEMENT

57 «MEMCACHED»
«memcached» is a memory management protocol based on a cache:
  Uses the key-value notion
  Information is completely stored in RAM
«memcached» protocol for:
  Creating, retrieving, updating, and deleting information from the database
  Applications with their own «memcached» manager (Google, Facebook, YouTube, FarmVille, Twitter, Wikipedia)

58 STORAGE ON DISC (1)
For efficiency reasons, information is stored in RAM:
  Working information is kept in RAM in order to answer requests with low latency
  Yet, this is not always possible or desirable
→ The process of moving data from RAM to disc is called "eviction"; this process is configured automatically for every bucket

59 STORAGE ON DISC (2)
NoSQL servers support the storage of key-value pairs on disc:
  Persistency: a server can be shut down and reinitialized without having to load its data from another source
  Hot backups: loaded data are stored on disc so that the server can be reinitialized in case of failure
  Storage on disc: the disc is used when the quantity of data is larger than the physical size of the RAM; frequently used information is kept in RAM and the rest is stored on disc

60 STORAGE ON DISC (3)
Strategies for ensuring:
  Each node maintains in RAM information on the key-value pairs it stores. Keys may not be found, or they can be stored in memory or on disc
  The process of moving information from RAM to disc is asynchronous:
    The server can continue processing new requests
    A queue manages requests to disc
→ In periods with a lot of write requests, clients can be notified that the server is temporarily out of memory until information is evicted

61 NOSQL STORES: CONCURRENCY CONTROL

62 MULTI VERSION CONCURRENCY CONTROL (MVCC)
Objective: provide concurrent access to the database; also used in programming languages to implement transactional memory
Problem: if someone is reading from a database at the same time as someone else is writing to it, the reader could see a half-written or inconsistent piece of data
  Lock: readers wait until the writer is done
  MVCC:
    Each user connected to the database sees a snapshot of the database at a particular instant in time
    Any changes made by a writer will not be seen by other users until the changes have been completed (until the transaction has been committed)
When an MVCC database needs to update an item of data, it marks the old data as obsolete and adds the newer version elsewhere → multiple versions are stored, but only one is the latest
Writes can be isolated by virtue of the old versions being maintained
Requires (generally) the system to periodically sweep through and delete the old, obsolete data objects
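A toy in-memory sketch of the MVCC idea described above: writers append new versions, readers see the latest version committed before their snapshot, and a periodic sweep drops obsolete versions. The class and method names are invented for the illustration, each write is treated as its own committed transaction, and no real database's implementation is implied.

```python
class MVCCStore:
    """Keeps every committed version of a key, tagged with a commit timestamp."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), oldest first
        self._clock = 0      # logical commit counter

    def snapshot(self):
        """A reader's snapshot is simply the current commit timestamp."""
        return self._clock

    def write(self, key, value):
        """Add a newer version elsewhere instead of overwriting the old one."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def read(self, key, snapshot_ts):
        """Return the latest version committed at or before the snapshot."""
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None

    def vacuum(self, oldest_active_snapshot):
        """Sweep: drop versions that no active snapshot can still see."""
        for key, versions in self._versions.items():
            visible = [v for v in versions if v[0] <= oldest_active_snapshot]
            newer = [v for v in versions if v[0] > oldest_active_snapshot]
            self._versions[key] = visible[-1:] + newer

store = MVCCStore()
store.write("x", "a")
reader_snapshot = store.snapshot()   # a reader starts here
store.write("x", "b")                # a concurrent writer commits later
seen_by_reader = store.read("x", reader_snapshot)
seen_by_new_reader = store.read("x", store.snapshot())
```

The reader that started before the second write keeps seeing "a" without waiting on any lock, while a reader that starts afterwards sees "b".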
