Big Data Processing Technologies Chentao Wu Associate Professor Dept. of Computer Science and Engineering wuct@cs.sjtu.edu.cn
Schedule (1) Storage system part (first eight weeks) lec1: Introduction to big data and cloud computing lec2: Introduction to data storage lec3: Data reliability (Replication/Archive/EC) lec4: Data consistency problem lec5: Block level storage and file storage lec6: Object-based storage lec7: Distributed file system lec8: Metadata management
Schedule (2) Reading & Project part (middle two/three weeks) Database part (last five weeks) lec9: Introduction to databases lec10: Relational database (SQL) lec11: Non-relational database (NoSQL) lec12: Distributed database lec13: Main memory database
Collaborators
Distributed vs. Parallel? Parallel DBMSs Shared-memory Shared-disk Shared-nothing Distributed is basically shared-nothing parallel Perhaps with a slower network
What's Special About Distributed Computing? Parallel computation No shared memory/disk Unreliable networks Delay, reordering, loss of packets Unsynchronized clocks Impossible to have perfect synchrony Partial failure: can't know what's up, what's down "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." (Leslie Lamport, Turing Award 2013)
Distributed Database Systems DBMSs: an influential special case of distributed computing The trickiest part of distributed computing is state, i.e., data Transactions provide an influential model for concurrency/parallelism DBMSs worried about fault handling early on A special case, because not all programs are written transactionally, and if not, database techniques may not apply Many of today's most complex distributed systems are databases Cloud SQL databases like Spanner, Aurora, Azure SQL NoSQL databases like DynamoDB, Cassandra, MongoDB, Couchbase We'll focus on concurrency control and recovery You already know many lessons of distributed query processing
Distributed Concurrency Control Consider a shared-nothing or distributed DBMS For today, assume partitioning but no replication of data Each transaction arrives at some node, which becomes the coordinator for that transaction T1
Where Does the Lock Table Go? Typical design: locks partitioned with the data Independent: each node manages its own lock table Works for objects that fit on one node (pages, tuples) For coarser-grained locks, assign a home node, since the object being locked (table, DB) exists across nodes (e.g., the Sailors, Boats, and Reserves tables) The home node can be hash-partitioned or centralized at a master node (a hash-assignment sketch follows below)
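To make the home-node idea concrete, here is a minimal sketch of hash-partitioned lock placement; the class and method names are hypothetical, not from any particular system:

```java
// Hypothetical sketch: assigning a home node for a coarse-grained lock
// by hashing the lock name, as in the hash-partitioned design above.
import java.util.List;

public class LockHomeLocator {
    private final List<String> nodes; // e.g. ["node0", "node1", "node2"]

    public LockHomeLocator(List<String> nodes) {
        this.nodes = nodes;
    }

    // The node returned here manages the lock-table entry for this object,
    // no matter which node coordinates the requesting transaction.
    public String homeNodeFor(String lockName) {
        int h = lockName.hashCode();
        int idx = Math.floorMod(h, nodes.size()); // floorMod avoids negative indices
        return nodes.get(idx);
    }
}
```

With this scheme, a table-level lock on Sailors is always requested from the same node, regardless of where the transaction's coordinator runs.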
Distributed Voting? How? Vote for commitment: how many votes does a commit need to win? ALL of them (unanimous!) How do we implement distributed voting in the face of message/node failure/delay?
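The standard answer is two-phase commit (2PC). Below is a hedged, coordinator-side sketch of the unanimous voting rule; sendPrepare/sendDecision stand in for real RPCs and are assumptions, and a real coordinator would also force-log its decision before phase 2:

```java
// Hypothetical sketch of the voting phase of two-phase commit (2PC).
import java.util.List;

interface Participant {
    boolean sendPrepare(long txnId);                 // true = YES vote
    void sendDecision(long txnId, boolean commit);   // broadcast outcome
}

public class TwoPhaseCommitCoordinator {
    public boolean runCommit(long txnId, List<Participant> participants) {
        boolean allYes = true;
        // Phase 1: collect votes; a commit needs a unanimous YES.
        for (Participant p : participants) {
            try {
                if (!p.sendPrepare(txnId)) { allYes = false; break; }
            } catch (RuntimeException timeoutOrFailure) {
                allYes = false; // a lost or delayed vote counts as NO
                break;
            }
        }
        // Phase 2: tell everyone the decision.
        for (Participant p : participants) {
            p.sendDecision(txnId, allYes);
        }
        return allYes;
    }
}
```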
Distributed Database: Hadoop HBase Google's BigTable was the first blob-based storage system Yahoo! open-sourced it -> HBase Major Apache project today Facebook uses HBase internally API: Get/Put(row) Scan(row range, filter) for range queries MultiPut
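A minimal sketch of this Get/Put/Scan API using the HBase 2.x Java client; the table name "census" and the column names are placeholders echoing the HFile example later in this deck:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseApiDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("census"))) {

            // Put(row): write one cell.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("Demographic"), Bytes.toBytes("Ethnicity"),
                          Bytes.toBytes("placeholder-value"));
            table.put(put);

            // Get(row): read the row back.
            Result r = table.get(new Get(Bytes.toBytes("row1")));

            // Scan(row range): range query over a key interval.
            Scan scan = new Scan().withStartRow(Bytes.toBytes("row1"))
                                  .withStopRow(Bytes.toBytes("row9"));
            try (ResultScanner rs = table.getScanner(scan)) {
                for (Result row : rs) { /* process each row */ }
            }
        }
    }
}
```

MultiPut corresponds to the batched overload Table.put(List&lt;Put&gt;).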
HBase Architecture ZooKeeper: a small group of servers running Zab, a Paxos-like consensus protocol Storage layer: HDFS
HBase Storage Hierarchy HBase table: split into multiple regions, replicated across servers One Store per ColumnFamily (a subset of columns with similar query patterns) per region MemStore for each Store: in-memory updates to the Store, flushed to disk when full StoreFiles for each Store for each region: where the data lives, as blocks in an HFile (HBase's version of the SSTable from Google's BigTable)
HFile layout (census table example: row key SSN:000-00-0000, column family Demographic, column Ethnicity)
Strong Consistency: HBase Write-Ahead Log Write to HLog before writing to MemStore Can recover from failure
Log Replay After recovery from failure, or upon bootup (HRegionServer/HMaster) Replay any stale logs (use timestamps to find out where the database is w.r.t. the logs) Replay: add edits to the MemStore Why one HLog per HRegionServer rather than per region? Avoids many concurrent writes, which on the local file system may involve many disk seeks
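As a hedged illustration of the replay rule (all class names here are stand-ins, not real HBase classes): only edits newer than what the flushed StoreFiles already reflect are re-applied to the MemStore.

```java
// Hypothetical sketch of WAL replay: re-apply only edits newer than the
// last flushed sequence id; older (stale) log entries are skipped.
import java.util.List;

class LogEdit { long seqId; byte[] row; byte[] value; }

class MemStoreStub {
    void apply(LogEdit e) { /* insert the edit into the in-memory store */ }
}

public class WalReplay {
    // lastFlushedSeqId: highest sequence id already persisted in StoreFiles.
    static void replay(List<LogEdit> hlog, long lastFlushedSeqId, MemStoreStub mem) {
        for (LogEdit e : hlog) {
            if (e.seqId > lastFlushedSeqId) {
                mem.apply(e); // edit is newer than the on-disk state: redo it
            }
        }
    }
}
```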
Cross-data center replication ZooKeeper: effectively a file system for control information, with znodes such as: 1. /hbase/replication/state 2. /hbase/replication/peers/<peer cluster number> 3. /hbase/replication/rs/<hlog>
Amazon DynamoDB Scalable: Dynamo architecture Reliable: replicas over multiple data centers Fast: single-digit millisecond latency Secure Weak (flexible) schema
Data Model Table: a container, similar to a worksheet in Excel; cannot query across domains Item: item name -> (attribute, value) pairs An item is stored in a domain (a row in the worksheet; attributes are column names) Example domain: cars Item 1: car1: { make: "BMW", year: 2009 }
Partition keys
Primary key of a table Single key (hash) Hash-range key: a pair of attributes, where the first is the hash key and the second is the range key Example: Reply(Id, datetime, ...) Data types Simple: string and number Multi-valued: string set and number set
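As a sketch, a hash-range table like the Reply example can be declared with the AWS SDK for Java (v1); the throughput numbers are arbitrary placeholders:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;

public class CreateReplyTable {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        CreateTableRequest req = new CreateTableRequest()
            .withTableName("Reply")
            .withAttributeDefinitions(
                new AttributeDefinition("Id", ScalarAttributeType.S),
                new AttributeDefinition("datetime", ScalarAttributeType.S))
            .withKeySchema(
                new KeySchemaElement("Id", KeyType.HASH),        // hash (partition) key
                new KeySchemaElement("datetime", KeyType.RANGE)) // range (sort) key
            .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L));

        client.createTable(req);
    }
}
```

With this key schema, items sharing the same Id are stored together and can be range-queried by datetime.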
Access methods Amazon DynamoDB is a web service that uses HTTP and HTTPS as the transport and JavaScript Object Notation (JSON) as the message serialization format APIs: Java, PHP, .NET, Boto (Python)
CloudFront For content delivery: distributes content to end users via a global network of edge locations "Edges": servers close to the user's geographical location Objects are organized into distributions Each distribution has a domain name Distributions are stored in an S3 bucket
Use cases Hosting your most frequently accessed website components: small pieces of your website are cached in the edge locations and are ideal for Amazon CloudFront Distributing software: distribute applications, updates, or other downloadable software to end users Publishing popular media files: if your application involves rich media (audio or video) that is frequently accessed
Simple Queue Service Store messages traveling between computers Make it easy to build automated workflows Implemented as a web service read/add messages easily Scalable to millions of messages a day
Some features Message body: < 8 KB, in any format Messages are retained in queues for up to 4 days Messages can be sent and read simultaneously A message being processed can be locked, preventing simultaneous processing Accessible with SOAP/REST Simple: only a few methods Secure sharing
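A minimal sketch of these operations with the AWS SDK for Java (v1); the queue name is a placeholder, and the visibility timeout implements the "locking" feature mentioned above:

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class SqsDemo {
    public static void main(String[] args) {
        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

        // Queue name is a placeholder; createQueue is idempotent per name.
        String queueUrl = sqs.createQueue("demo-queue").getQueueUrl();

        sqs.sendMessage(queueUrl, "hello from producer");

        // The visibility timeout "locks" a received message, keeping other
        // consumers from processing it at the same time.
        ReceiveMessageRequest req = new ReceiveMessageRequest(queueUrl)
            .withMaxNumberOfMessages(1)
            .withVisibilityTimeout(30); // seconds
        for (Message m : sqs.receiveMessage(req).getMessages()) {
            System.out.println(m.getBody());
            sqs.deleteMessage(queueUrl, m.getReceiptHandle()); // acknowledge by deleting
        }
    }
}
```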
A typical workflow
Workflow with AWS
Conclusion: A new horizontally scalable distributed key-value store complying with stringent performance requirements was developed Dynamo was more transparent about how the system worked, rather than being a black box, in comparison to a relational database Application developers had more flexibility and control over the system, tuning parameters to best suit the needs of their application Emphasis on the increasing importance of availability and performance over consistency
Google: BigTable Introduction Development began in 2004 at Google (published 2006) A need to store/handle large amounts of (semi-)structured data Many Google projects store data in BigTable
Goals of BigTable: Asynchronous processing across continuously evolving data, petabytes in size High volume of concurrent reading/writing spanning many CPUs Need the ability to conduct analysis across many subsets of data Temporal analysis (e.g., how do anchors or content change over time?) Can work well with many
BigTable in a Nutshell Distributed multi-level map Fault-tolerant Scalable Thousands of servers Terabytes of memory-based data Petabytes of disk-based data Millions of reads/writes per second Self-managing Dynamic server management
Building Blocks Google File System is used for BigTable's storage Scheduler assigns jobs across many CPUs and watches for failures Lock service: distributed lock manager MapReduce is often used to read/write data to BigTable BigTable can be an input or output
Data Model Example: web indexing A (semi) three-dimensional data cube: input (row, column, timestamp) -> output (cell contents)
Data model components: rows, columns, cells, timestamps Columns are grouped into column families, and a column is named as family:qualifier
Data Model - Timestamps Used to store different versions of data in a cell New writes default to the current time Lookup options: return the most recent K values, or return all values in a timestamp range
System Structure A Bigtable cell consists of a Bigtable master, which performs metadata ops and load balancing, and many Bigtable tablet servers, each of which serves data Bigtable clients use a client library: Open() and metadata ops go to the master, while reads/writes go directly to tablet servers Supporting services: a cluster scheduling system (handles failover and monitoring), GFS (holds tablet data and logs), and a lock service (holds metadata and handles master election)
Locating Tablets Metadata for tablet locations and start/end rows is stored in a special Bigtable cell
Reading/Writing to Tablets Write commands: each write command is first put into a queue/log of commands for that tablet Data is written to GFS, and when the write command is committed, the queue is updated The write is mirrored in the tablet's buffer memory Read commands: must combine the buffered commands not yet committed with the data in GFS
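A hedged sketch of the read-side merge just described, with stand-in classes (not BigTable's real code): the tablet's in-memory buffer is consulted before the committed data in GFS, so buffered writes shadow older values.

```java
// Hypothetical sketch of a tablet read: merge the in-memory write buffer
// (memtable) with the immutable on-disk data in GFS; buffered writes win.
import java.util.Map;
import java.util.TreeMap;

public class TabletReader {
    private final Map<String, String> memtable = new TreeMap<>(); // recent writes
    private final Map<String, String> gfsData  = new TreeMap<>(); // committed data

    public String read(String key) {
        String buffered = memtable.get(key);
        if (buffered != null) {
            return buffered; // a not-yet-flushed write shadows older data
        }
        return gfsData.get(key);
    }
}
```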
API Metadata operations: create and delete tables and column families, change metadata Writes (atomic): Set() writes cells in a row, DeleteCells() deletes cells in a row, DeleteRow() deletes all cells in a row Reads: a Scanner reads arbitrary cells in BigTable Each row read is atomic Can restrict returned rows to a particular range Can ask for just data from one row, all rows, a subset of rows, etc. Can ask for all columns, just certain column families, or specific columns
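BigTable's actual client library is C++ and is not public; the following is a hypothetical Java-flavored sketch of the API shape listed above, with every name a stand-in:

```java
// Hypothetical Java-flavored sketch of the BigTable API shape described
// above; the real client library is C++ and these interfaces are stand-ins.
public class BigTableApiSketch {
    interface Table {
        // Writes to a single row are atomic.
        void set(String row, String family, String qualifier, byte[] value);
        void deleteCells(String row, String family, String qualifier);
        void deleteRow(String row);
        // Reads: a scanner over an arbitrary row range; each row read is
        // atomic, and results can be restricted to certain column families.
        Scanner scan(String startRow, String stopRow, String... columnFamilies);
    }
    interface Scanner extends Iterable<RowResult> {}
    interface RowResult {
        byte[] cell(String family, String qualifier);
    }
}
```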
Compression Low-CPU-cost compression techniques are adopted Compression is applied across each SSTable for a locality group Uses BMDiff and Zippy as building blocks of compression Keys: sorted strings of (row, column, timestamp) Values: grouped by type/column family name BMDiff across all values in one family Zippy as a final pass over a whole block Catches more localized repetitions Also catches cross-column-family repetition Compression by a factor of about 10 in empirical results
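Zippy was later open-sourced as Snappy; as an illustration of the cheap final pass only (BMDiff has no public implementation, so it is omitted), here is a minimal sketch assuming the org.xerial:snappy-java dependency:

```java
// Minimal sketch of the Zippy/Snappy final compression pass over a block,
// using the open-source snappy-java library. The preceding BMDiff pass has
// no public implementation and is not shown.
import org.xerial.snappy.Snappy;

public class BlockCompression {
    public static void main(String[] args) throws Exception {
        byte[] block = "row,column,timestamp -> value ...".getBytes("UTF-8");
        byte[] compressed = Snappy.compress(block);     // fast, low CPU cost
        byte[] restored = Snappy.uncompress(compressed);
        System.out.println(new String(restored, "UTF-8"));
    }
}
```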
Reference:
https://aws.amazon.com/cn/documentation/dynamodb/
https://aws.amazon.com/cn/dynamodb/
https://en.wikipedia.org/wiki/Amazon_DynamoDB
Chang F., Dean J., Ghemawat S., et al. Bigtable: A Distributed Storage System for Structured Data. In: Symposium on Operating Systems Design and Implementation (OSDI). USENIX Association, 2006: 205-218.
MapReduce/Bigtable for Distributed Optimization
https://pt.wikipedia.org/wiki/DRBD
https://hbase.apache.org/apache_hbase_reference_guide.pdf
https://phoenix.apache.org/presentations/oc-hug-2014-10-4x3.pdf
https://www.cloudera.com/documentation/enterprise/5-9-x/pdf/clouderahbase.pdf
www.cs.utexas.edu/~dsb/cs386d/projects14/hbase.pdf
https://openproceedings.org/2016/conf/edbt/paper-298.pdf
https://d0.awsstatic.com/whitepapers/cassandra_on_aws.pdf
https://www.tutorialspoint.com/cassandra/cassandra_tutorial.pdf
Thank you!