Vineet Gupta GM Software Engineering Directi

1 Intelligent People. Uncommon Ideas. Vineet Gupta GM Software Engineering Directi Licensed under Creative Commons Attribution Sharealike Noncommercial

2 Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types

3 Offline Processing (Batching / Queuing) Distributed Processing Map Reduce Non-blocking IO Fault Detection, Tolerance and Recovery

4 Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types

5 22M+ users; Dozens of DB servers; Dozens of Web servers; Six specialized graph database servers to run recommendations engine. Source:

6 1 TB / day; 100 M blogs indexed / day; 10 B objects indexed / day; 0.5 B photos and videos; Data doubles in 6 months; Users double in 6 months. Source:

7 2 PB raw storage; 470 M photos, 4-5 sizes each; 400 k photos added / day; 35 M photos in Squid cache (total); 2 M photos in Squid RAM; 38k reqs / sec to Memcached; 4 B queries / day. Source:

8 Virtualized database spans 600 production instances residing in 100+ server clusters distributed over 8 datacenters; 2 PB of data; 26 B SQL queries / day; 1 B page views / day; 3 B API calls / month; 15,000 App servers. Source:

9 450,000 low cost commodity servers in 2006; Indexed 8 B web-pages in 2005; 200 GFS clusters (1 cluster = 1,000 to 5,000 machines); Read / write throughput = 40 GB / sec across a cluster; Map-Reduce: 100k jobs / day, 20 PB of data processed / day, 10k MapReduce programs. Source:

10 Data Size: ~ PB
Data Growth: ~ TB / day
No. of servers: 10s to 10,000
No. of datacenters: 1 to 10
Queries: B+ / day
Specialized needs: more / other than RDBMS

11 Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types

12 CPU CPU CPU RAM RAM RAM App Server DB Server Host

13 Sunfire E20k: 36x 1.8 GHz processors, $450,000 - $2,500,000
PowerEdge SC1435: Dualcore 1.8 GHz processor, around $1,500

14 Increasing the hardware resources on a host
Pros: Simple to implement; Fast turnaround time
Cons: Finite limit; Hardware does not scale linearly (diminishing returns for each incremental unit); Requires downtime; Increases downtime impact; Incremental costs increase exponentially

15 App Server DB Server Host Host

16 Split services on separate nodes; each node performs different tasks
Pros: Increases per-application availability; Task-based specialization, optimization and tuning possible; Reduces context switching; Simple to implement for out-of-band processes; No changes to App required; Flexibility increases
Cons: Sub-optimal resource utilization; May not increase overall availability; Finite scalability

17 Web Server Load Balancer Web Server DB Server Web Server

18 Add more nodes for the same service: identical, doing the same task
Load Balancing: Hardware balancers are faster; Software balancers are more customizable

19 Web Server User 1 User 2 Load Balancer Web Server DB Server Web Server

20 Web Server User 1 User 2 Load Balancer Web Server DB Server Asymmetrical load distribution Downtime Web Server

21 Web Server User 1 User 2 Load Balancer Web Server Session Store SPOF Reads and Writes generate network + disk IO Web Server

22 User 1 User 2 Load Balancer Web Server Web Server Web Server

23 Pros: No SPOF; Easier to set up; Fast reads
Cons: n x Writes; Increase in network IO with increase in nodes; Stale data (rare)

24 Web Server User 1 User 2 Load Balancer Web Server DB Server Web Server

25 No Sessions: Stuff state in a cookie and sign it! Cookie is sent with every request / response
Super Slim Sessions: Keep small amount of frequently used data in cookie; pull rest from DB (or central session store)
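A minimal sketch of "stuff state in a cookie and sign it", assuming an HMAC-SHA256 signature over a base64-encoded JSON payload. The names `SECRET_KEY`, `encode_cookie` and `decode_cookie` are hypothetical and not from the deck; a real deployment would also want expiry and key rotation. Note that signing prevents tampering but does not hide the contents from the user.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical secret; it must stay on the servers and never reach the client.
SECRET_KEY = b"replace-with-a-real-secret"


def encode_cookie(state: dict) -> str:
    """Serialize the state and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(state, sort_keys=True).encode())
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def decode_cookie(cookie: str):
    """Return the state if the signature is valid, otherwise None."""
    payload, _, signature = cookie.rpartition(".")
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload.encode()))
```

Because any web server holding the key can verify the cookie, no server needs to remember the user, which is what lets the load balancer send each request to any node.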

26 Bad: Sticky sessions
Good: Clustered sessions for small number of nodes and / or small write volume; Central sessions for large number of nodes or large write volume
Great: No Sessions!

27 HTTP Accelerators / Reverse Proxy: Static content caching, redirect to lighter HTTP; Async NIO on user-side, keep-alive connection pool
CDN: Get closer to your user; Akamai, Limelight
IP Anycasting
Async NIO

28 App-Layer: Add more nodes and load balance! Avoid Sticky Sessions; Avoid Sessions!!
Data Store: Tricky! Very Tricky!!!

29 Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types

30 App Layer T1, T2, T3, T4

31 App Layer T1, T2, T3, T4 T1, T2, T3, T4 T1, T2, T3, T4 T1, T2, T3, T4 T1, T2, T3, T4 Each node has its own copy of data Shared Nothing Cluster

32 Read : Write = 4:1; scale reads at cost of writes!
Duplicate Data: each node has its own copy
Master Slave: Writes sent to one node, cascaded to others
Multi-Master: Writes can be sent to multiple nodes; Can lead to deadlocks; Requires conflict management

33 App Layer Master Slave Slave Slave Slave n x Writes Async vs. Sync SPOF Async - Critical Reads from Master!

34 App Layer Master Master Slave Slave Slave n x Writes Async vs. Sync No SPOF Conflicts!

35 Asynchronous: Guaranteed, but out-of-band replication from Master to Slave. Master updates its own db and returns a response to client; Replication from Master to Slave takes place asynchronously; Faster response to a client; Slave data is marginally behind the Master; Requires modification to App to send critical reads and writes to master, and load balance all other reads
Synchronous: Guaranteed, in-band replication from Master to Slave. Master updates its own db, and confirms all slaves have updated their db before returning a response to client; Slower response to a client; Slaves have the same data as the Master at all times; Requires modification to App to send writes to master and load balance all reads

36 Replication at RDBMS level: Support may exist in the RDBMS or through a 3rd party tool; Faster and more reliable; App must send writes to Master, reads to any db and critical reads to Master
Replication at Driver / DAO level: Driver / DAO layer ensures writes are performed on all connected DBs; Reads are load balanced; Critical reads are sent to a Master; In most cases RDBMS agnostic; Slower and in some cases less reliable
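The routing rule the app has to follow under asynchronous master-slave replication (writes and critical reads to the master, all other reads load balanced) can be sketched as below. The class `ReplicatedRouter` and its node names are hypothetical; the round-robin choice of slave is an assumption, since the deck does not say how reads are balanced.

```python
import itertools


class ReplicatedRouter:
    """Pick a database node for each query under master-slave replication."""

    def __init__(self, master: str, slaves: list):
        self.master = master
        # Round-robin over the slaves; fall back to the master if there are none.
        self._read_cycle = itertools.cycle(slaves or [master])

    def node_for(self, is_write: bool, critical: bool = False) -> str:
        # Writes always go to the master. Critical reads do too, because an
        # asynchronously replicated slave may be slightly behind.
        if is_write or critical:
            return self.master
        return next(self._read_cycle)


router = ReplicatedRouter("master", ["slave-1", "slave-2"])
print(router.node_for(is_write=True))                  # master
print(router.node_for(is_write=False, critical=True))  # master
print(router.node_for(is_write=False))                 # slave-1
```

With synchronous replication the `critical` flag becomes unnecessary, since every slave holds the same data as the master.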

37 Per Server: 4R, 1W; 2R, 1W; 1R, 1W
[Diagram: reads are split across the servers, while every write is repeated on each server]
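The per-server figures on this slide follow from simple arithmetic once reads are divided among the replicas and every write is applied on every replica. The sketch below assumes the three figures correspond to 1, 2 and 4 nodes at the 4:1 read-to-write ratio; the node counts are inferred, not stated on the slide, and `per_server_load` is a hypothetical helper.

```python
def per_server_load(reads: int, writes: int, nodes: int):
    """Reads are spread across nodes; each write is applied on every node."""
    return reads / nodes, writes


for nodes in (1, 2, 4):
    r, w = per_server_load(reads=4, writes=1, nodes=nodes)
    print(f"{nodes} node(s): {r:g}R, {w}W per server")
```

The write load per server never drops, which is why replication scales reads but not writes.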

38 Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types

39 Vertical Partitioning: Divide data on tables / columns; Scale to as many boxes as there are tables or columns; Finite
Horizontal Partitioning: Divide data on rows; Scale to as many boxes as there are rows! Limitless scaling

40 App Layer T1, T2, T3, T4, T5 Note: A node here typically represents a shared nothing cluster

41 App Layer T1 T2 T3 T4 T5 Facebook - User table, posts table can be on separate nodes Joins need to be done in code (Why have them?)

42 App Layer First million rows T1 T2 T3 T4 T5 Second million rows T1 T2 T3 T4 T5 Third million rows T1 T2 T3 T4 T5

43 Value Based: Split on timestamp of posts; Split on first letter of user name
Hash Based: Use a hash function to determine cluster
Lookup Map: First Come First Serve; Round Robin
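The three ways of choosing a cluster for a row can be sketched as follows. All names here are hypothetical; the two-way split at the letter "m" and the use of CRC32 as the hash function are assumptions made only for illustration.

```python
import zlib


def value_partition(user_name: str) -> int:
    """Value based: split on the first letter (a-m -> 0, everything else -> 1)."""
    return 0 if user_name[0].lower() <= "m" else 1


def hash_partition(key: str, num_clusters: int) -> int:
    """Hash based: a stable hash of the key picks the cluster."""
    return zlib.crc32(key.encode("utf-8")) % num_clusters


class LookupMap:
    """Lookup map: assign new keys round robin and remember the assignment."""

    def __init__(self, num_clusters: int):
        self.num_clusters = num_clusters
        self.assignments = {}

    def cluster_for(self, key: str) -> int:
        if key not in self.assignments:
            self.assignments[key] = len(self.assignments) % self.num_clusters
        return self.assignments[key]
```

A lookup map costs an extra lookup per query but lets rows be moved between clusters later, whereas a value or hash rule fixes the placement in code.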

44 Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types

45 Consistency Availability Partition Tolerance Source:

46 Transactions make you feel alone: No one else manipulates the data when you are
Transactional serializability: The behavior is as if a serial order exists
[Diagram: Transaction Serializability. Transactions Ta to To are shown around Ti: some precede Ti, some follow Ti, and Ti doesn't know about the rest and they don't know about Ti]
Source:

47 Transactions live in the now inside services
Time marches forward: Transactions commit, advancing time; Transactions see the committed transactions
A service's biz-logic lives in the now
[Diagram: Service. Each transaction only sees a simple advancing of time with a clear set of preceding transactions]
Source:

48 Messages contain unlocked data: Assume no shared transactions; Unlocked data may change; Unlocking it allows change
Messages are not from the now: They are from the past
There is no simultaneity at a distance! Similar to speed of light: Knowledge travels at speed of light; By the time you see a distant object it may have changed! By the time you see a message, the data may have changed!
Services, transactions, and locks bound simultaneity! Inside a transaction, things appear simultaneous (to others); Simultaneity only inside a transaction! Simultaneity only inside a service!
Source:

49 All data from distant stars is from the past: 10 light years away; 10 year old knowledge. The sun may have blown up 5 minutes ago; we won't know for 3 minutes more
All data seen from a distant service is from the past: By the time you see it, it has been unlocked and may change
Each service has its own perspective: Inside data is now; outside data is past. My inside is not your inside; my outside is not your outside
This is like going from Newtonian to Einsteinian physics
Newton's time marched forward uniformly: Instant knowledge. Classic distributed computing: many systems look like one (RPC, 2-phase commit, remote method calls)
In Einstein's world, everything is relative to one's perspective. Today: No attempt to blur the boundary
Source:

50 Can't have the same data at many locations, unless it is a snapshot
Changing distributed data needs versions: Creates a snapshot
[Diagram: a Data Owning Service and Listening Partner Services 1, 5, 7 and 8 holding Monday's, Tuesday's and Wednesday's versions of the data]
Source:

51 Given what I know here and now, make a decision: Remember the versions of all the data used to make this decision; Record the decision as being predicated on these versions
Other copies of the object may make divergent decisions: Try to sort out conflicts within the family; If necessary, programmatically apologize; Very rarely, whine and fuss for human help
Subjective Consistency: Given the information I have at hand, make a decision and act on it! Remember the information at hand!
Ambassadors Had Authority: Back before radio, it could be months between communication with the king. Ambassadors would make treaties and much more... They had binding authority. The mess was sorted out later!
Source:

52 Eventually, all the copies of the object share their changes: I'll show you mine if you show me yours!
Now, apply subjective consistency: Given the information I have at hand, make a decision and act on it! Everyone has the same information, everyone comes to the same conclusion about the decisions to take
Eventual Consistency: Given the same knowledge, produce the same result! Everyone sharing their knowledge leads to the same result...
This is NOT magic; it is a design requirement! Idempotence, commutativity, and associativity of the operations (decisions made) are all implied by this requirement
Source:
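A small illustration of why idempotence, commutativity and associativity matter: if replicas merge their knowledge with an operation that has all three properties, the order in which they exchange changes does not affect the outcome. Set union is used here as one such operation; the replica contents and the `merge` helper are hypothetical and not taken from the deck.

```python
from functools import reduce
from itertools import permutations


def merge(a: frozenset, b: frozenset) -> frozenset:
    """Set union: idempotent, commutative and associative."""
    return a | b


# Three replicas that each saw a different subset of the changes.
replicas = [
    frozenset({"order-1"}),
    frozenset({"order-2"}),
    frozenset({"order-1", "order-3"}),
]

# Merge the replicas in every possible order and collect the distinct outcomes.
outcomes = {reduce(merge, order) for order in permutations(replicas)}
print(len(outcomes))  # a single outcome means every order converged
```

An operation such as "append to a list" would fail this check, because replaying or reordering messages would produce different results on different replicas.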

53 Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types

54 Normalization's Goal Is Eliminating Update Anomalies: Can Be Changed Without Funny Behavior; Each Data Item Lives in One Place
De-normalization is OK if you aren't going to update!
[Table, partly lost in extraction. Columns: Emp #, Emp Name, Emp Phone, Mgr #, Mgr Name, Mgr Phone. Surviving values: 47, Joe, Sam, Sally, Harry, Pete, Sam, Mary, Betty]
Classic problem with de-normalization: Can't update Sam's phone # since there are many copies
Source:

55 [Schema diagram; column and row alignment was lost in extraction]
affiliations table: affiliation_id, description, member_count (rows: 42, Microsoft, 18,656; 598, Georgia Tech, 23,488)
user table: user_id, first_name, last_name, sex, hometown, relationship_status, interested_in, religious_views, political_views (surviving values: John, Doe, Male, Atlanta, GA, married, women, (null), (null))
user_affiliations table: user_id (foreign_key), affiliation_id (foreign_key)
user_phone_numbers table: user_id (foreign_key), phone_number, phone_type (surviving values: Home, Work, Cell)
user_screen_names table: user_id (foreign_key), screen_name, im_service (surviving values: geeknproud@example.com, AIM; voip4life@example.org, Skype)
user_work_history table: user_id (foreign_key), company_affiliation_id (foreign_key), company_name, job_title (surviving values: Microsoft, Program Manager; i Technologies, Quality Assurance Engineer)

56 6 joins for 1 query! Do you think FB would do this? And how would you do joins with partitioned data?
De-normalization removes joins
But increases data volume (but disk is cheap and getting cheaper)
And can lead to inconsistent data if you are lazy (however this is not really an issue)

57 Many Kinds of Computing are Append-Only: Lots of observations are made about the world (debits, credits, Purchase-Orders, Customer-Change-Requests, etc.); As time moves on, more observations are added; You can't change the history but you can add new observations
Derived Results May Be Calculated: Estimate of the current inventory; Frequently inaccurate
Historic Rollups Are Calculated: Monthly bank statements

58 Transaction Logs Are the Truth: High-performance & write-only; Describe ALL the changes to the data
Data-Base the Current Opinion: Describes the latest value of the data as perceived by the application
The Database Is a Caching of the Transaction Log! It is the subset of the latest committed values represented in the transaction log
[Diagram: Log, DB]
Source:

59 [Diagram: Data Owning Service and Listening Partner Services 1, 5, 7 and 8, each holding Monday's, Tuesday's and Wednesday's versions of the data] Source:
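The claim that the database is a cache of the transaction log can be made concrete with a toy example: replaying an append-only log of changes yields the latest value per key, which is exactly what the database stores. The log entries and the `apply_log` helper are hypothetical, and the sketch assumes every entry in the log is already committed.

```python
def apply_log(log):
    """Fold an append-only log of (key, value) changes into the latest values."""
    db = {}
    for key, value in log:
        db[key] = value  # later entries supersede earlier ones
    return db


# The log keeps every change; nothing is ever updated in place.
log = [
    ("sam_phone", "555-0100"),
    ("joe_phone", "555-0101"),
    ("sam_phone", "555-0199"),
]

print(apply_log(log))
```

The log still records that Sam's number used to be different, while the derived table only holds the current opinion.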

60 Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types

61 Makes scaling easier (cheaper)
Core Idea: Read data from persistent store into memory; Store in a hash-table; Read first from cache, if not, load from persistent store
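The core idea can be sketched in a few lines: look in an in-memory hash table first and only go to the persistent store on a miss. `ReadThroughCache` and the `load_from_store` callback are hypothetical names; eviction and invalidation are deliberately left out.

```python
class ReadThroughCache:
    """Read from memory first; on a miss, load from the persistent store."""

    def __init__(self, load_from_store):
        self._load = load_from_store  # callback that hits the persistent store
        self._table = {}              # the in-memory hash table
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._table:
            self.hits += 1
            return self._table[key]
        self.misses += 1
        value = self._load(key)
        self._table[key] = value      # keep it for the next reader
        return value


# A dict stands in for the database in this example.
database = {"user:1": "Joe"}
cache = ReadThroughCache(database.__getitem__)
cache.get("user:1")  # miss: loaded from the store
cache.get("user:1")  # hit: served from memory
print(cache.hits, cache.misses)
```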

62 App Server Cache

63 App Server Cache

64 App Server Cache

66 In-memory Distributed Hash Table
Memcached instance manifests as a process (often on the same machine as web-server)
Memcached Client maintains a hash table: Which item is stored on which instance
Memcached Server maintains a hash table: Which item is stored in which memory location
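The two levels of hashing can be imitated with plain Python dicts. This is a toy model rather than the memcached client API: `ShardedCacheClient` is a hypothetical class, each dict stands in for one server process, and CRC32 modulo the instance count is an assumed way for the client to decide which instance owns a key.

```python
import zlib


class ShardedCacheClient:
    """Toy model: the client picks the instance, the instance picks the slot."""

    def __init__(self, num_instances: int):
        # Each dict plays the role of one server's own hash table.
        self.instances = [dict() for _ in range(num_instances)]

    def _instance_for(self, key: str) -> int:
        # Client-side hashing: which instance holds this item.
        return zlib.crc32(key.encode("utf-8")) % len(self.instances)

    def set(self, key: str, value) -> None:
        self.instances[self._instance_for(key)][key] = value

    def get(self, key: str):
        return self.instances[self._instance_for(key)].get(key)


client = ShardedCacheClient(num_instances=3)
client.set("user:42", "Joe")
print(client.get("user:42"))
```

The servers never talk to each other in this model; all knowledge of the key distribution lives in the client.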

67 Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types

68 Amazon: S3, SimpleDB, Dynamo
Google: App Engine Datastore, BigTable
Microsoft: SQL Data Services, Azure Storages
Facebook: Cassandra
LinkedIn: Project Voldemort
Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, HBase, Hypertable

69 Basic Concepts: No tables (Containers-Entity); No schema, each tuple has its own set of properties
Amazon SimpleDB: strings only
Microsoft Azure SQL Data Services: Strings, blob, datetime, bool, int, double, etc.; No x-container joins as of now
Google App Engine Datastore: Strings, blob, datetime, bool, int, double, etc.

70 Google BigTable: Sparse, distributed, multi-dimensional sorted map; Indexed by row key, column key, timestamp; Each value is an un-interpreted array of bytes
Amazon Dynamo: Data partitioned and replicated using consistent hashing; Decentralized replica sync protocol; Consistency thru versioning
Facebook Cassandra: Used for Inbox search; Open Source
Scalaris: Keys stored in lexicographical order; Improved Paxos to provide ACID; Memory resident, no persistence
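Consistent hashing, which the slide names as Dynamo's partitioning scheme, can be sketched as a ring of hash points where each key belongs to the next point clockwise. This is a generic illustration rather than Dynamo's implementation: `HashRing`, the SHA-256 hash and the count of 64 virtual points per node are assumptions, and replication to successor nodes is omitted.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hashing: a key belongs to the next node point on the ring."""

    def __init__(self, nodes, vnodes: int = 64):
        # Each node is placed at several points to even out the distribution.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(text: str) -> int:
        return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First point after the key's hash, wrapping around at the end.
        index = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[index][1]


ring = HashRing(["A", "B", "C"])
smaller = HashRing(["A", "B"])  # the same ring after node C leaves
keys = [f"user:{i}" for i in range(200)]
moved = [k for k in keys if ring.node_for(k) != smaller.node_for(k)]
# Only keys that lived on C have to move; keys on A and B stay put.
print(all(ring.node_for(k) == "C" for k in moved))
```

With plain modulo hashing, removing one node would remap most keys; here only the departed node's share moves.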

71 Real Life Scaling requires trade-offs; No Silver Bullet; Need to learn new things; Need to un-learn; Balance!

73 Intelligent People. Uncommon Ideas. Licensed under Creative Commons Attribution Sharealike Noncommercial
