S2Graph : A large-scale graph database

Size: px

Start display at page:

Download "S2Graph : A large-scale graph database"

Junior Atkins
6 years ago
Views:

1 daumkakao S2Graph : A large-scale graph database with Hbase Doyoung Yoon x Taejin Chin

2 DaumKakao A Mobile Lifestyle Platform 1. KakaoTalk a. Mobile Messenger replacing SMS b. KaTalkHe is being used as a verb in Korea like Googling c. 96% of Korean smartphone users are using KakaoTalk d. 170M users worldwide e. 3B messages / day 2

3 DaumKakao A Mobile Lifestyle Platform Social Platform Contents Platform Commerce Platform Marketing Platform Local Platform Personal Platform KakaoTalk Media Daum KakaoPick Yellow ID Daum Map KakaoHome KakaoStory KakaoGame Gift Shop Plus Friend KakaoPlace Sol calendar KakaoGroup Digital Item Store KakaoStyle Story Plus Sol Mail Daum Cafe KakaoTopic Daum Cluod 96% of Zap KakaoPage Korean smartphone Biggest mobile users are using KakaoTalk SNS in Korea Sol Group KakaoMusic messenger, 170 million users worldwide) Daum tvpot Daum Webtoon 3

4 Our Social Graph Advertise Listen count : Message Coupon price : affinity Emoticon affinity affinity: affinity affinity Like count : 7 Pick withfriend : Friend Write length : Play level: 6 View count : affinity affinity affinity affinity Comment Eat rating : affinity Read Search keyword : Group size : 6 Present price : 3 4

5 Our Social Graph Message ID : 201 Ad ID : 603 Advertise ctr : 0.32 Listen count : 6 Music ID Message length : 9 affinity 4 affinity 3 affinity: 9 affinity 6 affinity 9 Item ID : 13 Pick withfriend : 3 Friend affinity 1 Write length : 3 Post ID : 97 Play level: 6 Search keyword : HBase" affinity 3 affinity 2 affinity 2 affinity 3 C l Game ID :

6 Technical Challenges 1. Large social graph constantly changing a. Scale more than, social network: 10 billion edges, 200 million vertices, 50 million update on existing edges. user activities: 400 million new edges per day 6

7 Technical Challenges (cont) 2. Low latency for breadth first search traversal on connected data. a. performance requirement peak graph-traversing query per second: response time: 100ms 7

8 Technical Challenges (cont) 3. Update should be applied to graph in real time for viral effect Person A Fast Person B Fast Person C Fast Person D Post Comment Sharing Mention 8

9 Technical Challenges (cont) 4. Support for Dynamic Ranking logic a. push strategy: hard to change data ranking logic dynamically. b. pull strategy: can try various data ranking logic 9

10 Before Messaging App SNS App Blog App Friend relationship SNS feeds Blog user activities Messaging Each app server should know each DB s sharding logic. Highly inter-connected architecture 10

11 After Messaging App SNS App Blog App S2Graph DB stateless app servers 11

12 S2Graph : Distributed Online GraphDB 1.Low-latency 2.Graph-traversable 3.Scalable 4.Eventually consistent 5.Asynchronous, non-blocking 12

13 Why We Choose HBase? 1.High Availability 2.Scalability 3.Low latency 4.High concurrency 5.Fault tolerant 6.Integration with HDFS 7.Distributed operation 13

14 The Data Model vertex 1 out edges vertex 2 in edges 1. Columns 2. Labels 3. Directions 4. Index Properties 5. Non-index Properties edge 2 label edge 2 source vertex edge 2 target vertex 1 2 comment 3 5 date = know 4 created edge 5 properties name = josh age = 32 vertex 4 id vertex 4 properties 14

15 How to store the data - Edge Logical View 1. Snapshot edges : Up-to-date status of edge row column Tgt Vertex ID1 Tgt Vertex ID2 Tgt Vertex ID3 Src Vertex ID1 Properties Properties Properties Src Vertex ID2 Properties Properties Properties a. Fetching an edge between two specific vertex b. Lookup Table to reach indexed edges for update, increment, delete operations 15

16 How to store the data - Edge Logical View 2. Indexed edges : Edges with index row column Index Values Tgt Vertex ID1 Index Values Tgt Vertex ID2 Src Vertex ID1 Non-index Properties Non-index Properties a. Fetches edges originating from a certain vertex in order of index 16

17 How to store the data - Edge Physical View - table schema 1. Snapshot Edge a. Rowkey Murmur Hash Src Vertex ID Label ID Direction Index Sequence Is Inverted 16 bit variable length 30 bit 2 bit 7bit 1 bit Vertex IDs can be encoded with 8 bit header + byte array (long, integer, short, byte, string) 17

18 How to store the data - Edge Physical View - table schema 1. Snapshot Edge b. Qualifier Target Vertex ID variable length c. Value All Property Key Value Pairs variable length 18

19 How to store the data - Edge Physical View - table schema 2. Indexed Edge a. Rowkey Murmur Hash Src Vertex ID Label ID Direction Index Sequence Is Inverted 16 bit variable length 30 bit 2 bit 7bit 1 bit Vertex IDs can be encoded with 8 bit header + byte array (long, integer, short, byte, string) 19

20 How to store the data - Edge Physical View - table schema 2. Indexed Edge b. Qualifier Index Property Values variable length Tgt Vertex ID variable length c. Value Non-index Property Key Value Pairs variable length 20

21 How to store the data - Vertex Logical View 1. Vertex : Up-to-date status of Vertex row column Property Key1 Property Key2 Src Vertex ID1 Value1 Value2 Vertex ID2 Value1 Value2 21

22 How to store the data - Vertex Physical View - table schema 1. Vertex : Up-to-date status of Vertex a. Rowkey Murmur Hash Column ID Vertex ID 16 bit integer(32bit) variable length b. Qualifier c. Value Property Key Byte(8 bit) Property Value variable length 22

23 How to read the data - GetEdges Using a custom query DSL on top of HTTP User 1 curl -XPOST localhost:9000/graphs/getedges -H 'Content-Type: Application/json' -d ' { "srcvertices": [{"servicename": "s2graph", "columnname": "account_id", "id":1}], Friends Friends Step 1 "steps": [ [{"label": "friends", "direction": "out", "limit": 100}], // step friend 1 friend 2 [{"label": "hear", "direction": "out", "limit": 10}] } ] hear time: hear time: hear time: Step 2 ' Steps = a list of Step Step = contains the labels to traverse and how to rank them in the result Don t let go let it be let it go 23

24 How to read the data - GetEdges Example Friend list User 1 curl -XPOST localhost:9000/graphs/getedges -H 'Content-Type: Application/json' -d ' { "srcvertices": [{"servicename": "s2graph", "columnname": "account_id", "id":1}], } ' "steps": [ ] [{"label": "friends", "direction": "out", "limit": 100}], // step Friends Friends friend 1 friend 2 24

25 How to read the data - GetEdges Example Songs my friends have listened User 1 curl -XPOST localhost:9000/graphs/getedges -H 'Content-Type: Application/json' -d ' { Friends Friends "srcvertices": [{"servicename": "s2graph", "columnname": "account_id", "id":1}], "steps": [ [{"label": "friends", "direction": "out", "limit": 50, scoring : { score : 1.0}], friend 1 friend 2 [{"label": "listen", "direction": "out", "limit": 10}] ] } hear time: hear time: hear time: ' Don t let go let it be let it go Reference : 25

26 How to read the data - GetEdges Example Similar songs to songs that I have listened to. User 1 curl -XPOST localhost:9000/graphs/getedges -H 'Content-Type: Application/json' -d ' { } "srcvertices": [{"servicename": "s2graph", "columnname": "account_id", "id":1}], "steps": [ [{"label": "listen", "direction": "out", "limit": 50}], [{"label": "similar_song", "direction": "out", "limit": 10, scoring : { score : 1.0}] ] hear time: hear time: hear time: Don t let go let it be let it go similar_song similarity: 0.3 similar_song similarity: 0.4 similar_song similarity: 0.6 let it bleed Hey jude Do you wanna build a snowman? 26

27 How to read the data - GetVertices curl -XPOST localhost:9000/graphs/getvertices -H 'Content-Type: Application/json' -d ' [ {"servicename": "s2graph", "columnname": "account_id", "ids": [1, 2, 3]}, {"servicename": "kakaomusic", "columnname": "user_id", "ids": [1, 2, 3]} ] ' User 1 {created_at: , updated_at: } User 2 {created_at: , updated_at: } 27

28 How to write the data - Insert curl -XPOST localhost:9000/graphs/edges/insert -H 'Content-Type: Application/json' -d ' [ {"from":1,"to":2,"label":"graph_test","props":{"time":-1, "weight":10},"timestamp": }, ] ' User 1 User 2 28

29 How to write the data - Delete curl -XPOST localhost:9000/graphs/edges/delete -H 'Content-Type: Application/json' -d ' [ {"from":1,"to":2,"label":"graph_test","timestamp": }, {"from":1,"to":3,"label":"graph_test","timestamp": }, ] ' User 1 User 2 29

30 How to write the data - Update curl -XPOST localhost:9000/graphs/edges/update -H 'Content-Type: Application/json' -d ' [ {"from":1,"to":2,"label":"graph_test","timestamp": , "props": {"is_hidden": true, status : 200}, {"from":1,"to":3,"label":"graph_test","timestamp": , "props": {"status": -500} ] friend {is_hidden:true, User 1 User 2 status:200} 30

31 REST API Spec. Read Write Management 1. getedges 2. checkedge 3. getedgescount 4. getvertices 1. insert 2. delete 3. update 4. increment 1. create service (vertex type) 2. create label (edge type) 3. add Index 31

32 HBase Table Configuration 1. setdurability(durability.async_wal) 2. setcompressiontype(compression.algorithm.lz4) 3. setbloomfiltertype(bloomtype.row) 4. setdatablockencoding(datablockencoding.fast_diff) 5. setblocksize(32768) 6. setblockcacheenabled(true) 7. pre-split by (Intger.MaxValue / regioncount). regioncount = 120 when create table(on 20 region server). 32

33 HBase Cluster Configuration each machine: 8core, 32G memory hfile.block.cache.size: 0.6 hbase.hregion.memstore.flush.size: 128MB otherwise use default value from CDH

34 Overall Architecture Clients S2graph 34

35 Compare to other Online GraphDBs Titan (v0.4.2) a. Pros - Rich API and easy to setup - Relatively large community - Transaction handling b. Cons - Using it s own ID system; less efficient for graph traversal (details in next slide) - Index data stored on one region (hotspot) with strong consistency option - Not many references on Titan with HBase comparing to other storages 35

36 Compare to Titan Titan is less efficient for graph traversal - For following 1 normal graph traversal query, Vertex( userid: 1 ).out( friends ).limit(10).out( friends ).limit(10) User 1 friends friends 36

37 Compare to Titan (cont) Vertex( userid:1 ).out( friend ).limit(10).out( friend ).limit(10) Titan A S2grap B h e C f D Titan S2graph # of read requests on HBase 112 = 1 (Vertex Lookup : a) + 1 (1st step edges : b) + 10 (2nd step edges : c) (Destination Vertices : d) 11 = 1 (1step edges : e) + 10 (2nd step edges : f) 37

38 Performance 1. Test data a. Total # of Edges: 9,000,000,000 b. Average # of adjacent edges per vertex: 500 c. Seed vertex: vertices that has more than 100 adjacent edges. 2. Test environment a. Zookeeper server: 3 b. HBase Masterserver: 2 c. HBase Regionserver: 20 d. App server: 8 core, 16GB Ram 38

39 Performance 2. Linear scalability QPS(Query Per Second) Latency(ms) QPS # of app server QPS 4,000 3,000 2,000 1, Benchmark Query : src.out( friend ).limit(50).out( friend ).limit(10) - Total concurrency: 20 * # of app server 3,097 1, # of app server Latency 39

40 Performance 3. Varying width of traverse (tested with a single server) QPS Latency(ms) 2, , QPS 1, Latency Limit on first step - Benchmark Query : src.out( friend ).limit(x).out( friend ).limit(10) - Total concurrency = 20 * 1(# of app server) 40

41 Performance 3. Varying width of traverse (tested with a single server) QPS Latency(ms) 2, , QPS 1, Latency Limit on first step - Benchmark Query : src.out( friend ).limit(x).out( friend ).limit(10) - Total concurrency = 20 * 1(# of app server)

42 Performance 4. Different query path(different I/O pattern) QPS Latency(ms) QPS Latency > > > 10 -> > 5 -> 10 -> > 5 -> 2 -> 5 -> 10 0 limits on path - All query touch 1000 edges. - each step` limit is on x axis. - Can expect performance with given query`s search space. 42

43 Performance 5. Write throughput per operation on single app server 5 Insert operation 3.75 Latency Request per second 43

44 Performance 6. write throughput per operation on single app server 8 Update(increment/update/delete) operation 6 Latency Request per second 44

45 Stats 1. HBase cluster per IDC (2 IDC) - 3 Zookeeper Server - 2 HBase Master - 20 HBase Slave 2. App server per IDC - 10 server for write-only - 20 server for query only 3. Real traffic - read: over 10K request per second - now mostly 2 step queries with limit 100 on first step. - write: over 5k request per second * Deep traversal queries are not counted since it is in test stage for production 45

46 46

47 Through HBase! 47

Now Available As an Open Source - https://github.

48 Now Available As an Open Source Finding a mentor Contact - Taejin Chin : taejin.chin@gmail.com - Doyoung Yoon : shom83@gmail.com 48

49 Appendix 3.5x performance improvement using Asynchbase Native Client QPS Native Client Latency(ms) Asnychbase QPS Asynchbase Latency(ms) 2, ,000 1, , ,500 1, ,192 QPS 1, Latency QPS 1, Latency # of app server # of app server - Benchmark Query : src.out( friend ).limit(50).out( friend ).limit(10) - Test seed edges have adjacent edges more than 100: 30millions - Total concurrency: 20 * # of app server 49

S2Graph : A large-scale graph database with Hbase

daumkakao S2Graph : A large-scale graph database with Hbase Reference 1. HBase Conference 2015 1.http://www.slideshare.net/HBaseCon/use-cases-session-5 2.https://vimeo.com/128203919 2. Deview 2015 3. Apache