Big Data Analysis Techniques for Social Networks. December 7, 2012, KAIST, Jae-Gil Lee

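2 About the Speaker: Professor Jae-Gil Lee
Career
  December 2010 to present: Assistant Professor, Department of Knowledge Service Engineering, KAIST
  September 2008 to November 2010: Researcher, IBM Almaden Research Center
  July 2006 to August 2008: Postdoctoral Researcher, University of Illinois at Urbana-Champaign
Research areas
  Spatio-temporal data mining (trajectory and traffic data)
  Social network and graph data mining
  Big data analysis (MapReduce and Hadoop)
Contact
  Homepage: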

3 Contents
  1. Big Data and Social Networks
  2. Online Analysis
  3. Offline Analysis
  4. Summary

4 1. Big Data and Social Networks

5 Big Data and Social Networks
The online social network (OSN) is one of the main sources of big data

6 Data Growth in Facebook

7 Data Growth in Twitter

8 Some Statistics on OSNs
Twitter is estimated to have 140 million users, generating 340 million tweets a day and handling over 1.6 billion search queries per day
As of May 2012, Facebook has more than 900 million active users; Facebook has million monthly unique U.S. visitors in May

9 Data Characteristics
Relationship data: e.g., followers
Content data: e.g., tweets
Location data
[Figure: a user connected to contents, location, and relationships]

10 Graph Data
A social network is usually modeled as a graph
  A node → an actor
  An edge → a relationship or an interaction
Graphs are diverse
  directed vs. undirected
  weighted vs. unweighted
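The modeling choices above can be sketched in a few lines of Python. This is a minimal illustration, assuming a dictionary-based adjacency structure; the class name `SocialGraph` and the sample users are hypothetical and do not come from the talk.

```python
class SocialGraph:
    """Toy graph: nodes are actors, edges are relationships or interactions."""

    def __init__(self, directed=True):
        self.directed = directed
        # adjacency[u][v] = weight of the edge u -> v
        self.adjacency = {}

    def add_edge(self, u, v, weight=1.0):
        # An unweighted graph is the special case where every weight is 1.0.
        self.adjacency.setdefault(u, {})[v] = weight
        self.adjacency.setdefault(v, {})
        if not self.directed:
            # An undirected edge is stored in both directions.
            self.adjacency[v][u] = weight

    def neighbors(self, u):
        return sorted(self.adjacency.get(u, {}))


# Directed, unweighted: "alice follows bob" does not imply the reverse.
follows = SocialGraph(directed=True)
follows.add_edge("alice", "bob")

# Undirected, weighted: number of messages exchanged between two friends.
messages = SocialGraph(directed=False)
messages.add_edge("alice", "bob", weight=12)
```

A follower relationship fits the directed form, while a mutual friendship fits the undirected one.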

11 Categories of Graph Data Analysis
Online analysis
  Example: for a given user, finding anyone whose first name is David among his friends, his friends' friends, and his friends' friends' friends
  Typically using graph databases (e.g., Neo4j, HyperGraphDB, FlockDB)
Offline analysis
  Example: calculating PageRank for the entire graph
  Typically using distributed, parallel systems (e.g., MapReduce, Pregel, Trinity)
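The online-analysis example (finding every "David" within three hops of a user) is essentially a bounded breadth-first search. The sketch below assumes an in-memory dictionary of friend lists; the function name `friends_within`, the user IDs, and the names are hypothetical, and a real graph database would answer this through its own traversal API.

```python
from collections import deque


def friends_within(graph, start, max_hops):
    """Return {user: hop distance} for every user within max_hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distance[user] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in graph.get(user, ()):
            if friend not in distance:
                distance[friend] = distance[user] + 1
                queue.append(friend)
    del distance[start]  # the starting user is not his own friend
    return distance


# Hypothetical friendship lists and first names.
friends = {
    "u1": ["u2", "u3"],
    "u2": ["u1", "u4"],
    "u3": ["u1"],
    "u4": ["u2", "u5"],
    "u5": ["u4", "u6"],
    "u6": ["u5"],
}
first_name = {"u1": "Ann", "u2": "David", "u3": "Bob",
              "u4": "Eve", "u5": "David", "u6": "David"}

reachable = friends_within(friends, "u1", max_hops=3)
davids = sorted(u for u in reachable if first_name[u] == "David")
```

Here `u6` is also named David but sits four hops away, so it is excluded from `davids`.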

12 2. Online Analysis

13 Graph Databases (1/2)
Neo4j
  Open source, current version: 1.8 (as of Dec. 2012)
  Running on a single machine
HyperGraphDB
  Kobrix Software, current version: 1.2 (as of Dec. 2012)
  Running on a single machine
FlockDB
  Open source, current version: 1.8 (as of Dec. 2012)
  Running on a cluster of machines
  Being used by Twitter to store social graphs and indexes

14 Graph Databases (2/2)
Storing data as nodes and relationships
Both nodes and relationships can hold properties in a key/value fashion
Being able to navigate the structure

15 FlockDB
A distributed graph database for storing adjacency lists, with goals of supporting:
  A high rate of add/update/remove operations
  Potentially complex set arithmetic queries
  Paging through query result sets (over 1M entries)
  Ability to archive and later restore archived edges
  Horizontal scaling including replication
  Online data migration
but not including:
  Multi-hop queries (or graph-walking queries)
  Automatic shard migrations

16 FlockDB for Twitter
As of April 2010:
  Storing 13+ billion edges
  Sustaining 20k writes/second at peak
  Sustaining 100k reads/second at peak

17 Temporal and Count Operations Temporal Counts Intersection

18 Set Operations This tweet needs to be delivered to people who follow (13M followers) (530K followers)

19 Adjacency Lists (1/2)
Storing the follower relationship as an edge
  source_id: int64
  destination_id: int64
  position: int64, used for sorting (e.g., current time)
  state: int8 (Normal, Removed, Archived)

20 Adjacency Lists (2/2)
Storing an edge in both directions
  Forward: source_id, destination_id, position, state
  Backward: destination_id, source_id, position, state
Each table is indexed and partitioned by its first column (source_id for the forward table, destination_id for the backward table)
Can efficiently answer the question "Who follows A?" as well as "Whom is A following?"
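To make the two-direction storage concrete, here is a minimal in-memory sketch. It is not FlockDB code: the class `EdgeStore`, its method names, and the numeric state codes are assumptions for illustration, as is the choice to mark removed edges rather than delete them.

```python
NORMAL, REMOVED, ARCHIVED = 0, 1, 2  # state names from the slide; numeric codes are assumed


class EdgeStore:
    """Toy in-memory stand-in for a forward and a backward edge table."""

    def __init__(self):
        self.forward = {}   # source_id -> {destination_id: [position, state]}
        self.backward = {}  # destination_id -> {source_id: [position, state]}

    def add(self, source_id, destination_id, position):
        # Each edge is written twice so both directions are a single-key lookup.
        self.forward.setdefault(source_id, {})[destination_id] = [position, NORMAL]
        self.backward.setdefault(destination_id, {})[source_id] = [position, NORMAL]

    def remove(self, source_id, destination_id):
        # Assumption: mark the edge as Removed in both tables instead of deleting it.
        self.forward[source_id][destination_id][1] = REMOVED
        self.backward[destination_id][source_id][1] = REMOVED

    @staticmethod
    def _live(row):
        # Keep edges in the Normal state, ordered by position.
        live = sorted((pos, other) for other, (pos, state) in row.items()
                      if state == NORMAL)
        return [other for _, other in live]

    def following(self, user_id):
        """Whom is user_id following?"""
        return self._live(self.forward.get(user_id, {}))

    def followers(self, user_id):
        """Who follows user_id?"""
        return self._live(self.backward.get(user_id, {}))


store = EdgeStore()
store.add(1, 2, position=100)  # user 1 follows user 2
store.add(3, 2, position=101)  # user 3 follows user 2
store.add(2, 1, position=102)  # user 2 follows user 1
```

Both questions are answered from a single row keyed by the user, which is what lets each query stay within one partition.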

21 Partitioning / Sharding
Data is partitioned by node, so the queries can be answered by a single partition, using an indexed range query
The app servers (affectionately called "flapps") are stateless and horizontally scalable

22 Example Queries
    # Who is following user 1?
    flock.select(nil, :follows, 1).to_a

    # Who's reciprocally following user 1?
    flock.select(1, :follows, nil).intersect(nil, :follows, 1).to_a

    # How about the union then?
    flock.select(1, :follows, nil).union(nil, :follows, 1).to_a

    # Who's following user 1 that user 1 is not following back?
    flock.select(nil, :follows, 1).difference(1, :follows, nil).to_a
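The set arithmetic behind these queries can be mirrored with plain Python sets. This is only an analogy on hypothetical sample data, not the FlockDB client; `followers_of` and `followed_by` are illustrative helper names.

```python
# (follower, followed) pairs; hypothetical sample data
edges = {(2, 1), (3, 1), (4, 1), (1, 2), (1, 5)}


def followers_of(user):
    """Counterpart of select(nil, :follows, user)."""
    return {src for src, dst in edges if dst == user}


def followed_by(user):
    """Counterpart of select(user, :follows, nil)."""
    return {dst for src, dst in edges if src == user}


followers = followers_of(1)
following = followed_by(1)

reciprocal = following & followers         # intersect
either = following | followers             # union
not_followed_back = followers - following  # difference
```

With this data, user 2 is the only reciprocal follower, and users 3 and 4 follow user 1 without being followed back.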

23 3. Offline Analysis

24 Analysis at Scale
Example: running PageRank across users to calculate reputations
  To give any Twitter user a score from 1 to 10 based on their followers and their followers' networks

25 PageRank Overview (1/4)
Google describes PageRank:
  "PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value."
  "... and our technology uses the collective intelligence of the web to determine a page's importance"
A page referenced by many high-quality pages is also a high-quality page

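26 PageRank Overview (2/4)
Formula
  PR(A) = (1 - d) * (1/N) + d * (PR(B) / L(B) + PR(C) / L(C) + ...)
OR
  PR(A) = (1 - d) * (1/N) + d * Σ PR(B) / L(B), summed over every page B that links to A
PR(A): PageRank of a page A
d: the damping factor, i.e., the probability, at any step, that the person will continue clicking links (usually set to 0.85)
L(B): the number of outbound links on a page B
N: the total number of pages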

27 PageRank Overview (3/4)
Example
[Figure: three pages with links A → B, B → C, C → A, and C → B]
  PR(A) = (1 - d) * (1/N) + d * (PR(C) / 2)
  PR(B) = (1 - d) * (1/N) + d * (PR(A) / 1 + PR(C) / 2)
  PR(C) = (1 - d) * (1/N) + d * (PR(B) / 1)
Set d = 0.70 for ease of calculation (N = 3)
  PR(A) = 0.1 + 0.35 * PR(C)
  PR(B) = 0.1 + 0.7 * PR(A) + 0.35 * PR(C)
  PR(C) = 0.1 + 0.7 * PR(B)
Iteration 1: PR(A) = 0.33, PR(B) = 0.33, PR(C) = 0.33
Iteration 2: PR(A) = 0.22, PR(B) = 0.45, PR(C) = 0.33
Iteration 3: PR(A) = 0.22, PR(B) = 0.37, PR(C) = 0.41
...
Iteration 9: PR(A) = 0.23, PR(B) = 0.39, PR(C) = 0.38
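The iteration in this example can be reproduced with a short Python sketch. It assumes the three-page graph implied by the equations (A links to B, B links to C, C links to A and B) and that every page has at least one outbound link; the function name `pagerank` and its parameters are illustrative.

```python
def pagerank(out_links, d=0.85, iterations=20):
    """Iterate PR(A) = (1 - d) / N + d * sum(PR(B) / L(B)) over pages B linking to A.

    Assumes every page has at least one outbound link (no dangling pages).
    """
    n = len(out_links)
    rank = {page: 1.0 / n for page in out_links}  # uniform starting values
    for _ in range(iterations):
        new_rank = {page: (1.0 - d) / n for page in out_links}
        for page, targets in out_links.items():
            share = rank[page] / len(targets)  # PR(B) / L(B)
            for target in targets:
                new_rank[target] += d * share
        rank = new_rank
    return rank


# Graph implied by the example equations.
graph = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}

# The slide counts the uniform start as Iteration 1, so Iteration k is k - 1 updates.
second = pagerank(graph, d=0.70, iterations=1)
ninth = pagerank(graph, d=0.70, iterations=8)
```

With d = 0.70, `ninth` rounds to PR(A) = 0.23, PR(B) = 0.39, PR(C) = 0.38, matching the Iteration 9 values listed in the example.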

28 PageRank Overview (4/4)
A random surfer selects a page and keeps clicking links until getting bored, then randomly selects another page
  PR(A) is the probability that such a user visits A
  (1 - d) is the probability of getting bored at a page (d is called the damping factor)
The PageRank matrix can be computed offline
Google takes into account both the relevance of the page and PageRank

29 MapReduce Basics
To handle big data, Google proposed a new approach called MapReduce
MapReduce can crunch huge amounts of data by splitting the task over multiple computers that operate in parallel
No matter how large the problem is, you can always increase the number of processors (which today are relatively cheap)

30 Two Steps of MapReduce
Map step: The master node takes the input, divides it into smaller subproblems, and distributes them to worker nodes. Each worker node processes its smaller problem and passes the answer back to the master node.
Reduce step: The master node then collects the answers to all the subproblems and combines them in some way to form the output, which is the answer to the problem it was originally trying to solve.

31 Example Programming Model
Q: What is the frequency of each first name?

employees.txt
    # LAST    FIRST    SALARY
    Smith     John     $90,000
    Brown     David    $70,000
    Johnson   George   $95,000
    Yates     John     $80,000
    Miller    Bill     $65,000
    Moore     Jack     $85,000
    Taylor    Fred     $75,000
    Smith     David    $80,000
    Harris    John     $90,

    from functools import reduce  # reduce is not a builtin in Python 3

    # mapper: extract the first name (second tab-separated field)
    def getname(line):
        return line.split('\t')[1]

    # reducer: add one occurrence of the name to the histogram
    def addcounts(hist, name):
        hist[name] = hist.get(name, 0) + 1
        return hist

    lines = open('employees.txt', 'r')
    intermediate = map(getname, lines)
    result = reduce(addcounts, intermediate, {})

Note: pp. 31~36 are borrowed from the KDD 2011 tutorial "Large-scale Data Mining: MapReduce and Beyond"

32 Example Programming Model
Q: What is the frequency of each first name?
(Same employees.txt as on the previous slide; the mapper now emits key-value pairs.)

    from functools import reduce  # reduce is not a builtin in Python 3

    # mapper: emit a (first name, 1) key-value pair
    def getname(line):
        return (line.split('\t')[1], 1)

    # reducer: add the count c to the running total for the name
    def addcounts(hist, pair):
        name, c = pair
        hist[name] = hist.get(name, 0) + c
        return hist

    lines = open('employees.txt', 'r')
    intermediate = map(getname, lines)      # key-value iterator
    result = reduce(addcounts, intermediate, {})

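33 Example Programming Model: Hadoop / Java
    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.conf.*;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.util.*;

    public class HistogramJob extends Configured implements Tool {

      // typed: the generic parameters fix the input and output key/value types
      public static class FieldMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, LongWritable> {

        private static LongWritable ONE = new LongWritable(1);
        private static Text firstname = new Text();

        @Override
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> out, Reporter r)
            throws IOException {
          // non-boilerplate: the only job-specific logic
          firstname.set(value.toString().split("\t")[1]);
          out.collect(firstname, ONE);
        }
      } // class FieldMapper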

34 Example Programming Model: Hadoop / Java
      public static class LongSumReducer extends MapReduceBase
          implements Reducer<Text, LongWritable, Text, LongWritable> {

        private static LongWritable sum = new LongWritable();

        @Override
        public void reduce(Text key, Iterator<LongWritable> vals,
                           OutputCollector<Text, LongWritable> out, Reporter r)
            throws IOException {
          // Add up all counts emitted for this key
          long s = 0;
          while (vals.hasNext())
            s += vals.next().get();
          sum.set(s);
          out.collect(key, sum);
        }
      } // class LongSumReducer

35 Example Programming Model: Hadoop / Java
      public int run(String[] args) throws Exception {
        JobConf job = new JobConf(getConf(), HistogramJob.class);
        job.setJobName("Histogram");
        FileInputFormat.setInputPaths(job, args[0]);
        job.setMapperClass(FieldMapper.class);
        job.setCombinerClass(LongSumReducer.class);
        job.setReducerClass(LongSumReducer.class);
        // ...
        JobClient.runJob(job);
        return 0;
      } // run()

      public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new HistogramJob(), args);
      } // main()
    } // class HistogramJob

36 Execution Model: Flow
[Figure: the input file is divided into splits (SPLIT 0 to SPLIT 3); each MAPPER reads its split with a sequential scan and emits pairs such as (John, 1); the pairs are exchanged all-to-all with hash partitioning and sort-merged; each REDUCER consumes key/value iterators, producing pairs such as (John, 2), and writes one part of the output file (PART 0, PART 1)]
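The flow above (split, map, hash partition, sort-merge, reduce) can be imitated in a single process. The sketch below is a teaching aid under stated assumptions, not how Hadoop is implemented: `run_job`, the round-robin splitting, and the CRC32-based partitioner are all illustrative choices.

```python
from collections import defaultdict
from itertools import groupby
from zlib import crc32


def run_job(records, mapper, reducer, num_splits=4, num_reducers=2):
    # 1. Split the input; each split would go to one mapper.
    splits = [records[i::num_splits] for i in range(num_splits)]

    # 2. Map each record, then hash-partition the emitted pairs by key
    #    so that all values for a key reach the same reducer.
    partitions = defaultdict(list)
    for split in splits:
        for record in split:
            for key, value in mapper(record):
                part = crc32(key.encode("utf-8")) % num_reducers
                partitions[part].append((key, value))

    # 3. Sort each partition so equal keys are adjacent, then reduce per key.
    output = {}
    for part in range(num_reducers):
        pairs = sorted(partitions[part])
        output[part] = {
            key: reducer(key, (value for _, value in group))
            for key, group in groupby(pairs, key=lambda kv: kv[0])
        }
    return output


def first_name_mapper(line):
    yield line.split("\t")[1], 1


def sum_reducer(key, values):
    return sum(values)


lines = [
    "Smith\tJohn\t$90,000",
    "Brown\tDavid\t$70,000",
    "Yates\tJohn\t$80,000",
    "Smith\tDavid\t$80,000",
    "Miller\tBill\t$65,000",
]
parts = run_job(lines, first_name_mapper, sum_reducer)
merged = {name: count for part in parts.values() for name, count in part.items()}
```

Because the partition is chosen from the key alone, each first name ends up in exactly one output part, which is why the parts can simply be concatenated.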

37 Apache Hadoop
The most popular open-source implementation of MapReduce
[Figure: Hadoop ecosystem components: HBase, Pig, Hive, Chukwa, MapReduce, HDFS, ZooKeeper, Core, Avro]

38 PageRank on MapReduce (1/2)
Map: distributing PageRank credit to link targets
Reduce: summing up PageRank credit from multiple sources to compute new PageRank values
Iterate until convergence

39 PageRank on MapReduce (2/2)
    Map(nid n, node N)
        p ← N.PageRank / |N.AdjacencyList|
        emit(nid n, node N)              // Pass along the graph structure
        for nid m ∈ N.AdjacencyList do
            emit(nid m, p)               // Pass a PageRank value to its neighbors

    Reduce(nid m, [p1, p2, ...])
        M ← ∅
        s ← 0
        for p ∈ [p1, p2, ...] do
            if IsNode(p) then
                M ← p                    // Recover the graph structure
            else
                s ← s + p                // Sum up the incoming PageRank contributions
        M.PageRank ← s
        emit(nid m, node M)
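One iteration of this pseudocode can be run locally in Python. The sketch follows the simplified form above (no damping factor) and assumes every node has at least one outgoing link; the dictionary layout of a node and the in-memory grouping that stands in for the shuffle phase are illustrative assumptions.

```python
from collections import defaultdict


def pagerank_map(nid, node):
    yield nid, node  # pass along the graph structure
    p = node["pagerank"] / len(node["adjacency"])
    for m in node["adjacency"]:
        yield m, p   # pass a PageRank share to each neighbor


def pagerank_reduce(nid, values):
    node, s = None, 0.0
    for value in values:
        if isinstance(value, dict):
            node = value  # recover the graph structure
        else:
            s += value    # sum up the incoming PageRank contributions
    return nid, {"adjacency": node["adjacency"], "pagerank": s}


def iterate(graph):
    grouped = defaultdict(list)  # stands in for the shuffle between map and reduce
    for nid, node in graph.items():
        for key, value in pagerank_map(nid, node):
            grouped[key].append(value)
    return dict(pagerank_reduce(nid, values) for nid, values in grouped.items())


# Hypothetical three-node graph: A -> B, B -> C, C -> A and C -> B.
graph = {
    "A": {"adjacency": ["B"], "pagerank": 1 / 3},
    "B": {"adjacency": ["C"], "pagerank": 1 / 3},
    "C": {"adjacency": ["A", "B"], "pagerank": 1 / 3},
}
after_one = iterate(graph)
```

Calling `iterate` repeatedly on its own output corresponds to chaining one MapReduce job per iteration until the values stop changing.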

40 Implementation
Cloud9
Jimmy Lin and Michael Schatz. Design Patterns for Efficient Graph Algorithms in MapReduce. Proceedings of the 2010 Workshop on Mining and Learning with Graphs (MLG-2010), July 2010, Washington, D.C.

41 Pig
Pig raises the level of abstraction for processing large datasets
  Turning the transformations into a series of MapReduce jobs
The language used to express data flows is called Pig Latin

42 A Real Pig Script

43 Java program for the same task

44 4. Summary

45 Summary
FlockDB: Real-time Analysis
Hadoop: Storing and Analyzing Data
Cassandra: Storing Tweets
HBase: Searching People
Pig: Easier (SQL-like) Analysis
Scribe: Log Data Collection

46 THANK YOU
