Social Network Analytics on Cray Urika-XA

1 Social Network Analytics on Cray Urika-XA Mike Hinchey, Technical Solutions Architect Cray Inc, Analytics Products Group April, 2015

2 Agenda 1. Introduce platform Urika-XA 2. Technology and architecture for analytics Apache Spark 3. Use case analysis and results Social Network Analysis 4. Conclusions

3 Urika-XA Hardware Extreme Analytics 48 Analytic Nodes 96 CPUs, 1536 cores 6 TB total RAM 38 TB total local SSD (for HDFS) 48 TB total local HDD 120 TB Sonexion 900 Lustre Storage FDR InfiniBand Fabric Network Standard 42U Rack Dual rack configuration also available

4 Urika-XA Software Extreme Analytics Cloudera Hadoop Distribution and Management UI HDFS on the 38 TB local SSD YARN manages jobs on 48 nodes Hadoop MapReduce Apache Spark

5 Urika-GD 4 TB RAM 128 XMT compute processors 128 hardware threads per processor Lustre file system RDF: W3C Resource Description Framework SPARQL, W3C graph query language Graph Discovery

6 Goals for this Project 1. Business Use Case Demonstrate analytics On a broadly accessible use case Showing valuable insights 2. Technology and Architecture Bring together various technologies and techniques Demonstrate architecture of an end-to-end solution Cray R&D also uses this for performance tests

7 Business Use Case Collect data from social media Discover communities of users with interest in a particular topic (consumer electronics, sports) Identify users according to role: key influencers, rebroadcasters, connectors

8 Process Overview

9 Technology Overview Apache Spark applications Data load and transform Community detection Analytics Query for visualization Web app, JavaScript Query from the Spark app Charts and graphs to visualize data and results This presentation

10 Technology and Architecture Bring together various technologies and techniques Demonstrate an end-to-end solution

11 Lambda Architecture Principles for an analytics system that includes Batch and Real-time pipelines Based on functional programming (lambdas) To achieve consistency, reliability, etc Source data is immutable, append-only Business/analytics code duplicated for Batch and Real-time use cases

12 Lambda Architecture Batch layer for completeness and accuracy: typically Hadoop/MapReduce Speed layer for real-time, minimal latency, may sacrifice accuracy Batch layer Serving layer Data stream Presentation Real-time stream

13 Kappa Architecture Rethinking the Lambda Architecture - multiple frameworks and duplication of code is too difficult Rethinking the traditional database - based on a transaction log, but only internally Use Streams everywhere, the transaction/event log is the foundation of all data Avoid the traditional batch pipeline (where possible, wrt legacy software) Avoid inconsistent caches of data, like memcached, within apps, etc

14 Kappa Architecture Batch is a slow-lane stream, and allows for re-processing of historical data Real-time is a fast-lane stream using the same framework, so code is shared Batch stream Data stream Serving layer Presentation Real-time
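The point about sharing code between the slow lane and the fast lane can be illustrated with a minimal sketch in plain Python (not Spark). The event fields, the `replay` helper and the sample log are hypothetical and only serve the illustration: one analytics function is applied both to a full replay of the append-only log and to its most recent part.

```python
from collections import Counter
from typing import Iterable, Iterator

def count_hashtags(events: Iterable[dict]) -> Counter:
    """Shared analytics step: count hashtags in any stream of tweet events."""
    counts = Counter()
    for event in events:
        counts.update(tag.lower() for tag in event.get("hashtags", []))
    return counts

def replay(log: list, start: int = 0) -> Iterator[dict]:
    """Re-read the immutable, append-only log from a given offset."""
    return iter(log[start:])

# Hypothetical event log; the field names are illustrative only.
event_log = [
    {"user": "a", "hashtags": ["NBA", "music"]},
    {"user": "b", "hashtags": ["nba"]},
    {"user": "c", "hashtags": ["Ubuntu"]},
]

# Slow lane: re-process the whole history.
batch_counts = count_hashtags(replay(event_log))
# Fast lane: the same function applied to the newest events only.
recent_counts = count_hashtags(replay(event_log, start=2))
```

Because both lanes call the same `count_hashtags`, there is no second implementation to keep consistent, which is the duplication the Kappa approach tries to avoid.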

15 Apache Spark Functional API, immutability, stateless Immutable dataset abstraction, transparently distributed High-level API: map, reduce, filter, group by, join, union, left outer join Graph Algorithms: pagerank, svd++, connected components, shortest paths Machine Learning: k-means, linear regression, logistic regression, naive bayes Streaming: real-time, periodic

16 Why XA? Considering the principles of Lambda and Kappa Architectures, And the capabilities of Spark, What is the value of Urika-XA? Pre-configured Hadoop/Yarn cluster Minimize time to value for a project Hardware architecture built for both batch and real-time

17 Why XA? Hardware Architecture built for both batch size and real-time speed Lots of memory and CPU Perform numerous transformations and joins in memory HDFS on fast, local SSD Bigger joins, and temporary files are fast Shared file system, Lustre Parallel and fast, for input and output data Reliable without 3x data duplication

18 SNA - ETL Source data is stored in immutable files Start the ETL (extract, transform, load) process from a chosen start date (to re-process old data) The spark-streaming window specifies how much data goes into each micro-batch
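As a rough illustration of replaying archived data from a start date in fixed windows, here is a plain-Python sketch. It is not Spark Streaming code; the `ts` field, the `micro_batches` helper and the sample records are assumptions made for the example.

```python
from datetime import datetime, timedelta
from itertools import groupby

def micro_batches(records, start, window):
    """Yield (window_start, records) for records at or after `start`,
    grouped into fixed-size windows. Records must be sorted by time."""
    def window_start(rec):
        n = (rec["ts"] - start) // window   # index of the window
        return start + n * window
    selected = (r for r in records if r["ts"] >= start)
    for key, group in groupby(selected, key=window_start):
        yield key, list(group)

# Hypothetical archived records, already sorted by timestamp.
records = [
    {"ts": datetime(2015, 1, 1, 0, 0, 5), "text": "before the start date"},
    {"ts": datetime(2015, 1, 1, 0, 1, 10), "text": "first"},
    {"ts": datetime(2015, 1, 1, 0, 1, 20), "text": "second"},
    {"ts": datetime(2015, 1, 1, 0, 1, 45), "text": "third"},
]

start = datetime(2015, 1, 1, 0, 1, 0)
batches = list(micro_batches(records, start, timedelta(seconds=30)))
```

Changing `start` re-processes older data without touching the archived files, and changing the window size changes how much data lands in each micro-batch.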

19 SNA - Real-time Fast-lane window is seconds: for real-time alerts, complex events Aggregations, metrics Complex Event Processing (CEP), such as spotting trending hashtags
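Spotting trending hashtags over a window of seconds might look like the following plain-Python sketch. The `TrendingHashtags` class, the numeric timestamps and the 10-second window are all hypothetical; in the talk this role is played by a Spark Streaming window.

```python
from collections import Counter, deque

class TrendingHashtags:
    """Count hashtags seen within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()      # (timestamp, hashtag), oldest first
        self.counts = Counter()

    def add(self, timestamp, hashtag):
        self.events.append((timestamp, hashtag))
        self.counts[hashtag] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def top(self, n):
        return self.counts.most_common(n)

trending = TrendingHashtags(window_seconds=10)
trending.add(0, "nba")
trending.add(1, "nba")
trending.add(2, "music")
top_early = trending.top(1)
trending.add(12, "ubuntu")   # the first three events are now outside the window
top_late = trending.top(1)
```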

20 Community Detection Label Propagation (LP) is a Graph-based Community Detection Algorithm (CDA) LP is not implemented as a streaming algorithm, so it is executed periodically, on one day of collected data More data produces better results, more meaningful communities This is done in a second stream: not real-time, longer window, lower latency
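To show the idea behind label propagation, here is a small synchronous version in plain Python. It is a sketch of the general technique, not the GraphX implementation used in the talk; the tie-breaking rule (smallest label wins), the `max_steps` default and the toy graph are assumptions.

```python
from collections import Counter

def label_propagation(neighbors, max_steps=10):
    """Each vertex starts with its own label, then repeatedly adopts the
    most frequent label among its neighbors (ties: smallest label)."""
    labels = {v: v for v in neighbors}
    for _ in range(max_steps):
        new_labels = {}
        for v, nbrs in neighbors.items():
            if not nbrs:
                new_labels[v] = labels[v]   # isolated vertex keeps its label
                continue
            freq = Counter(labels[n] for n in nbrs)
            best = max(freq.values())
            new_labels[v] = min(l for l, c in freq.items() if c == best)
        if new_labels == labels:            # converged
            break
        labels = new_labels
    return labels

# Toy network: two disconnected triangles.
graph = {
    1: [2, 3], 2: [1, 3], 3: [1, 2],
    4: [5, 6], 5: [4, 6], 6: [4, 5],
}
communities = label_propagation(graph)
```

Synchronous updates can oscillate on some graphs, which is why the sketch caps the number of steps instead of waiting for convergence.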

21 Visualization This presentation is a web app, loads data that is output from the Spark job d3.js: render charts and graphs in SVG crossfilter.js: manipulate data across multiple dimensions dc.js: reusable charts

22 SNA Analytics Pipeline

23 SNA Analytics Pipeline Social Network Analytics ETL Algorithms Analysis Visualization ETL (extract, transform, load): Spark Streaming, Scala Algorithms: GraphX Label Propagation, Machine Learning Analysis: Spark, Scala, SQL Visualization: JavaScript, D3, SVG

24 Source Data - Twitter.com Tweet download is based on search terms (related to consumer electronics, sports, life sciences, etc) Streaming download since April 2014 Data archived in files to allow reprocessing

25 Source Data - Twitter.com [Charts: tweets per hour; tweets per day] The full Twitter firehose is about 600M tweets/day. During the displayed timeframe, we collected 674,106,415 tweets, about 0.91% of the firehose.

26 Source Data - Storage JSON saved to files, gzipped 2,290 files, 317GB [Chart: files per day]
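Reading such an archive back could be sketched as follows. The slide does not say how records are laid out inside each file, so one JSON object per line is an assumption here, as are the file name and the sample records. Malformed lines are counted rather than raised.

```python
import gzip
import json
import os
import tempfile

def read_tweets(path):
    """Parse a gzipped file with one JSON object per line (assumed layout).
    Returns the parsed records and the number of unparseable lines."""
    tweets, errors = [], 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                tweets.append(json.loads(line))
            except json.JSONDecodeError:
                errors += 1
    return tweets, errors

# Demo with a throwaway file; the name and contents are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tweets-sample.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(json.dumps({"user": "a", "text": "hello #NBA"}) + "\n")
        f.write("{truncated record\n")
        f.write(json.dumps({"user": "b", "text": "hi"}) + "\n")
    tweets, errors = read_tweets(path)
```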

27 SNA - Counts and Aggregations Aggregations are done both periodically (per window) and as a running total since processing began Tweets Users Unique hashtags
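A minimal plain-Python sketch of keeping both kinds of totals side by side; the record fields and the `aggregate` helper are hypothetical, and each inner list stands for one window of tweets.

```python
def aggregate(windows):
    """Return per-window counts and running totals for tweets, users
    and unique hashtags."""
    seen_users, seen_tags = set(), set()
    total_tweets = 0
    report = []
    for tweets in windows:
        users = {t["user"] for t in tweets}
        tags = {tag.lower() for t in tweets for tag in t["hashtags"]}
        total_tweets += len(tweets)
        seen_users |= users
        seen_tags |= tags
        report.append({
            "window": {"tweets": len(tweets), "users": len(users),
                       "hashtags": len(tags)},
            "running": {"tweets": total_tweets, "users": len(seen_users),
                        "hashtags": len(seen_tags)},
        })
    return report

# Two hypothetical windows of tweets.
windows = [
    [{"user": "a", "hashtags": ["NBA"]},
     {"user": "b", "hashtags": ["nba", "music"]}],
    [{"user": "a", "hashtags": ["ubuntu"]}],
]
report = aggregate(windows)
```

Note that running totals for users and hashtags need the sets of values seen so far, not just the per-window counts, since the same user can appear in several windows.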

28 More Counts and Aggregations Hashtags matched to topics Top hashtags Top hashtags per user Errors in source data NSFW: censor out some tweets based on keywords

29 Build the network for CDA Label propagation (LP) is a community detection algorithm (CDA), builtin to Spark-GraphX Input is a network - a list of relationships between entities We'll look at users that mention other users in tweets Further restrict to where Users have mentioned each other If user A mentioned user B and B mentioned A then infer that A knows B
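The mutual-mention rule described here can be sketched in a few lines of plain Python. The input format, a list of (author, mentioned user) pairs, is an assumption; self-mentions are dropped.

```python
def mutual_mention_edges(mentions):
    """Keep an undirected edge A-B only if A mentioned B and B mentioned A."""
    directed = {(a, b) for a, b in mentions if a != b}
    return {tuple(sorted((a, b))) for (a, b) in directed if (b, a) in directed}

# Hypothetical (author, mentioned_user) pairs extracted from tweets.
mentions = [
    ("A", "B"), ("B", "A"),   # mutual: A knows B
    ("A", "C"),               # one-way: dropped
    ("D", "D"),               # self-mention: dropped
]
edges = mutual_mention_edges(mentions)
```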

30 Communities LP results in one community for each user Community User member ACBPSTL Real_DealRaps wildabeast24 JuggDaGreat meggahpopular CraftMatik AlMcFallinIII lyricalvinom MiltownBloe ParkLyfeEnt CORTEZ_HSP TheSaurus831 CraveMyThoughts TheComedyHumor AdorableWords femaienotes diaryforteens FemaIeThings TeenagerNotes FemaleTexts StealHisHeart TheseDamnQuote LooneyTunes002 PolitiBunny truckinmatador Ann_Marie1 medfordcaniac cdnkaren fazwiesenfeld andilinks grsvt81 sarahzview Philscbx Brockr1967Brock MLKstudios AmareshMisraFC justinwooten

31 Community metrics Count the users in each community Density is proportion of users that know each other Filter out tiny and huge communities as not interesting
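One way to compute these metrics, as a plain-Python sketch: density is interpreted here as the number of edges inside a community divided by the number of possible user pairs, n(n-1)/2. That reading of "proportion of users that know each other", the size thresholds and the sample data are assumptions.

```python
from collections import defaultdict

def community_metrics(labels, edges):
    """labels: user -> community; edges: undirected (user, user) pairs."""
    members = defaultdict(set)
    for user, community in labels.items():
        members[community].add(user)
    internal = defaultdict(int)
    for a, b in edges:
        if a in labels and b in labels and labels[a] == labels[b]:
            internal[labels[a]] += 1
    metrics = {}
    for community, users in members.items():
        n = len(users)
        possible = n * (n - 1) // 2           # number of user pairs
        density = internal[community] / possible if possible else 0.0
        metrics[community] = {"size": n, "density": density}
    return metrics

def interesting(metrics, min_size, max_size):
    """Filter out tiny and huge communities."""
    return {c: m for c, m in metrics.items()
            if min_size <= m["size"] <= max_size}

labels = {"a": 1, "b": 1, "c": 1, "d": 2}
edges = [("a", "b"), ("b", "c"), ("c", "d")]
metrics = community_metrics(labels, edges)
kept = interesting(metrics, min_size=2, max_size=100)
```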

32 Community Characterization Community Topic references Find ways to describe communities Most popular topics among users Sports Finance Consumer Electron

33 Community Characterization Community Hashtag references hardwork GenerationsLegacy RageBoy Bellarke autism NBA watch Music cover MaxScherz quote google Ubuntu vaccines money start pharm Business fandomscollide stream Most popular hashtags among users

34 User Roles Identify user roles within a community key influencers: retweeted by others rebroadcasters: retweet a lot Identify user roles between communities connectors: have relationships with people in different communities, with balanced strength to each community
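These roles could be scored along the following lines. This is a plain-Python sketch under stated assumptions, not the scoring used in the talk: influence and rebroadcasting are simple retweet counts, and the connector score is a hypothetical balance measure (weakest tie count divided by strongest, across the communities a user touches).

```python
from collections import Counter

def retweet_roles(retweets):
    """retweets: (retweeter, original_author) pairs.
    Returns (times retweeted by others, retweets made) per user."""
    influence = Counter(author for _, author in retweets)
    rebroadcast = Counter(retweeter for retweeter, _ in retweets)
    return influence, rebroadcast

def connector_score(user, edges, labels):
    """Balance of a user's ties across communities: 0.0 if the user
    touches fewer than two communities, 1.0 if perfectly balanced."""
    ties = Counter()
    for a, b in edges:
        if a == user:
            ties[labels[b]] += 1
        elif b == user:
            ties[labels[a]] += 1
    if len(ties) < 2:
        return 0.0
    return min(ties.values()) / max(ties.values())

# Hypothetical data for illustration.
retweets = [("r", "k"), ("r", "k2"), ("s", "k")]
influence, rebroadcast = retweet_roles(retweets)

labels = {"x": 1, "a": 1, "b": 1, "c": 2, "d": 2}
edges = [("x", "a"), ("x", "b"), ("x", "c"), ("x", "d"), ("a", "b")]
score_x = connector_score("x", edges, labels)
score_a = connector_score("a", edges, labels)
```

In this toy data, "x" has two ties into each community and so scores as a balanced connector, while "a" only has ties inside its own community.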

35 Results

36 Communities and Popular Hashtags

37 User Roles within a Community

38 Connector Role across Communities

39 Conclusions

40 Conclusions Analytics needs a variety of techniques: Graph, Machine Learning, Iterative, Streaming Spark: functional, high-level, transparently distributed Urika-XA: pre-configured cluster, 6 TB memory

41 References Apache Spark: Twitter Data: Lambda Architecture: Kreps, Jay, "Questioning the Lambda Architecture", 7/2/2014, Kleppmann, Martin, "Turning the database inside out with Apache Samza", 9/21/2014, DC, dimensional charting:

42 Questions? Or contact me later... Cray Analytics, Urika-XA: Mike Hinchey,
