A Highly Efficient Runtime and Graph Library for Large Scale Graph Analytics

Size: px

Start display at page:

Download "A Highly Efficient Runtime and Graph Library for Large Scale Graph Analytics"

Ezra Stafford
5 years ago
Views:

1 A Highly Efficient Runtime and Graph Library for Large Scale Graph Analytics Ilie Gabriel Tanase Research Staff Member, IBM TJ Watson Yinglong Xia, Yanbin Liu, Wei Tan, Jason Crawford, Ching-Yung Lin IBM TJ Watson Lifeng Nai Georgia Tech 2014/02/15

Streams (ISS) GBase (update, scan, operators, indexing)) HBase HDFS Graph Search Graph Query Graph Matching Huge Network Visualization Geo Network Visualization Network Info Flow

2 Graph Accelerat or System G v1.0 Architecture Visualization Analytics Middleware Database I2 3D Network Visualization Communities Centralities Ego Net Features BigInsights Hadoop Network Propagation Infospher e Streams (ISS) GBase (update, scan, operators, indexing)) HBase HDFS Graph Search Graph Query Graph Matching Huge Network Visualization Geo Network Visualization Network Info Flow Shortest Paths Graph Sampling Graph Processing Interface Shared Memory Graph Library Generic Graph Library Graph Data Interface TinkerPop Compliant DBs DB2 RDF DB2 Distr. Memory Graph Library Graph Communication Layer (PAMI/RDMA) Graphical Model Visualization Bayesian Networks Latent Net Inference Markov Networks Native Store

3 slower speed of processing faster The Spectrum of Open Source Graph Technology single machine # machines clusters Jung Fulgora Aurelius NetworX NetworX.org GraphLab CMU DEX Sparsity Tech. Neo4J Neo4j.org OrientDB NuvolaBase RedisGraph MIT Titan Aurelius Giraph Apache Yahoo,FB, Faunus Aurelius in memory disk 2014/02/15

4 Motivation and Requirements Flexible Graph Datastructure In memory only, persistent, both Directed, undirected, directed with predecessors ACID properties Run well on IBM machines including X86, Power, Bluegene Large memory, large number of cores Clusters with Infiniband or specialized networks (RDMA) Commercial solution

5 IBM Parallel Programming Library C++ Object oriented design - inheritance Generic using templates Datastructures Graphs, Hash Tables, Arrays Large shared memory Concurrency Distributed memory clusters Messaging API based on active messages and RDMA

6 IBM PPL Graph class hierarchy Graph<VertexProperty, EdgeProperty, Directness> - in memory only - custom vertex and edge property InDiskGraph - in memory and persistent storage MultipropertyGraph - StorageType : InMemory, Hybrid

7 Multiproperty Graph Label Key Vertex Person Name : Gabriel SSN : xxx-xx Value Friend since : 2010 Label Works at Since : 2010 Edge Edge Name : John 2014/02/15 Company Name:IBM

8 Persistent Storage Write through policy for now Separate structure from properties : benefits computations based on structure only Efficient graph loading : on demand Versioning

9 Programming Model/ Runtime A graph is a collection of vertices Each vertex maintains its in and out edges Parallel processing on IBMPPL graph Task based model of parallelism execute_tasks(wf, num_tasks) for_each(graph, wf); schedule_task_graph(tg) Work stealing Two level nested parallelism Within shared memory for now

10 Execution Time(sec) Performance Add Vertex Add vertices and for each vertex add a property Indexed v=add_vertex(); v.set_property( name, vertex0 ); Titan with Berkeley DB backend Intel Haswell 24 core 2.7GHz and 256 GB, SSD Add vertex SG Add vertex Neo4j Add vertex Titan M 2M 3M 4M 5M 10M Number of vertices

11 Execution Time(sec) Performance Add Edge Add edges randomly The source and target are specified as vertex properties add_edge( vertex0, vertex7 ) Index lookup AddEdge SG AddEdge Neo4j AddEdge Titan M 2M 3M 4M 5M 10M # Vertices (Edges=10*Vertices)

12 TEPS Performance - Query For a given vertex collect all its neighbors up to depth=3 ~1000 edges traversed per query M 2M 3M 4M 5M 10M Graph Size (Vertices, Edges=10xVertices) TEPS_SG TEPS_NEO4j TEPS Titan

13 2014/02/15

14 RDF Graph Construction Load the.csv files for vertices Load the.csv files for edges Construct property graph in memory only 2014/02/15

TEPS Query 2 4000000 3500000 3000000 2500000 2000000 2014/02/15 1500000

15 TEPS Query /02/ Edges/query

TEPS (TCMalloc) Impact of Parallelism on Throughput #concurrent component Queries are processed by a single thread 160000000 TCMalloc Queries are

16 TEPS (TCMalloc) Impact of Parallelism on Throughput #concurrent component Queries are processed by a single thread TCMalloc Queries are assigned to threads evenly Q2 Q2 P Q4 Q4 P Q6 Q6 P Cores/threads 2014/02/15

17 Conclusion Graph databases are gaining in popularity Google, Facebook, Twitter, Paypal, BAML Feature System G Native Store Neo4j Titan Back-end Graph Graph Non-graph Scaling Yes Moderate Yes Traversal efficiency Perfect Good Poor Schemaless Yes Yes No User defined function Yes No Yes Performance-critical App. Perfect Good Poor Multi-language APIs C++, Java, Python, Shell Java, Cypher Java, Gremlin

Dynamic Graph Query Support for SDN Management. Ramya Raghavendra IBM TJ Watson Research Center

Dynamic Graph Query Support for SDN Management Ramya Raghavendra IBM TJ Watson Research Center Roadmap SDN scenario 1: Cloud provisioning Management/Analytics primitives Current Cloud Offerings Limited