Analyzing Flight Data

1 IBM Analytics
Analyzing Flight Data
Jeff Carlson, Rich Tarro
July 21, IBM Corporation

2 Agenda
- Spark Overview: a quick review
- Introduction to Graph Processing and Spark GraphX
- GraphX Overview
- Demo Scenario Overview
- Demo
- Wrap-up

4 What is Spark?
Spark is an open source, in-memory application framework for distributed data processing and iterative analysis on massive data volumes.
Analytic Operating System

5 Key reasons for interest in Spark
- Performant
  - In-memory architecture greatly reduces disk I/O
  - Anywhere from x faster for common tasks
- Productive
  - Concise and expressive syntax, especially compared to prior approaches
  - Single programming model across a range of use cases and steps in the data lifecycle
  - Integrated with common programming languages: Java, Python, Scala
  - New tools continually reduce the skill barrier for access (e.g. SQL for analysts)
- Leverages existing investments
  - Works well within the existing Hadoop ecosystem
- Improves with age
  - A large and growing community of contributors continuously improves the full analytics stack and extends its capabilities

6 Spark includes a set of core libraries that enable various analytic methods which can process data from many sources
- Spark SQL: executes SQL statements
- Spark Streaming: performs streaming analytics using micro-batches
- MLlib (machine learning): common machine learning and statistical algorithms
- GraphX (graph): distributed graph processing framework
- Spark Core Engine: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions
- Data sources: BigInsights (HDFS), Cloudant, dashDB, SQL DB, Object Storage, and many others. A large variety of data sources and formats can be supported, on premises or in the cloud (IBM Cloud, other clouds, cloud apps, on-premises).

7 Spark Application Architecture
- A Spark application is initiated from a driver program
- Spark execution modes:
  - Standalone with the built-in cluster manager
  - Use Mesos as the cluster manager
  - Use YARN as the cluster manager
  - Standalone cluster on any cloud (Bluemix, IBM SoftLayer, Amazon, Azure, ...)

8 Spark RDDs
- Immutable
- Two types of operations
  - Transformations ~ DDL (Create View V2 as ...)
    - Lazy evaluation
    - val rddnumbers = sc.parallelize(1 to 10)       // numbers from 1 to 10
    - val rddnumbers2 = rddnumbers.map(x => x + 1)   // numbers from 2 to 11
    - The lineage describing how to obtain rddnumbers2 from rddnumbers is recorded as a Directed Acyclic Graph (DAG)
    - No actual data processing takes place at this point
  - Actions ~ Select (Select * From V2 ...)
    - Perform computations
    - rddnumbers2.collect()   // Array(2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
    - Perform the recorded transformations and the action
    - Return a value (or write to a file)
- Fault tolerance
  - If data in memory is lost, it is recreated from the lineage
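The lazy-evaluation behaviour described on this slide can be imitated in a few lines of plain Python. This is only a toy sketch, not Spark code: the `LazyRDD` class and the `add_one` helper are hypothetical names invented here to show that a transformation only records lineage, while an action replays it.

```python
class LazyRDD:
    """Toy stand-in for an RDD: records lineage, computes only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source            # base data (any re-iterable collection)
        self._lineage = tuple(lineage)   # functions to apply, in order

    def map(self, fn):
        # Transformation: returns a NEW object with a longer lineage.
        # No element of the source is touched here.
        return LazyRDD(self._source, self._lineage + (fn,))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        result = []
        for item in self._source:
            for fn in self._lineage:
                item = fn(item)
            result.append(item)
        return result


calls = []  # records every time the mapped function actually runs


def add_one(x):
    calls.append(x)
    return x + 1


rdd_numbers = LazyRDD(range(1, 11))
rdd_numbers2 = rdd_numbers.map(add_one)   # transformation: lineage only
calls_before_action = len(calls)          # nothing has been computed yet
collected = rdd_numbers2.collect()        # action: computation happens now
```

Because the result is derived from the source plus the lineage, calling `collect()` again simply recomputes it, which is the idea behind recreating lost in-memory data.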

10 Graphs are Central to Analytics
- Data is not just getting bigger, it's getting more connected
- In many use cases, the relationships between data points provide as much value as the data points themselves, or more
- Discovering data relationships and interdependencies is critical to many applications
  - Fraud detection
  - Better understanding of customer relationships
  - Ranking web pages or people in social networks
- Graph analytics is a powerful tool for understanding and exploiting the connections in data
- Graph applications are everywhere today and are a critical component of many next generation applications

11 What is a Graph?
- A graph is a mathematical structure used to model relations between objects. A graph is made up of vertices and the edges that connect them. The vertices are the objects and the edges are the relationships between them.
- Directed graph
  - A graph where the edges have a direction associated with them.
  - An example of a directed graph is Twitter followers. User Bob can follow user Carol without implying that user Carol follows user Bob.
- Undirected graph
  - A graph where the edges have no direction, so every relationship is mutual. (A regular graph, by contrast, is one in which each vertex has the same number of edges.)
  - An example of an undirected graph is Facebook friends. If Bob is a friend of Carol, then Carol is also a friend of Bob.
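The difference between a one-way relationship (following) and a mutual one (friendship) can be sketched with plain Python sets. The names and helper functions below are illustrative only; a mutual relationship is modelled here as an unordered pair, which is how an undirected edge behaves.

```python
# Directed edges: ordered (source, destination) pairs.
follows = {("Bob", "Carol")}


def is_following(a, b):
    """True if there is a directed edge from a to b."""
    return (a, b) in follows


# Mutual (undirected) edges: unordered pairs, stored as frozensets.
friendships = {frozenset(("Bob", "Carol"))}


def are_friends(a, b):
    """True if a and b share an edge, regardless of argument order."""
    return frozenset((a, b)) in friendships
```

With this data, `is_following("Bob", "Carol")` holds while `is_following("Carol", "Bob")` does not, whereas `are_friends` gives the same answer in both directions.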

12 Spark GraphX
- Graph processing system, NOT a database
- GraphX extends the Spark RDD by introducing a Graph abstraction
  - A directed multigraph with properties attached to each vertex and edge
- GraphX exposes a set of fundamental operators to support graph computation
  - subgraph, joinVertices, aggregateMessages, ...
- Algorithms to simplify graph analytics tasks
  - In addition to a highly flexible API, GraphX comes with a growing library of graph algorithms
  - PageRank, Triangle Counting, ...

13 Spark GraphX: Flexible Graphing
- GraphX unifies ETL, exploratory analysis, and iterative graph computation
- You can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms with the API

14 GraphX and the Alternatives
- GraphX
  - Optimized for running complex algorithms on the entire graph
- Relational databases
  - Inadequate for any real type of graph analysis
- Graph databases
  - Database transactions: updates and deletes
  - Typically work with small sections of the graph, e.g. querying small groups of vertices

15 Graph Databases
- The same restrictions that enable graph databases to achieve substantial performance gains also limit their ability to express many of the important stages in a typical graph-analytics pipeline
- They often require data movement outside of the graph topology to express operations that are more naturally expressed as relational/table operations

16 GraphX Benefits
- Unifies graph-centric and data-centric computation in one system with a single composable API
- Enables users to view data both as graphs and as collections (i.e., RDDs) or tables (DataFrames) without data movement or duplication

18 Property Graphs
- GraphX implements an object called the property graph
  - A directed multigraph with user-defined objects attached to each vertex and edge
  - Like RDDs, property graphs are immutable, distributed, and fault-tolerant
- Directed multigraphs can have multiple edges in parallel
  - Every edge and vertex has user-defined properties associated with it
  - The parallel edges allow multiple relationships between the same vertices
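To make the idea of parallel edges concrete, here is a small plain-Python sketch (not GraphX code). The `Edge` tuple and the relationship labels are hypothetical; the point is only that two edges may share the same source and destination while carrying different properties.

```python
from collections import Counter, namedtuple

# An edge carries a source id, a destination id, and a user-defined property.
Edge = namedtuple("Edge", ["src", "dst", "attr"])

# Hypothetical data: two parallel edges from vertex 1 to vertex 2.
# A tuple is used so the edge collection itself cannot be modified in place.
edges = (
    Edge(1, 2, "colleague"),
    Edge(1, 2, "co-author"),
    Edge(2, 3, "advisor"),
)

# Count how many edges connect each ordered (src, dst) pair.
edges_per_pair = Counter((e.src, e.dst) for e in edges)

# Pairs of vertices joined by more than one edge.
parallel_pairs = {pair for pair, n in edges_per_pair.items() if n > 1}
```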

19 Vertex and Edge RDDs
- GraphX exposes RDD views of the vertices and edges stored within the graph
- VertexRDD[A] extends RDD[(VertexID, A)] and adds the constraint that each VertexID occurs only once
- EdgeRDD[ED] extends RDD[Edge[ED]] and organizes the edges in blocks partitioned using one of the partitioning strategies defined in PartitionStrategy

20 Example Property Graph [figure]

21 Example: Constructing a Property Graph
- Construct a property graph consisting of the various collaborators
- The vertex property might contain the username and occupation
- The edges carry a string describing the relationship between collaborators
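A minimal plain-Python sketch of such a collaborator graph follows. It does not use the GraphX API; the `PropertyGraph` and `Edge` classes, the usernames, and the relationship strings are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: int    # id of the source vertex
    dst: int    # id of the destination vertex
    attr: str   # string describing the relationship


@dataclass(frozen=True)
class PropertyGraph:
    vertices: dict   # vertex id -> (username, occupation)
    edges: tuple     # tuple of Edge objects


# Hypothetical collaborators.
vertices = {
    1: ("alice", "student"),
    2: ("bob", "postdoc"),
    3: ("carol", "professor"),
}
edges = (
    Edge(1, 2, "collaborator"),
    Edge(3, 2, "advisor"),
    Edge(3, 1, "advisor"),
)
graph = PropertyGraph(vertices, edges)

# Reading the vertex and edge views back out of the graph.
students = [name for name, job in graph.vertices.values() if job == "student"]
advisor_edges = [e for e in graph.edges if e.attr == "advisor"]
```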

22 Example: Working with a Property Graph
- Deconstructing a graph into vertex and edge views
  - Use the graph.vertices and graph.edges members
  - Alternatively, pattern match on the Edge case class type constructor

23 Triplet Views
- Logically joins the vertex and edge properties
- RDD[EdgeTriplet[VD, ED]] contains instances of the EdgeTriplet class
- This join can be expressed as a SQL expression or graphically [SQL expression and figure omitted]

24 EdgeTriplet Class
- Extends the Edge class by adding the srcAttr and dstAttr members
- Can be used, for example, to render a collection of strings describing relationships between users
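The triplet view amounts to joining each edge with the properties of its two endpoint vertices. The plain-Python sketch below illustrates that join and the rendering of relationship strings; the `Triplet` tuple, its field names, and the sample data are assumptions made for this example and are not the GraphX classes.

```python
from collections import namedtuple

Edge = namedtuple("Edge", ["src", "dst", "attr"])
# A triplet is an edge plus the properties of its source and destination.
Triplet = namedtuple("Triplet", ["src", "src_attr", "dst", "dst_attr", "attr"])

# Hypothetical users and relationships.
vertices = {1: "alice", 2: "bob", 3: "carol"}
edges = [Edge(1, 2, "collaborates with"), Edge(3, 2, "advises")]


def triplets(vertices, edges):
    """Join every edge with the properties of both of its endpoints."""
    return [
        Triplet(e.src, vertices[e.src], e.dst, vertices[e.dst], e.attr)
        for e in edges
    ]


# Render one descriptive string per relationship.
descriptions = [
    f"{t.src_attr} {t.attr} {t.dst_attr}" for t in triplets(vertices, edges)
]
```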

25 Graph Operators
- Similar to basic RDD operations like map, filter, and reduceByKey
- Core operators have optimized implementations
- Graph operator types:
  - Property operators (mapVertices, mapEdges, mapTriplets)
  - Structural operators (reverse, subgraph, mask, groupEdges)
  - Join operators (joinVertices, outerJoinVertices)
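As a rough illustration of what a property operator and a structural operator do, the plain-Python functions below mimic the behaviour of mapping over vertex properties and reversing edge directions. They are simplified stand-ins written for this example, not the GraphX operators, and the data is hypothetical.

```python
# Hypothetical graph: vertex id -> (username, occupation), edges as tuples.
vertices = {1: ("alice", "student"), 2: ("bob", "postdoc")}
edges = [(1, 2, "collaborator")]


def map_vertices(vertices, fn):
    """Property operator: new vertex properties, same vertex ids."""
    return {vid: fn(vid, attr) for vid, attr in vertices.items()}


def reverse(edges):
    """Structural operator: flip the direction of every edge."""
    return [(dst, src, attr) for src, dst, attr in edges]


usernames = map_vertices(vertices, lambda vid, attr: attr[0])
reversed_edges = reverse(edges)
```

Both functions return new collections and leave their inputs untouched, in keeping with the immutability of property graphs.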

26 Graph Operators - Subgraph
- The subgraph operator takes vertex and edge predicates and returns the graph containing only the vertices that satisfy the vertex predicate, and the edges that satisfy the edge predicate and connect vertices that satisfy the vertex predicate
- The subgraph operator can be used in a number of situations to restrict the graph to the vertices and edges of interest or to eliminate broken links
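The filtering rule can be sketched in plain Python as follows. The `subgraph` function here is a simplified stand-in written for this example, not the GraphX operator, and the data (including the vertex with a missing property that plays the role of a broken link) is hypothetical.

```python
def subgraph(vertices, edges,
             vpred=lambda vid, attr: True,
             epred=lambda src, dst, attr: True):
    """Keep vertices passing vpred, and edges that pass epred
    AND whose two endpoints both survived the vertex filter."""
    kept_vertices = {
        vid: attr for vid, attr in vertices.items() if vpred(vid, attr)
    }
    kept_edges = [
        (src, dst, attr)
        for src, dst, attr in edges
        if src in kept_vertices and dst in kept_vertices
        and epred(src, dst, attr)
    ]
    return kept_vertices, kept_edges


# Vertex 3 has no property (None), so edges touching it are broken links.
vertices = {1: "alice", 2: "bob", 3: None}
edges = [(1, 2, "follows"), (2, 3, "follows"), (2, 1, "blocks")]

# Drop vertices without a property and keep only "follows" edges.
valid_vertices, valid_edges = subgraph(
    vertices,
    edges,
    vpred=lambda vid, attr: attr is not None,
    epred=lambda src, dst, attr: attr == "follows",
)
```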

27 Graph Algorithms - PageRank
- An algorithm created by Google to rank websites in its search engine results, named after Larry Page, one of the founders of Google
- PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is
  - The underlying assumption is that more important websites are likely to receive more links from other websites
- The mathematics of PageRank are entirely general and apply to any graph or network in any domain
  - E.g. Personalized PageRank is used by Twitter to present users with other accounts they may wish to follow
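The idea of ranking by the number and quality of inbound links can be sketched with a textbook power-iteration PageRank in plain Python. This is not GraphX's implementation: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from vertices with no outbound edges, and the sample route list are all assumptions made for the example.

```python
def pagerank(vertices, edges, damping=0.85, iterations=100):
    """Textbook PageRank; returned ranks sum to 1."""
    n = len(vertices)
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices without outbound edges is spread evenly.
        dangling = sum(ranks[v] for v in vertices if not out_links[v])
        new_ranks = {
            v: (1.0 - damping) / n + damping * dangling / n for v in vertices
        }
        # Each vertex passes its rank in equal shares along its out-edges.
        for src, targets in out_links.items():
            if targets:
                share = damping * ranks[src] / len(targets)
                for dst in targets:
                    new_ranks[dst] += share
        ranks = new_ranks
    return ranks


# Hypothetical routes: three airports fly into ATL, ATL flies to ORD.
airports = ["ATL", "ORD", "SFO", "BOS"]
routes = [("ORD", "ATL"), ("SFO", "ATL"), ("BOS", "ATL"), ("ATL", "ORD")]

ranks = pagerank(airports, routes)
top_airport = max(ranks, key=ranks.get)
```

In this sample, ATL ends up ranked highest because it receives the most inbound links, and ORD ranks well above SFO and BOS because its single inbound link comes from a highly ranked vertex.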

28 Graph Algorithms - Triangle Counting
- GraphX implements a triangle counting algorithm
  - A triangle is a small three-node graph in which every two nodes are connected
  - Used in many real world applications as a measure of clustering
- Determines the number of triangles passing through each vertex
  - A vertex is part of a triangle when it has two adjacent vertices with an edge between them
- TriangleCount requires the edges to be in canonical orientation (srcId < dstId) and the graph to be partitioned using Graph.partitionBy
  - E.g. RandomVertexCut collocates all same-direction edges between two vertices by hashing the source and destination vertex IDs
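A plain-Python sketch of per-vertex triangle counting is shown below. It is a simple neighbour-set approach written for this example, not the GraphX algorithm, and it performs the canonical orientation step (smaller id first, duplicates and self-loops dropped) itself. The edge list is hypothetical.

```python
from itertools import combinations


def triangle_count(edges):
    """Return {vertex: number of triangles passing through it}."""
    # Canonical orientation: smaller id first; the set removes duplicates
    # such as (2, 1) versus (1, 2), and self-loops are skipped.
    canonical = {(min(s, d), max(s, d)) for s, d in edges if s != d}

    neighbors = {}
    for s, d in canonical:
        neighbors.setdefault(s, set()).add(d)
        neighbors.setdefault(d, set()).add(s)

    counts = {v: 0 for v in neighbors}
    for v, nbrs in neighbors.items():
        # A triangle through v is a pair of v's neighbours that are
        # themselves connected by an edge.
        for a, b in combinations(sorted(nbrs), 2):
            if b in neighbors[a]:
                counts[v] += 1
    return counts


# Vertices 1, 2, 3 form a triangle; vertex 4 hangs off vertex 3.
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (2, 1)]
counts = triangle_count(edges)
```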

30 Demo Scenario
- Explore and analyze airline data
  - Vertices represent airports
  - Edges represent flights between airports and their associated distance
- Use a number of operators provided by GraphX to analyze data in the graph and the relationships between the data
  - E.g. find the airports with the greatest number of inbound and outbound flights
- Employ graph operators to transform the graphs into new graphs
  - Based on transformation criteria, like the distance between airports
- Employ graph algorithms included with GraphX, like PageRank and Triangle Counting, to determine the busiest airports

31 Demo Scenario Data
- Airline data in CSV format is readily available on the US Bureau of Transportation Statistics (BTS) website
- This demo employs US flight data for March

32 Demo Flow
- Download the data (CSV format)
- Read in the CSV file as a DataFrame (infer the schema)
- Clean up the DataFrame
  - Remove the blank column and rows that contain nulls
- Convert the DataFrame to an RDD
  - Use a custom case class
  - GraphX is based on RDDs, so the DataFrame must be converted into an RDD
- Extract data (airport IDs and airport codes) for the graph vertices
- Extract data (origin airport ID, destination airport ID, distance between airports) for the graph edges
- Create the EdgeRDD
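The clean-up and extraction steps can be sketched in plain Python with the standard csv module. This is not the demo's Spark code: the column names, the trailing comma that produces a blank column, and every value in `sample_csv` are assumptions invented for illustration rather than taken from the BTS data set.

```python
import csv
import io
from collections import namedtuple

# Stand-in for the custom case class used to type each record.
Flight = namedtuple(
    "Flight", ["origin_id", "origin", "dest_id", "dest", "distance"]
)

# Made-up sample: the trailing comma yields a blank column, and the
# third data row is missing its distance.
sample_csv = """ORIGIN_AIRPORT_ID,ORIGIN,DEST_AIRPORT_ID,DEST,DISTANCE,
1,AAA,2,BBB,600.0,
2,BBB,1,AAA,600.0,
1,AAA,3,CCC,,
3,CCC,2,BBB,1800.0,
"""


def load_flights(text):
    flights = []
    for row in csv.DictReader(io.StringIO(text)):
        row.pop("", None)  # drop the blank trailing column
        if any(value == "" for value in row.values()):
            continue       # skip rows that contain missing values
        flights.append(
            Flight(
                int(row["ORIGIN_AIRPORT_ID"]),
                row["ORIGIN"],
                int(row["DEST_AIRPORT_ID"]),
                row["DEST"],
                float(row["DISTANCE"]),
            )
        )
    return flights


flights = load_flights(sample_csv)

# Vertices: airport id -> airport code.
vertices = {}
for f in flights:
    vertices[f.origin_id] = f.origin
    vertices[f.dest_id] = f.dest

# Edges: (origin airport id, destination airport id, distance).
edges = [(f.origin_id, f.dest_id, f.distance) for f in flights]
```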

33 Example Demo Graph [figure]

34 Demo Flow (continued)
- Create the graph
- Investigate the graph
  - Show the vertices
  - Count the number of vertices/airports
  - Show the edges/flights
  - Count the number of edges/flights and distinct routes
- Query the graph based on vertex and edge attributes and properties
- Create a triplet view of the graph
- Query the triplet view
- Compute the highest degree vertices (in, out, and total)
- Calculate PageRank values for the graph vertices
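Counting flights, distinct routes, and vertex degrees reduces to simple tallies over the edge list. The plain-Python sketch below shows the idea on a hypothetical handful of flights; it is not the demo's GraphX code.

```python
from collections import Counter

# Hypothetical flights as (origin, destination); one route appears twice.
flights = [
    ("ATL", "ORD"),
    ("ATL", "ORD"),
    ("ORD", "ATL"),
    ("SFO", "ATL"),
    ("ATL", "BOS"),
]

flight_count = len(flights)        # number of edges
distinct_routes = set(flights)     # unique (origin, destination) pairs

out_degrees = Counter(src for src, _ in flights)   # departures per airport
in_degrees = Counter(dst for _, dst in flights)    # arrivals per airport
total_degrees = in_degrees + out_degrees

# Airport with the highest total degree and its degree.
busiest_airport, busiest_degree = total_degrees.most_common(1)[0]
```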

35 Demo Flow (conclusion)
- Create a subgraph
- Explore the subgraph
  - Using both vertex predicates and edge predicates
- Create a subgraph for Triangle Counting
  - TriangleCount requires the edges to be in canonical orientation
  - It also requires the graph to be partitioned
- Create a Triangle Count graph
- Investigate the vertices/airports with the highest triangle count

38 Summary
- Graphs provide a powerful way to model and analyze connected data
- GraphX builds on the massively parallel, fault-tolerant foundation of Spark to provide graph processing
- Spark provides the ability to complement graph processing with relational processing in a single consistent framework and set of APIs
- GraphX is a graph processing system and not a database
- GraphX provides a number of operators and algorithms to facilitate working with and understanding the connections in the data

39 GraphX Challenges
- Scala API only
  - No Python or Java APIs
- Utilizes the lower level RDD-based API (vs. DataFrames)
  - Does not benefit from Spark DataFrame optimizations such as the Catalyst query optimizer or Tungsten memory management

40 Enter Spark GraphFrames
- DataFrame-based graphs for Apache Spark
  - Vertices and edges are represented as DataFrames
  - Enables arbitrary data to be stored with each vertex and edge
- Python, Java and Scala APIs
- Simplified interactive queries
  - Phrase queries in the familiar, powerful Spark SQL and DataFrame APIs
- Supports motif finding for structural pattern search
  - For example, to recommend whom to follow, you might search for triplets of users A, B, C where A follows B and B follows C, but A does not follow C
- Benefits from Spark DataFrame optimizations
- GraphFrames fully integrate with GraphX via conversions between the two representations
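The follow-recommendation pattern can be sketched in plain Python as a search over pairs of edges. This does not use the GraphFrames motif syntax; the `recommend` function and the sample follow relationships are hypothetical and only illustrate the structural pattern being searched for.

```python
def recommend(follows):
    """Return (a, c) pairs where a follows b, b follows c,
    and a does not already follow c."""
    recommendations = set()
    for a, b in follows:
        for b2, c in follows:
            # Chain the two edges through the shared middle user b.
            if b2 == b and c != a and (a, c) not in follows:
                recommendations.add((a, c))
    return recommendations


# Hypothetical follow relationships as directed (follower, followed) pairs.
follows = {("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")}
suggestions = recommend(follows)
```

The nested loop compares every edge with every other edge, which is fine for a sketch but would not scale to a large graph.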
