Turning Relational Database Tables into Spark Data Sources

3 Turning Relational Database Tables into Spark Data Sources
Kuassi Mensah, Director Product Mgmt
Jean de Lavarene, Director Development
Server Technologies
October 04,

4 Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

5 Speaker Bio - Kuassi Mensah
- Director of Product Management at Oracle:
  (i) Java integration with the Oracle database (JDBC, UCP, Java in the database)
  (ii) Oracle Datasource for Hadoop (OD4H), upcoming OD for Spark, OD for Flink and so on
  (iii) JavaScript/Nashorn integration with the Oracle database (DB access, JS stored procedures, fluent JS)
- MS CS from the Programming Institute of University of Paris
- Frequent speaker: JavaOne, Oracle Open World, Data Summit, Node Summit, Oracle user groups (UKOUG, DOAG, OUGN, BGOUG, OUGF, GUOB, ArOUG, ORAMEX, Sangam, OTN Yathra, China, Thailand, etc.)
- Author: Oracle Database Programming using Java and Web

6 Speaker Bio - Jean de Lavarene
- Director of Product Development at Oracle:
  (i) Java integration with the Oracle database (JDBC, UCP)
  (ii) Oracle Datasource for Hadoop (OD4H), upcoming OD for Spark, OD for Flink and so on
- MS CS from Ecole des Mines, Paris

7 Program Agenda
- Requirements
- Apache Spark
- RDBMS Table as Spark Datasource
- Performance, Scalability and Security Optimizations
- Demo and Wrap up

9 Requirements

10 Big Data Analytics and Requirements
- Goal: furnish actionable information to support business decision-making.
- Example: Which of our products got a rating of four stars or higher on social media in the last quarter?
- Data sources: Master Data (RDBMS) and Big Data (HDFS, NoSQL)
Copyright 2017, Oracle and/or its affiliates. All rights reserved.

12 Apache Spark

13 Apache Spark - Core Architecture
- Libraries: Spark SQL, Spark Streaming, MLlib, GraphX
- DataFrame API
- Spark Core
- Data Source API
- Cluster Manager/Scheduler: Mesos, YARN or Standalone
- RDBMS

14 Apache Spark Concepts
- Processes data in memory
- RDD: fault-tolerance abstraction for in-memory data sharing
  - Immutable and partitioned datasets; user-controlled partitioning and persistence
  - Coarse-grained transformations of one RDD into another
  - Stores the lineage (the graph of transformations), so lost data can be reconstructed in case of failure
  - Ensures exactly-once processing
- Dataframe: conceptually equivalent to a table in a relational database; allows running Spark SQL queries over its data.

15 Apache Spark Summary
- Can process data in HDFS, HBase, Cassandra, Hive and any Hadoop InputFormat.
- DataSource API: built-in support for Hive, Avro, JSON, Parquet and JDBC.
- Spark SQL: operates on a variety of data sources through the dataframe interface.
- Can run in Hadoop clusters through YARN or in Spark's standalone mode.
- Spark Streaming: for real-time streaming data processing; based on micro-batching.

16 Apache Spark Data Points
- Spark apps on Hadoop clusters can run up to 100 times faster in memory and 10 times faster on disk than Hadoop MapReduce.
- Sorts 100 TB of data 3X faster than Hadoop MapReduce on 1/10th of the machines.
- The largest known Spark cluster has 8000 nodes.
- More than 1000 organizations are using Spark in production.

17 How Apache Spark Works
- Create a dataset from external data, then apply parallel operations to it; all work is expressed as
  - transformations: creating new RDDs or transforming existing RDDs
  - actions: calling operations on RDDs that return a result
- The execution plan is a Directed Acyclic Graph (DAG) of operations
- Every Spark program and shell session works as follows:
  0. Spark Context: the main entry point to the Spark APIs.
  1. Create some input RDDs from external data.
  2. Transform them to define new RDDs using transformations like filter().
  3. Ask Spark to persist any intermediate RDDs that will need to be reused.
  4. Launch actions such as count() and first() to kick off a parallel computation, which is optimized and executed by the Spark executors.
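The split between lazy transformations and eager actions can be illustrated without a cluster. The following is a minimal sketch in plain Python, not Spark code: the class name LazyDataset and its methods are hypothetical stand-ins that only mimic the behaviour described on this slide.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source        # any re-iterable collection of records
        self._steps = tuple(steps)   # recorded (kind, function) pairs, i.e. the lineage

    # Transformations: return a new dataset and compute nothing yet.
    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def _run(self):
        # Replay the recorded lineage over the source, one record at a time.
        for record in self._source:
            keep = True
            for kind, func in self._steps:
                if kind == "filter":
                    if not func(record):
                        keep = False
                        break
                else:
                    record = func(record)
            if keep:
                yield record

    # Actions: trigger the computation and return a result.
    def count(self):
        return sum(1 for _ in self._run())

    def first(self):
        return next(self._run())


numbers = LazyDataset(range(10))
evens_squared = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
print(evens_squared.count())  # 5
print(evens_squared.first())  # 0
```

Because each transformation only appends to the recorded steps, the original dataset stays immutable and the result can always be recomputed from the source, which is the idea behind lineage-based recovery.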

18 Basic Spark Example (Python)
    # spark: the Spark Context
    lines = spark.textFile("hdfs://...")                          # HadoopRDD
    errors = lines.filter(lambda line: line.startswith("ERROR"))  # FilteredRDD
    messages = errors.map(lambda line: line.split("\t")[2])       # MappedRDD
    messages.persist()
Lineage: HDFS -> HadoopRDD -> FilteredRDD -> MappedRDD

19 Spark Workflow
- Spark Driver: executes the user application (RDD objects, Spark Context)
- DAG Scheduler: takes the operator DAG, decides what to run and splits the graph into tasks (TaskSet)
- Cluster Manager (Task Scheduler): Mesos, YARN or Standalone
- Worker Nodes: each runs an Executor with a cache and tasks
- Data Nodes

20 Spark Streaming

21 Streaming Data Processing: What, Where, When, How
- What results are calculated? -> transformations within the pipeline: sums, histograms, ML models
- Where in event time are results calculated? -> event-time windowing within the pipeline
- When in processing time are results materialized? -> the use of watermarks and triggers
- How do refinements of results relate? -> type of accumulation used: discarding, accumulating, or accumulating and retracting

22 Streaming Data Processing Concepts
- Stream processing: analyze a fragment/window of a data stream
- Low latency: sub-second
- Windowing: fixed/tumbling, sliding, session
- Timestamp: event time, ingestion time, processing time (cf. Star Wars)
- Watermark: when a window is considered done (input completeness with respect to event times)
- In-order processing, out-of-order processing
- Punctuation: segments a stream into tuples; a control signal for operators
- Triggers: when to materialize the output of the computation, based on watermarks, event time, processing time or punctuations
- Accumulation: disjoint or overlapping results observed for the same window
- Delivery guarantee: at least once, exactly once, end-to-end exactly once
- Event prioritization
- Backpressure support
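To make the windowing and watermark terms concrete, here is a small sketch in plain Python. It is not tied to any streaming engine: the function names are hypothetical, and event times are assumed to be integer seconds.

```python
def tumbling_window(event_time, size):
    """Return the [start, end) tumbling window that contains event_time."""
    start = (event_time // size) * size
    return (start, start + size)


def sliding_windows(event_time, size, slide):
    """Return every [start, end) sliding window that contains event_time."""
    windows = []
    # Latest window start that is <= event_time and aligned to the slide.
    start = (event_time // slide) * slide
    while start > event_time - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def is_window_complete(window, watermark):
    """A window is considered done once the watermark reaches its end."""
    return watermark >= window[1]


print(tumbling_window(17, size=10))                # (10, 20)
print(sliding_windows(17, size=10, slide=5))       # [(10, 20), (15, 25)]
print(is_window_complete((10, 20), watermark=18))  # False
```

With tumbling windows each event lands in exactly one window, while with sliding windows an event can belong to several overlapping ones, which is why accumulation can yield overlapping results for the same data.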

24 RDBMS Table as Spark Datasource

25 Spark SQL
A library including the following APIs and services:
- The Data Source API: a universal API for loading and saving structured data
- The DataFrame API: produces a distributed collection of data organized into named columns
- The SQL Interpreter and Optimizer
- The SQL Service: a Hive Thrift server
Stack: Dataframe DSL, Spark SQL, HQL -> Dataframe API -> Datasource API -> JDBC -> RDBMS

26 Plain JDBC or RDBMS Connector
JDBCRDD + schema = Dataframe
Java:
    Map<String, String> options = new HashMap<String, String>();
    options.put("url", "jdbc:<rdbms>");
    options.put("dbtable", "schema.tablename");
    DataFrame df = sqlContext.read().format("jdbc").options(options).load();
Scala:
    val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> "jdbc:<rdbms>", "dbtable" -> "schema.tablename")).load()
- Plain JDBC supports predicate pushdown and a basic partitioner
- DBMS/RDBMS connectors furnish more optimizations
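Predicate pushdown means the filter is evaluated by the database rather than by the client, so fewer rows cross the wire. The sketch below shows the effect with Python's built-in sqlite3 module standing in for the RDBMS; the table name, columns and rows are invented for illustration and nothing here is Spark or Oracle code.

```python
import sqlite3

# Hypothetical stand-in for the RDBMS table; names and rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee_data (emp_id INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee_data VALUES (?, ?)",
    [(1, 40000), (2, 52000), (3, 61000), (4, 75000)],
)

# Without pushdown: fetch every row, then filter on the client.
all_rows = conn.execute("SELECT emp_id, salary FROM employee_data").fetchall()
client_filtered = [row for row in all_rows if row[1] < 56000]

# With pushdown: the predicate travels to the database in the WHERE clause.
pushed_down = conn.execute(
    "SELECT emp_id, salary FROM employee_data WHERE salary < ?", (56000,)
).fetchall()

print(len(all_rows), "rows transferred without pushdown")
print(len(pushed_down), "rows transferred with pushdown")
```

Both approaches return the same rows; the difference is how many rows the database has to ship before the filter is applied.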

28 Performance, Scalability and Security Optimizations

29 Optimizations in DBMS Connectors
Many DBMS/RDBMS vendors furnish their own connectors, in which they implement optimizations not available in the plain Spark JDBC datasource.
Potential optimizations in the Oracle implementation:
- Custom partitioners or splitters
- Partition pruning
- Fast JDBC type conversion
- Connection properties, e.g., fetch size
- Connection caching
- Strong authentication, encryption and integrity

30 Partitioners
- Control the number of parallel tasks to run against the RDBMS table
- Efficient logical table partitioning
- Generate RDBMS SQL queries for each partition
Diagram: Worker Node (Executor: Cache, Tasks) x 2 -> Oracle impl -> JDBC -> Table
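As a rough illustration of "one SQL query per partition", the sketch below splits a numeric column range into WHERE clauses and runs one query per clause. It is a simplified, hypothetical partitioner in plain Python with sqlite3 standing in for the RDBMS; the function name, table and bounds are assumptions, and it is not the Oracle or Spark implementation.

```python
import sqlite3


def range_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper) on `column` into WHERE clauses, one per partition."""
    if num_partitions <= 1:
        return ["1 = 1"]  # single partition: the whole table
    stride = max((upper - lower) // num_partitions, 1)
    bounds = [lower + i * stride for i in range(1, num_partitions)]
    predicates = []
    for i in range(num_partitions):
        if i == 0:
            # First partition is open below so small values and NULLs are not lost.
            predicates.append(f"{column} < {bounds[0]} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open above so large values are not lost.
            predicates.append(f"{column} >= {bounds[-1]}")
        else:
            predicates.append(
                f"{column} >= {bounds[i - 1]} AND {column} < {bounds[i]}"
            )
    return predicates


# Hypothetical table with 100 rows, emp_id 0..99.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee_data (emp_id INTEGER)")
conn.executemany("INSERT INTO employee_data VALUES (?)", [(i,) for i in range(100)])

queries = [
    f"SELECT emp_id FROM employee_data WHERE {predicate}"
    for predicate in range_predicates("emp_id", 0, 100, 4)
]
partitions = [conn.execute(query).fetchall() for query in queries]
print([len(rows) for rows in partitions])  # [25, 25, 25, 25]
```

Each query could be handed to a separate task; together the partitions cover every row exactly once, so the number of partitions directly bounds the number of parallel reads against the table.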

31 Probable Partitioners for the Oracle Database
See the similar split definitions in ...
- SINGLE_SPLITTER: no parallelism, the whole table as a single unit
- ROW_SPLITTER: create several splits based on row count
- BLOCK_SPLITTER: create several splits based on block count
- PARTITION_SPLITTER: align splits on the table partitions, i.e., 1 split per table partition
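A row-count based split can be pictured as dividing the table's rows into contiguous, near-equal ranges. The following plain-Python sketch shows only that arithmetic; the function name and the (offset, count) representation are hypothetical and do not reflect how the Oracle splitters are actually implemented.

```python
def row_splits(total_rows, max_partitions):
    """Divide total_rows into at most max_partitions contiguous (offset, count) splits."""
    if total_rows <= 0:
        return []
    partitions = min(max_partitions, total_rows)
    base, extra = divmod(total_rows, partitions)
    splits, offset = [], 0
    for i in range(partitions):
        # The first `extra` splits take one additional row each.
        count = base + (1 if i < extra else 0)
        splits.append((offset, count))
        offset += count
    return splits


print(row_splits(10, 4))  # [(0, 3), (3, 3), (6, 2), (8, 2)]
print(row_splits(10, 1))  # [(0, 10)]
```

With a single partition the whole table forms one split, which corresponds to the no-parallelism case listed above.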

32 Dataframe Creation
SQLContext: the entry point into all functionality in Spark SQL
    $ spark-shell --jars file:///home/spark/od4s/jlib/ojdbc7.jar,file:///home/spark/od4s/jlib/ucp.jar,file:///home/spark/od4s/jlib/...
    scala> val df = sqlContext.read.format("the oracle spark datasource").
             option("url", "jdbc:oracle:thin:@localhost:1521/pdb1.localdomain").
             option("driver", "oracle.jdbc.OracleDriver").
             option("dbtable", "EmployeeData").
             option("user", "hr").
             option("password", "hr").
             option("oracle.jdbc.spark.partitionertype", "block_splitter").
             option("oracle.jdbc.spark.maxpartitions", "4").
             load()

33 Dataframe Operations
load() only creates the dataframe; rows are fetched when an operation such as show runs.
    scala> df.show
    scala> df.count()
    scala> df.printSchema
    scala> df.filter("emp_id = 79272").show
    scala> df.first()
    scala> df.select("emp_id", "JOB_TITLE").show
    scala> df.filter("salary < 56000").show

34 Other Optimizations
- Fast JDBC type conversion
- Connection properties, e.g., fetch size
- Connection caching
- Strong authentication, encryption and integrity

36 Demo and Wrap-up
