Hadoop, Spark, Flink, and Beam Explained to Oracle DBAs: Why They Should Care


Kuassi Mensah, Director, Product Management
Jean de Lavarene, Director, Development
Server Technologies
October 04, 2017

Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.

Speaker Bio - Kuassi Mensah
- Director of Product Management at Oracle:
  (i) Java integration with the Oracle database (JDBC, UCP, Java in the database)
  (ii) Oracle Datasource for Hadoop (OD4H), upcoming OD for Spark, OD for Flink, and so on
  (iii) JavaScript/Nashorn integration with the Oracle database (DB access, JS stored procedures, fluent JS)
- MS CS from the Programming Institute of the University of Paris VI
- Frequent speaker: JavaOne, Oracle Open World, Data Summit, Node Summit, Oracle user groups (UKOUG, DOAG, OUGN, BGOUG, OUGF, GUOB, ArOUG, ORAMEX, Sangam, OTNYathra, China, Thailand, etc.)
- Author: Oracle Database Programming using Java and Web

Speaker Bio - Jean de Lavarene
- Director of Product Development at Oracle:
  (i) Java integration with the Oracle database (JDBC, UCP)
  (ii) Oracle Datasource for Hadoop (OD4H), upcoming OD for Spark, OD for Flink, and so on
- MS CS from Ecole des Mines, Paris

Program Agenda
- From Big Data to Fast Data
- Apache Hadoop
- Apache Spark
- Apache Flink
- Apache Beam
- Why Should Oracle DBAs Care


Data at Rest
- Big Data analysis
- Parallel batch processing: Map/Reduce
- Exponential growth of data volume
- Projection: 44 zettabytes (44 trillion GB) of data in the digital universe by 2020
- Every online user will generate ~1.7 MB of new data per second
- Need massive-scale infrastructure in the Cloud or on-premises
- Google Search: 40,000+ search queries per second
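The figures on this slide can be sanity-checked with a few lines of arithmetic. This is only a sketch: it assumes decimal (SI) units and a user who generates data around the clock, neither of which the slide states.

```python
# Assumption: decimal (SI) units; the slide does not say which convention it uses.
MB = 10**6
GB = 10**9
ZB = 10**21
SECONDS_PER_DAY = 86_400

# 44 zettabytes expressed in gigabytes
digital_universe_gb = 44 * ZB // GB

# 1.7 MB per second, sustained for a full day, per online user
per_user_gb_per_day = 1.7 * MB * SECONDS_PER_DAY / GB

# 40,000 search queries per second, sustained for a full day
searches_per_day = 40_000 * SECONDS_PER_DAY

print(f"{digital_universe_gb:,} GB in the digital universe")
print(f"{per_user_gb_per_day:.2f} GB per user per day")
print(f"{searches_per_day:,} searches per day")
```

Under these assumptions, 44 ZB is indeed 44 trillion GB, and a single user at 1.7 MB per second would produce roughly 147 GB per day.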

Fast Data: Streaming Data
- Business shift from reactive to proactive interactions
- Need to process data as it enters the system
- How fast can you analyze your data and gain insights?
- Fast data: unbounded data, continuous flows of events
- Need a new processing model: stream processing of unbounded data
- New processing frameworks: Spark Streaming, Flink, Beam, ...

Streaming Data Processing: What, Where, When, How?
- What results are calculated? Transformations within the pipeline: sums, histograms, ML models
- Where in event time are results calculated? Event-time windowing within the pipeline
- When in processing time are results materialized? The use of watermarks and triggers
- How do refinements of results relate? Type of accumulation used: discarding, accumulating, or accumulating and retracting
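The "how do refinements relate" question is easiest to see with numbers. The sketch below is plain Python, not any framework's API: it takes one window whose trigger fires three times and shows what each accumulation mode would emit for a running sum. The function name `panes` and the sample values are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def panes(firings: List[List[int]], mode: str) -> List[Tuple[int, Optional[int]]]:
    """Return (emitted_sum, retracted_sum) for each trigger firing of one window."""
    out = []
    running = 0        # sum of everything seen so far in this window
    previous = None    # last value emitted, needed for retractions
    for new_values in firings:
        delta = sum(new_values)
        running += delta
        if mode == "discarding":
            # Each pane reports only the data that arrived since the last firing.
            out.append((delta, None))
        elif mode == "accumulating":
            # Each pane reports the full result so far.
            out.append((running, None))
        elif mode == "accumulating_and_retracting":
            # Each pane reports the full result and withdraws the previous one.
            out.append((running, previous))
            previous = running
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out


# Hypothetical input: values that arrived before each of three trigger firings.
firings = [[3], [4, 5], [1]]
for mode in ("discarding", "accumulating", "accumulating_and_retracting"):
    print(mode, panes(firings, mode))
```

A downstream consumer that sums the panes itself needs discarding mode; one that overwrites the previous result needs accumulating mode; one that regroups results by a different key needs retractions as well.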

Streaming Data Processing Concepts
- Stream processing: analyze a fragment/window of a data stream
- Windowing: fixed/tumbling, sliding, session
- Low latency: sub-second
- Timestamp: event time, ingestion time, processing time (cf. Star Wars)
- Watermark: when a window is considered done (completeness with respect to event times)
- In-order processing, out-of-order processing
- Punctuation: segments a stream into tuples; control signal for operators
- Triggers: when to materialize the output of the computation (watermark, event time, processing time, punctuations)
- Accumulation: disjoint or overlapping results observed for the same window
- Delivery guarantee: at least once, exactly once, end-to-end exactly once
- Event prioritization
- Backpressure support
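To make windowing and watermarks concrete, here is a minimal plain-Python sketch, not tied to any framework. It assigns events to fixed (tumbling) windows by event time and closes a window once a watermark passes its end. The watermark heuristic (largest event time seen minus a fixed delay), the constants, and the sample events are all assumptions chosen for illustration.

```python
from collections import defaultdict

WINDOW = 10         # window length in seconds (hypothetical)
ALLOWED_DELAY = 5   # watermark trails the largest event time seen by this much (hypothetical)


def window_start(event_time):
    """Start of the fixed (tumbling) window that contains event_time."""
    return event_time - event_time % WINDOW


def run(events):
    """Process (event_time, value) pairs in arrival order."""
    open_windows = defaultdict(list)
    closed, late = {}, []
    max_seen = 0
    for event_time, value in events:
        start = window_start(event_time)
        if start in closed:
            # The window was already declared complete: the event is late.
            late.append((event_time, value))
        else:
            open_windows[start].append(value)
        max_seen = max(max_seen, event_time)
        watermark = max_seen - ALLOWED_DELAY
        # Close every window whose end is at or before the watermark.
        for s in sorted(open_windows):
            if s + WINDOW <= watermark:
                closed[s] = open_windows.pop(s)
    return closed, dict(open_windows), late


# Event 7 arrives out of order but before its window closes; event 3 arrives too late.
events = [(1, "a"), (12, "b"), (7, "c"), (23, "d"), (3, "e")]
closed, still_open, late = run(events)
print(closed, still_open, late)
```

The out-of-order event with timestamp 7 still lands in window [0, 10) because the watermark had not yet passed 10, while the event with timestamp 3 arrives after that window was closed and is reported as late.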


Apache Hadoop 1.0
- First open-source MapReduce framework and ecosystem
- Processing model: batch
- 2004: HDFS + MapReduce (Python)
- 2006: Apache Hadoop (Java)
- 2009: 1 TB sort in 209 sec
- 2010: 100 TB sort in 173 min
- 2014: 100 TB sort in 72 min
- Diagram: Hadoop cluster (e.g., Big Data Appliance) whose cluster nodes run Mappers, which produce intermediate data consumed by Reducers
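The mapper, intermediate data, and reducer flow in the diagram can be sketched in a single Python process. This is a toy word count that only mimics the programming model; it is not Hadoop code, and the function names and sample lines are illustrative.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle phase: group the intermediate pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: combine all values that share a key."""
    return key, sum(values)


lines = ["big data", "fast data", "big big data"]
intermediate = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(key, values) for key, values in shuffle(intermediate).items())
print(counts)
```

In a real cluster each mapper and reducer runs on a different node and the intermediate data is written to disk between the phases, which is the materialization cost listed among Hadoop's limitations later in the deck.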

Apache Hadoop 2.0
- Compute and query engines: MapReduce, Hive SQL, Impala, Spark SQL, Mahout (ML libs)
- Cluster management (YARN): compute resources + scheduler
- Data resources: HCatalog, InputFormat, StorageHandler
- Storage: HDFS (redundant storage), NoSQL, RDBMS (External Table, Storage Handler)


Oracle Datasource for Hadoop (OD4H)
- Turn Oracle database tables into Hadoop datasources
- Direct, parallel, fast, secure, and consistent access to the Oracle database
- Logical partitioning of the database tables
- Join Big Data and Master Data
- Write back to Oracle
- Components: HCatalog, StorageHandler, InputFormat, JDBC, Table
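The idea behind logical partitioning is that each parallel task reads its own slice of a table. The sketch below shows one generic way to do that, by dividing a numeric key range into contiguous sub-ranges and issuing one query per slice. It is not OD4H's actual splitting algorithm, which the slides do not describe; the table name `EMPLOYEES`, the column `EMPLOYEE_ID`, and both helper functions are hypothetical.

```python
def key_range_splits(low, high, num_splits):
    """Divide the inclusive key range [low, high] into contiguous sub-ranges."""
    total = high - low + 1
    base, extra = divmod(total, num_splits)
    splits, start = [], low
    for i in range(num_splits):
        # The first `extra` splits take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        splits.append((start, start + size - 1))
        start += size
    return splits


def split_query(table, key, lo, hi):
    """Build the query one parallel task would run for its slice (illustrative only)."""
    return f"SELECT * FROM {table} WHERE {key} BETWEEN {lo} AND {hi}"


# Hypothetical table and key column.
splits = key_range_splits(1, 10, 3)
queries = [split_query("EMPLOYEES", "EMPLOYEE_ID", lo, hi) for lo, hi in splits]
for query in queries:
    print(query)
```

Each query could then be handed to a separate mapper, so the table is read in parallel without any change to how it is physically stored.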

Hadoop: Real-World Use Cases
- Airbus uses Big Data Appliance to improve flight testing
- BAE Systems chooses Big Data Appliance for critical projects
- AMBEV chose Oracle's Big Data Cloud Service to expedite its database integration needs
- Big Data Discovery helps CERN understand the universe

Hadoop: Strengths & Limitations
Strengths:
- Good for batch processing of data at rest, e.g., association rules mining
- Inexpensive disk storage, so it can handle enormous datasets
Limitations:
- Limited to batch processing: not suitable for streaming data processing
- Static partitioning
- Materialization on each job step
- Complex processing requires multi-staging
- Disk-based operations prevent data sharing for interactive ad hoc queries


Apache Spark Summary
- 2009: AMPLab; based on micro-batching; for batch and stream processing
- Sorted 100 TB 3X faster than Hadoop MapReduce on 1/10th of the platform
- RDD: fault-tolerant abstraction for in-memory data sharing; Pair RDDs hold key/value pairs
- Spark Streaming: for real-time streaming data processing
- DataSource API: over Avro, JSON, CSV, Parquet, and JDBC (built-in Hive)
- DataFrame: high-level concept on top of RDD; equivalent to a table
- Spark applications: RDD, DataFrame, or ML APIs in Scala, Python, Java
- Spark shell: interactive command line for Scala and Python
- Spark SQL: relational processing; operates on data sources through the DataFrame interface

Apache Spark - Core Architecture
- Libraries: Spark SQL, Spark Streaming, MLlib, GraphX
- DataFrame API
- Spark Core
- Data Source API (to external sources such as an RDBMS)
- Cluster manager/scheduler: Mesos, YARN, or Standalone

[Figure: Spark Core Architecture]

How Apache Spark Works
Create a dataset from external data, then apply parallel operations to it. All work is expressed as:
- transformations: creating new RDDs, or transforming existing RDDs
- actions: operations on RDDs that compute and return a result
The execution plan is a Directed Acyclic Graph (DAG) of operations.
Every Spark program and shell session works as follows:
0. Spark Context: main entry point to the Spark APIs.
1. Create some input RDDs from external data.
2. Transform them to define new RDDs using transformations like filter().
3. Ask Spark to persist any intermediate RDDs that will need to be reused.
4. Launch actions such as count() and first() to kick off a parallel computation, which is optimized and executed by the Spark executors.
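The split between transformations and actions can be illustrated without a cluster. The class below is a toy stand-in written in plain Python, not the Spark API: transformations only record a step, actions run the recorded steps, and persist() keeps the computed rows for reuse. The class name `LazyDataset`, the `read_log` function, and the sample log lines are all hypothetical.

```python
class LazyDataset:
    """Toy stand-in for an RDD. This is NOT the Spark API."""

    def __init__(self, read_source, steps=()):
        self._read_source = read_source   # callable that re-reads the external data
        self._steps = steps               # recorded transformations (the lineage)
        self._cached = None
        self._keep = False

    # Transformations: record a step and return a new dataset; nothing runs yet.
    def filter(self, predicate):
        return LazyDataset(self._read_source, self._steps + (("filter", predicate),))

    def map(self, fn):
        return LazyDataset(self._read_source, self._steps + (("map", fn),))

    def persist(self):
        self._keep = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        rows = list(self._read_source())
        for kind, fn in self._steps:
            if kind == "filter":
                rows = [r for r in rows if fn(r)]
            else:
                rows = [fn(r) for r in rows]
        if self._keep:
            self._cached = rows
        return rows

    # Actions: trigger the computation and return a value.
    def count(self):
        return len(self._compute())

    def first(self):
        return self._compute()[0]


reads = []  # records every time the external data is actually read


def read_log():
    reads.append("read")
    return ["ERROR\tdisk\tfull", "INFO\tcpu\tok", "ERROR\tnet\tdown"]


lines = LazyDataset(read_log)
errors = lines.filter(lambda line: line.startswith("ERROR"))
messages = errors.map(lambda line: line.split("\t")[2]).persist()
reads_before_action = len(reads)    # no action yet, so nothing has been read

error_count = messages.count()      # first action: reads the data and runs the steps
first_message = messages.first()    # second action: served from the persisted rows
reads_after_actions = len(reads)
print(reads_before_action, error_count, first_message, reads_after_actions)
```

Defining the three datasets reads nothing; the first action reads the source once, and the second action reuses the persisted result instead of reading it again.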

Basic Spark Example (Python)
    # `spark` is assumed to be a SparkContext
    lines = spark.textFile("hdfs://...")                            # HadoopRDD
    errors = lines.filter(lambda line: line.startswith("ERROR"))    # FilteredRDD
    messages = errors.map(lambda line: line.split("\t")[2])         # MappedRDD
    messages.persist()
Lineage: HDFS -> HadoopRDD -> FilteredRDD -> MappedRDD

Spark Workflow
- Spark Driver: executes the user application; holds the RDD objects and the Spark Context; builds the operator DAG
- DAG Scheduler: decides what to run and splits the graph into tasks (TaskSet)
- Cluster Manager (Task Scheduler): Mesos, YARN, or Standalone
- Worker Nodes: each runs an Executor with a cache and tasks, next to a Data Node

Spark Real-World Use Cases
- More than 1000 organizations are using Spark in production
- The largest known Spark cluster has 8000 nodes
- Security, finance: fraud/intrusion detection, risk-based authentication
- Log processing, BI/reporting/ETL
- Mobile usage pattern analysis
- Predictive analytics, data exploration
- Game industry: real-time discovery of patterns in in-game events
- E-commerce: real-time transactions using streaming clustering algorithms
- And so on!

Spark Strengths and Limitations
Strengths:
- Speed: in-memory processing (RDDs allow in-memory data sharing)
- High throughput
- Correct under stress: strongly consistent
- Supports processing time for events (in-order processing)
- Spark Streaming: sub-second buffering increments
Limitations:
- Latency of micro-batching (batch first)
- Inability to fit windows to naturally occurring events
- Supports only tumbling/sliding windows
- No event-time windowing (out-of-order processing)
- No watermark support
- Triggers: at the end of the window only
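The difference between grouping by processing time, as micro-batching does, and grouping by event time can be shown with a handful of records. The sketch below is plain Python, not Spark code; the 10-second interval, the `bucket` helper, and the sample records are illustrative assumptions.

```python
from collections import defaultdict

INTERVAL = 10  # seconds; hypothetical batch interval and window length


def bucket(records, time_index):
    """Group record names by the interval that contains the chosen timestamp."""
    groups = defaultdict(list)
    for record in records:
        t = record[time_index]
        groups[t - t % INTERVAL].append(record[0])
    return dict(groups)


# (name, event_time, arrival_time): "b" and "d" are delayed in transit.
records = [("a", 2, 3), ("b", 8, 14), ("c", 11, 12), ("d", 19, 25)]

by_arrival = bucket(records, 2)   # what a micro-batch keyed on arrival time sees
by_event = bucket(records, 1)     # what event-time windowing would produce
print(by_arrival)
print(by_event)
```

Record "b" happened in the first ten seconds but arrived in the second batch, so a result computed per micro-batch attributes it to the wrong interval. Event-time windowing, combined with watermarks, is what corrects this.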


Apache Flink
- 2009: real-time, high-performance, very low-latency streaming
- Single runtime for both streaming and batch processing
- Continuous flow: processes data as it arrives
- Pipelined execution is faster
- Batch is treated as a bounded stream (special case)
- Correct state upon failure; correct time/window semantics
- Supports event time and out-of-order events
- Own memory management; no reliance on JVM GC, so no GC spikes
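Pipelined, record-at-a-time execution can be imitated with Python generators. This is only an analogy: generators are pull-based and single-threaded, whereas Flink operators run in parallel across task managers. The stage names and the input values are hypothetical.

```python
def source(records, log):
    """First stage: hand out records one at a time."""
    for record in records:
        log.append(f"read {record}")
        yield record


def parse(stream, log):
    """Second stage: convert each record as soon as it is available."""
    for record in stream:
        log.append(f"parse {record}")
        yield int(record)


def double(stream, log):
    """Third stage: emit a result for each record without waiting for the rest."""
    for value in stream:
        result = value * 2
        log.append(f"emit {result}")
        yield result


log = []
pipeline = double(parse(source(["1", "2", "3"], log), log), log)
results = list(pipeline)
print(results)
print(log)
```

The log shows that the first record is read, parsed, and emitted before the second record is even read. A staged batch job would instead finish reading everything before parsing anything, which is where the extra latency comes from.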

[Figure: Apache Flink Architecture]

[Figure: Apache Flink Application and Dataflow]

Apache Flink Workflow
- Client: optimization, builds the job graph, passes the graph to the job manager
- Job manager (master): parallelization, creates the execution graph, assigns tasks to task managers
- Task manager (worker): runs the assigned tasks
- Diagram: Client -> Job Manager -> Task Managers

Apache Flink Real-World Use Cases
- Advertising: real-time one-to-one targeting
- Financial services: real-time fraud detection
- Retail: smart logistics, real-time monitoring of items and delivery
- Healthcare: smart hospitals, biometrics
- Telecom: real-time service optimization and billing based on location and usage
- Oil and gas: real-time monitoring of rigs and pumps

Flink Strengths and Limitations
Strengths:
- Stream-first: low latency, high throughput
- Own memory management; no reliance on JVM GC
- Self-driven
Limitations:
- Maturity
- Few large-scale deployments


Beam Model: Portability Across Big Data Engines
Courtesy: Google Next 17. Portable and parallel data processing.
- SDKs: Language A SDK (Java), Language B SDK (Python, soon), Language C SDK (TBD)
- Beam API and programming model; languages: Java, Python
- Engines: Spark, Flink, Dataflow, Apex
- Deployment: Cloud or on-premises
- Runners: Runner 1 (Apache Spark), Runner 2 (Apache Flink), Runner 3 (Google Cloud Dataflow)
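The portability idea is that a pipeline is a description, and a runner decides how to execute it. The sketch below illustrates this in plain Python with two toy runners. It is not the Beam SDK: the step format, `batch_runner`, `streaming_runner`, and the sample data are all hypothetical.

```python
# A pipeline is only a description: an ordered list of (kind, function) steps.
pipeline = [
    ("map", str.strip),
    ("filter", lambda s: s.startswith("ERROR")),
    ("map", len),
]


def batch_runner(steps, data):
    """Runs each step over the whole dataset before starting the next step."""
    rows = list(data)
    for kind, fn in steps:
        if kind == "map":
            rows = [fn(r) for r in rows]
        else:
            rows = [r for r in rows if fn(r)]
    return rows


def streaming_runner(steps, data):
    """Pushes one record at a time through all the steps."""
    out = []
    for record in data:
        keep = True
        for kind, fn in steps:
            if kind == "map":
                record = fn(record)
            elif not fn(record):
                keep = False
                break
        if keep:
            out.append(record)
    return out


data = [" ERROR disk full ", "INFO ok", "ERROR net down\n"]
batch_result = batch_runner(pipeline, data)
streaming_result = streaming_runner(pipeline, data)
print(batch_result, streaming_result)
```

Both runners produce the same answer from the same pipeline definition, which is the property that lets one program target Spark, Flink, or Dataflow without being rewritten.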

[Figure: Apache Beam Ecosystem]


The Oracle DBA's Scope
- Oracle, MySQL, NoSQL, other DBMS

Your Data Center: On-Premises & Cloud
- Oracle, Big & Fast Data, MySQL, other DBMS, NoSQL, HDFS

Career Move: Expanding Your Territory
- Oracle, Big Data, MySQL, other DBMS, NoSQL, HDFS

Career Move: Data Architect, Chief Data Officer
Big Data Administrator, Data Architect:
- Manages the Big Data clusters
- Monitors data and network traffic; prevents glitches
- Integrates, centralizes, protects, and maintains data sources
- Grants and revokes permissions to various clients and nodes
Chief Data Officer:
- Responsible for the overall data strategy within an organization
- Accountable for whatever data is collected, stored, shared, sold, or analyzed, as well as how it is collected, stored, shared, sold, or analyzed
- Ensures that data practices are implemented correctly and securely, and comply with customer privacy, data privacy, government, and ethical policies
- Defines company standards and policies for data operation, data accountability, and data quality

Roadmap to Big Data Architect and Chief Data Officer
- Get your hands on a Big Data platform, Cloud services, or VMs, e.g., Oracle BDALite Vbox, Oracle Big Data Cloud Services, Oracle BDA
- Leverage your Oracle background and notions: clusters, nodes, Oracle SQL (Big Data SQL), Big Data Connectors (e.g., Oracle Datasource for Hadoop)
- Get familiar with Big Data databases and storage: HDFS, NoSQL, DBMSes
- Get familiar with key Big Data frameworks: Hadoop, Spark, Flink, Beam, streaming frameworks (Kafka, Storm), and their integration with Oracle (OD4H, OD4S, and so on)
- Get familiar with Big Data tools and programming: Hive SQL, Spark SQL, visualization tools, R, Java, Scala, and so on
- Read, practice, and get involved in Big Data projects

Key Takeaways
- Big Data is growing exponentially
- This is the era of Fast Data, which requires new processing models
- Hadoop is good for some use cases but cannot handle streaming data
- Spark brings in-memory processing and data abstractions (RDD, etc.) and allows real-time processing of streaming data; however, its micro-batch architecture incurs high latency
- Flink brings low latency and promises to address Spark's limitations
- DBAs should embrace Big Data frameworks and expand their skills and coverage within the data center or in the Cloud

Resources
- Apache projects
- Big Data in the Cloud, Big Data Compute Edition
- Big Data Connectors
