Big Data Applications with Spring XD

Big Data Applications with Spring XD Thomas Darimont, Software Engineer, Pivotal Inc.

1 Big Data Applications with Spring XD
Thomas Darimont, Software Engineer, Pivotal
Unless otherwise indicated, these slides are © Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license.

2 THE FASTEST PATH TO NEW BUSINESS VALUE

3 Journey
- Introduction
- Concepts
- Applications
- Outlook

4 Introduction

5 Spring XD - Overview
- Platform for Big Data Applications
- Ingestion, Processing, Movement, Analytics
- Stream and Batch Processing
- Scalable Distributed Runtime
- Support for Deep Analytics
- Proven Spring Technologies

6 Spring XD - Why yet another Big Data Platform?
- Alternative to Frameworks like Flume, Oozie, Sqoop, Storm
- Just one Platform instead of many
- Common things easy, complex things possible
- Complementary to many technologies
  - Big SQL / MPP Databases - Impala, HAWQ
  - Stream Processing - Apache Spark
  - NoSQL DataStores - Cassandra, MongoDB

7 XD = eXtreme Data
Spring XD - one-stop shop for developing and deploying Big Data Apps

8 Spring XD - 10,000 Foot View
[Diagram labels: >_ (shell), REST, Spring XD Runtime, Streams (ingest, taps), Jobs (workflow, export), RDBMS, Redis, NoSQL, HDFS, Compute, Predictive Modelling (R, SAS), bidirectional]

9 Spring XD - Easy to Set Up and Run
Store incoming HTTP data into HDFS

10 Spring XD - Easy to Set Up and Run
1. Install via package manager / unzip
2. Start
    $ xd-singlenode
    $ xd-shell
3. Define
    xd:> stream create ingest --definition "http | hdfs"
4. Run
    xd:> stream deploy ingest
Yes, writing HTTP data to HDFS can be that simple!

11 Core Concepts

12 Spring XD - Core Concepts
- Runtime
- Modules
- Streams
- Taps
- Analytics
- Jobs
- Extensibility
- Deployment

13 Spring XD - Runtime
- Hosts Stream Processing & Batch Workflows & Analytics
- Manages Component Distribution
- Communication via MessageBus
- Additional Services
  - Configuration / Cluster State: ZooKeeper
  - Analytics: Redis, In-Memory
  - Message Bus: Redis, RabbitMQ, Kafka, Local

14 Spring XD - Instance Types
- XD-Admin
  - Assigns Modules to Containers
  - Manages Cluster Failover & HA
- XD-Container
  - Loads / Executes Modules
  - Connects to Data Bus
- Standalone, YARN, Cloud Foundry
[Diagram labels: XD UI, XD Shell, XD Admin (Leader), XD Container (modules), Batch Job State DB, Analytics Repository, ZK, Kafka/RabbitMQ/Redis]

15 Spring XD - Runtime Modes
[Diagram: single-node standalone (Development) vs. multi-node distributed (Production); labels: XD Admin, XD Container, Module, JVM, ZK, DB, MB]

16 Spring XD - Distributed Runtime
[Diagram labels: XDA, XDA, deploy, ZooKeeper, XDC (time), XDC (log), XDC, bind, Message Bus]
XDA = XD Admin, XDC = XD Container

17 Modules

18 Modules
- Unit of execution
- Source, Sink, Processor, Jobs
- Defined in XML or JVM Language
- Spring config file with Spring Bean Definitions
- Can have Parameters
- 50+ already included in XD
- Define new Modules via Composition

19 Modules - Overview
Source (#20): HTTP, SFTP, Tail, File, Mail, Syslog, TCP / TCP Client, Reactor IP, JMS, RabbitMQ, Time, MQTT, Mongo, Kafka, JDBC, Gemfire (CQ, Source), Twitter (Search, Stream), Stdout Capture
Processor (#13): Filter, Transform, Splitter, Aggregator, HTTP Client, Shell Command, Script, Groovy, Python, Java, JPMML-Evaluator, JSON-to-Tuple, Object-to-JSON
Sink (#20): Log, File, JDBC, TCP, MQTT, Mongo, Mail, Null Sink, Redis, RabbitMQ, HDFS, HDFS Dataset, Shell Command, GemFire Server, Splunk Server, Dynamic Router, Counter + 1 Gauge

20 Streams

21 Streams
- Programming model for real-time processing
- How data is collected, processed, and stored or forwarded
- DSL analogous to Unix Pipes and Filters
- Source | Processor (0..*) | Sink
- Data is pumped through MessageBus
- Spring Integration Components
[Diagram labels: Stream, Source, Processor, Sink, Message Bus]

22 Streams - Example
Transform payload incoming from HTTP to uppercase and send to log
    stream create test1 --definition "http | transform --expression=payload.toUpperCase() | log" --deploy
Here http is the source, transform the processor, --expression its option, and log the sink.
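The pipes-and-filters idea behind the stream definition above can be sketched in a few lines of Python. This is only an illustration of the programming model, not Spring XD code: the function names `http_source`, `transform` and `log_sink` are hypothetical stand-ins for the modules of the same name, and the message bus is replaced by plain generator chaining.

```python
from typing import Callable, Iterable, Iterator, List


def http_source(requests: Iterable[str]) -> Iterator[str]:
    """Hypothetical stand-in for the http source: yields each incoming payload."""
    for payload in requests:
        yield payload


def transform(messages: Iterable[str], expression: Callable[[str], str]) -> Iterator[str]:
    """Hypothetical stand-in for the transform processor: applies the expression per payload."""
    for payload in messages:
        yield expression(payload)


def log_sink(messages: Iterable[str], out: List[str]) -> None:
    """Hypothetical stand-in for the log sink: records every payload it receives."""
    for payload in messages:
        out.append(payload)


# Equivalent in spirit to: http | transform --expression=payload.toUpperCase() | log
logged: List[str] = []
log_sink(transform(http_source(["hello", "spring xd"]), str.upper), logged)
print(logged)
```

Each stage only knows its input and output, which is what allows a runtime to place the stages on different containers and connect them through a bus instead of direct calls.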

23 Taps

24 Taps
- Special type of Stream
- Consume data along the processing pipeline
- Original stream stays unaffected
- Collect metrics and perform analytics
[Diagram labels: Stream (Source, Processor, Sink), Tap (Processor, Sink), Message Bus]

25 Taps - Example
First create the stream
    stream create test1 --definition "http | transform --expression=payload.toUpperCase() | log" --deploy
Then create the tap: tap onto the transform stage, add a prefix and send to log
    stream create test1tap --definition "tap:stream:test1.transform > transform --expression='tapped: '+payload | log" --deploy
Here tap:stream:test1.transform is the tap source and > is the redirection.
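A tap can be pictured as a second subscriber on a channel that the original stream already publishes to. The following minimal sketch assumes a toy in-process `MessageBus` class (hypothetical, not the Spring XD message bus API) and channel names chosen to mirror the example above.

```python
from typing import Callable, Dict, List


class MessageBus:
    """Toy in-process bus (illustrative only): named channels with any number of subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = {}

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subscribers.setdefault(channel, []).append(handler)

    def publish(self, channel: str, payload: str) -> None:
        # Every subscriber receives the same payload; none can affect the others.
        for handler in list(self._subscribers.get(channel, [])):
            handler(payload)


bus = MessageBus()
main_log: List[str] = []
tap_log: List[str] = []

# Main stream: http -> transform (uppercase) -> log
bus.subscribe("test1.http", lambda p: bus.publish("test1.transform", p.upper()))
bus.subscribe("test1.transform", main_log.append)

# Tap: an additional subscriber on the output of the transform stage
bus.subscribe("test1.transform", lambda p: tap_log.append("tapped: " + p))

bus.publish("test1.http", "hello")
print(main_log, tap_log)
```

The main stream's log still receives exactly what it received before the tap was added, which is the "original stream stays unaffected" property from the slide.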

26 Analytics

27 Analytics
- Counters
  - Simple Counter - how many tweets?
  - Field Value Counter - how many for tag=#java?
  - Aggregate Counter - how many tweets for #java per time interval?
- Gauges
  - Gauge - what was the last seen value?
  - Rich Gauge - what was the last seen value/avg/min/max?
- Backed by Redis, In-Memory via Spring Data Repositories
- Accessible via XD-Shell and REST API on XD-Admin
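The difference between the counter and gauge types is easiest to see with a small in-memory sketch. The classes below are hypothetical illustrations of the semantics listed above, not the Spring XD implementations (which are backed by Redis or in-memory repositories); the 60-second bucket size and the sample data are assumptions.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict


class FieldValueCounter:
    """Counts occurrences per distinct field value (e.g. per hashtag)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, value: str) -> None:
        self.counts[value] += 1


class AggregateCounter:
    """Counts events per fixed time bucket (bucket size is an assumption)."""

    def __init__(self, bucket_seconds: int = 60) -> None:
        self.bucket_seconds = bucket_seconds
        self.buckets: Dict[int, int] = defaultdict(int)

    def record(self, timestamp: float) -> None:
        bucket_start = int(timestamp // self.bucket_seconds) * self.bucket_seconds
        self.buckets[bucket_start] += 1


@dataclass
class RichGauge:
    """Tracks last value plus running average, minimum and maximum."""

    last: float = 0.0
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def record(self, value: float) -> None:
        self.last = value
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0


# Sample events: (timestamp in seconds, hashtag)
tweets = [(0, "#java"), (30, "#spring"), (61, "#java"), (119, "#java")]

simple_count = len(tweets)            # Simple Counter: how many tweets?
by_tag = FieldValueCounter()          # Field Value Counter: how many per tag?
java_per_minute = AggregateCounter()  # Aggregate Counter: #java per minute
for ts, tag in tweets:
    by_tag.record(tag)
    if tag == "#java":
        java_per_minute.record(ts)

gauge = RichGauge()
for value in [3.0, 1.0, 5.0]:
    gauge.record(value)
```

A plain gauge would keep only `last`; the rich gauge adds the running statistics at the cost of a few extra fields per metric.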

28 Advanced Analytics
- Processor Modules
  - Python: numpy, pandas, scikit-learn, NLTK, SimpleCV
  - Shell: R-Project rscript, OpenCV
  - Java / Groovy
- PMML Processor Module
  - Predictive Model Markup Language
  - Description of Parameterised Data Mining Models
  - Allows operationalising Predictive Models
  - Real-time evaluation and scoring

29 Jobs

30 Jobs
- Programming Model for Batch Processing
- Create, Schedule, Execute and Monitor
- Spring Batch and Spring Hadoop Components
- Jobs (#5): CSV to JDBC, FTP to HDFS, JDBC to HDFS, HDFS to JDBC, HDFS to MongoDB

31 Jobs - Example
Create job from existing job definition
    job create --name "helloworld-job" --definition "helloworld" --deploy
Run job once
    job launch --name "helloworld-job"
Run job periodically
    stream create --name "hw-cron" --definition "trigger --cron='0/5 * * * * *' > queue:job:helloworld-job" --deploy

32 Management

33 Spring XD - Shell
- CLI based on Spring Shell
- Manages Streams, Jobs, Analytics and Deployment
- Completion / Assist
- Many built-in Commands - try help
- Started via xd-shell

34 Spring XD - Admin UI
- Management Interface
- accessible from XD-Admin Node
- XD-ADMIN:9393/admin-ui

35 Spring XD - REST Interface
- accessible from XD-Admin Node
- used by XD-Shell and Admin-UI

36 Extensibility

37 Extensibility
- Custom Modules
  - Source, Sink, Processor, Job
  - Spring Integration, Spring Batch
  - Self-contained fat-jar
  - E.g. to wrap a Java Library
  - Upload new modules via XD-Shell / REST
- Register custom Spring Expression Language Aliases
  - from java.lang.Double.parseDouble(payload.sensorvalue)
  - to #parseDouble(payload.sensorvalue)
- Scripts
  - Collection of XD commands
  - Automation

38 Deployment

39 Deployment
- deploy or --deploy
    stream deploy firststream
    stream create secondstream --deploy
- Deployment Manifest
  - Customize via --properties Parameter
  - Control # of Module Instances
  - Define Target Server or Group
  - Direct Binding
  - Stream Data Partitioning

40 Deployment Manifest - Module Count
Stream: http | worker | hdfs
    stream deploy --properties module.http.count=2, module.worker.count=4, module.hdfs.count=3
[Diagram: 2 http, 4 worker and 3 hdfs module instances]

41 Deployment Manifest - Module Placement
Stream: http | worker | hdfs
    stream deploy --properties module.http.count=2, module.worker.count=4, module.hdfs.count=3, module.http.criteria=group.contains('WEB')
Start a container as a member of a group:
    xd/bin/xd-container --groups="web"
[Diagram: 2 http, 4 worker and 3 hdfs module instances; the http instances run on containers labelled WEB]

42 Deployment Manifest - Data Partitioning
Stream: http | worker | hdfs
    stream deploy --properties module.worker.count=4, module.http.producer.partitionKeyExpression=payload.customerid
partition := hash(payload.customerid) % worker.count
[Diagram: 2 http, 4 worker and 3 hdfs module instances; messages are routed from http to a worker partition]
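The partition formula above can be tried out with a short sketch. It assumes CRC32 as the hash function purely for illustration (the slide does not say which hash Spring XD uses, and Python's built-in `hash()` is salted per process, so it is unsuitable here); `partition_for` and the sample customer ids are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def partition_for(key: str, partition_count: int) -> int:
    """partition := hash(key) % partition_count, with CRC32 as an assumed stable hash."""
    return zlib.crc32(key.encode("utf-8")) % partition_count


WORKER_COUNT = 4  # mirrors module.worker.count=4

# Sample payloads: (customerid, value)
events = [("c-17", 1), ("c-42", 2), ("c-17", 3), ("c-99", 4), ("c-42", 5)]

# Route every event to the worker partition derived from its customer id.
by_partition: Dict[int, List[tuple]] = defaultdict(list)
for customer_id, value in events:
    by_partition[partition_for(customer_id, WORKER_COUNT)].append((customer_id, value))

for partition, routed in sorted(by_partition.items()):
    print(partition, routed)
```

Because the partition depends only on the key, all events for one customer reach the same worker instance, which is what makes per-customer state or ordering possible on that worker.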

43 Applications

44 Spring XD - Measuring Live Usage for a Major Sports League
Measuring live video usage through mobile applications

45 Spring XD - IoT Connected Car
Journey and Range Prediction

46 Spring XD - Smartgrid
- ACM Distributed Event Based Systems 2014
  - Scalable, Real-Time Analytics, High Volume Sensor Data
  - Short-Term Load Forecasting in a Power Grid
  - Sensor Data from Smart Plugs
- Stream Components
  - Sensor Data Ingestion
  - Data Aggregation
  - Load Prediction
- Demo
  - Analytics via REST

47 What's next?

48 Roadmap and beyond
- Custom Modules in HDFS
- More OOTB Modules
- Web based Editor for Streams & Jobs
- Apache Ambari Support
- Security Enhancements
- Spring XD on Pivotal Cloud Foundry
- Currently in Beta
- GA Release Planned for

49 Learn more
- Project
- GitHub
- Wiki
- Samples
- Modules
- JIRA
- Stackoverflow

50 Spring XD - Takeaway
- Increased Productivity through out-of-the-box components
- Unified runtime for both Real-time and Batch use cases
- Scalable, Distributed and Fault Tolerant Runtime
- Closed Loop Analytics through online (stream) and offline (batch) data
- Data Ingestion, Processing, Movement, Analytics
- Swiss-army knife of data movement and data pipelines
- Repeatable turnkey solution for next generation data-centric use cases

51 Learn More. Stay Connected.
- Twitter: twitter.com/springcentral
- YouTube: spring.io/video
- LinkedIn: spring.io/linkedin
- Google Plus: spring.io/gplus

52 Backup Slides

53 Lambda Architecture

54 Lambda Architecture

55 Lambda Architecture - Spring XD
[Diagram labels: Ingest, Data Lake, Batch Layer, Batch Processing, Workflow Orchestration, Batch Views, Predictive Analytics, Speed Layer, Stream Processing, Real-time Views, Serving Layer, Export, Analytics; components: Spring XD, Spring Boot, Gemfire, HAWQ]

56 Predictive Models
- Model
  - Parameterised Algorithm
- Model Building
  - Derive a parameterised algorithm from the data
  - Slow process
  - Usually large data volume -> done offline as a batch process
- Model Scoring
  - Use the model to predict new information
  - Fast process
  - Can be done as part of stream processing
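The split between model building and model scoring can be illustrated with a deliberately tiny model. The sketch below assumes a one-variable linear regression fitted by ordinary least squares; the function names `build_model` and `score` and the sample data are hypothetical, and a real deployment would exchange the fitted parameters through a format such as PMML rather than a Python tuple.

```python
from typing import List, Tuple

Model = Tuple[float, float]  # (slope, intercept): the parameters of the algorithm


def build_model(history: List[Tuple[float, float]]) -> Model:
    """Model building (offline, batch): least-squares fit of y = slope * x + intercept."""
    n = len(history)
    mean_x = sum(x for x, _ in history) / n
    mean_y = sum(y for _, y in history) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in history)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in history)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def score(model: Model, x: float) -> float:
    """Model scoring (online, per event): a single multiply-add."""
    slope, intercept = model
    return slope * x + intercept


# Batch step: derive the parameters once from historical data (here y = 2x + 1).
history = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
model = build_model(history)

# Stream step: apply the fixed parameters to each incoming value.
predictions = [score(model, x) for x in [4.0, 10.0]]
print(model, predictions)
```

Building scans the whole data set, while scoring touches only the incoming value and the stored parameters, which is why scoring fits into a stream processor.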

57 PMML
- Predictive Model Markup Language
- Open Standard Maintained by Data Mining Group (DMG)
- XML based DSL for predictive models
- Can be interpreted
- 15 Model Types (Naive Bayes, General Regression, Neural Networks, etc.)
- First Version (1999)
- Current Version
- Lingua Franca for Predictive Models
- Bridge the Gap between Data Scientists and Engineers

58 Anatomy of a PMML Model
- Predictive Model
  - Algorithm description(s)
  - Parameterisation (trained model)
- Pre Processing
- Post Processing
  - Transform model output
  - Thresholds / Business rules
Source: PMML in Action, 2nd Edition, 2012, p. 7.

59 Predictive Analytics with Spring XD
- XD Module analytic-pmml
- Introduced in Spring XD M6 (April 2014)
- Real-time evaluation and scoring
- Based on JPMML-Evaluator
- Wide range of Model types
- spring-xd-modules/analytics-ml-pmml on GitHub
