Big Data Applications with Spring XD

Thomas Darimont, Software Engineer, Pivotal Inc. @thomasdarimont
Unless otherwise indicated, these slides are 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/

THE FASTEST PATH TO NEW BUSINESS VALUE

Journey: Introduction, Concepts, Applications, Outlook

Introduction

Spring XD - Overview
- Platform for Big Data Applications
- Ingestion, Processing, Movement, Analytics
- Stream and Batch Processing
- Scalable Distributed Runtime
- Support for Deep Analytics
- Proven Spring Technologies

Spring XD - Why yet another Big Data Platform?
- Alternative to Frameworks like Flume, Oozie, Sqoop, Storm
- Just one Platform instead of many
- Common things easy, complex things possible
- Complementary to many technologies
  - Big SQL / MPP Databases - Impala, HAWQ
  - Stream Processing - Apache Spark
  - NoSQL DataStores - Cassandra, MongoDB

Spring XD (eXtreme Data) - one-stop shop for developing and deploying Big Data Apps

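Spring XD - 10,000 Foot View
[Diagram: the Spring XD Runtime, reached via shell (>_) and REST, hosts Streams (ingest, taps) and Jobs (workflow, export); it exchanges data bidirectionally with Compute, HDFS, RDBMS, NoSQL, and Redis, and with Predictive Modelling tools (R, SAS)]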

Spring XD - Easy to Setup and Run
Store incoming HTTP data into HDFS
1. Install via package manager / unzip
2. Start
   $ xd-singlenode
   $ xd-shell
3. Define
   xd:> stream create ingest --definition "http | hdfs"
4. Run
   xd:> stream deploy ingest
Yes, writing HTTP data to HDFS can be that simple!

Core Concepts

Spring XD - Core Concepts
- Runtime
- Modules
- Streams
- Taps
- Analytics
- Jobs
- Extensibility
- Deployment

Spring XD - Runtime
- Hosts Stream Processing & Batch Workflows & Analytics
- Manages Component Distribution
- Communication via MessageBus
- Additional Services
  - Configuration / Cluster State: ZooKeeper
  - Analytics: Redis, In-Memory
  - Message Bus: Redis, RabbitMQ, Kafka, Local

Spring XD - Instance Types
- XD-Admin
  - Assigns Modules to Containers
  - Manages Cluster Failover & HA
- XD-Container
  - Loads / Executes Modules
  - Connects to Data Bus
  - Standalone, YARN, Cloud Foundry
[Diagram: XD UI and XD Shell talk to the XD Admin leader (one of several XD Admin instances); XD Containers host modules; shared services are the Batch Job State DB, the Analytics Repository, ZooKeeper (ZK), and Kafka/RabbitMQ/Redis]

Spring XD - Runtime Modes
[Diagram: single-node standalone mode (Development) runs XD Admin, an XD Container with its modules, ZooKeeper (ZK), the database (DB), and the message bus (MB) in one JVM; multi-node distributed mode (Production) runs XD Admin, each XD Container, ZK, DB, and MB in separate JVMs]

Spring XD - Distributed Runtime
XDA = XD Admin, XDC = XD Container
[Diagram: XD Admins (XDA) deploy through ZooKeeper; XD Containers (XDC) run the deployed modules (shown: time, log) and bind them to the Message Bus]

Modules

Modules
- Unit of execution
- Source, Sink, Processor, Jobs
- Defined in XML or JVM Language
- Spring config file with Spring Bean Definitions
- Can have Parameters
- 50+ already included in XD
- Define new Modules via Composition

Modules - Overview
- Source (#20): HTTP, SFTP, Tail, File, Mail, Syslog, TCP / TCP Client, Reactor IP, JMS, RabbitMQ, Time, MQTT, Mongo, Kafka, JDBC, Gemfire (CQ, Source), Twitter (Search, Stream), Stdout Capture
- Processor (#13): Filter, Transform, Splitter, Aggregator, HTTP Client, Shell Command, Script, Groovy, Python, Java, JPMML-Evaluator, JSON-to-Tuple, Object-to-JSON
- Sink (#20): Log, File, JDBC, TCP, MQTT, Mongo, Mail, Null Sink, Redis, RabbitMQ, HDFS, HDFS Dataset, Shell Command, GemFire Server, Splunk Server, Dynamic Router, Counter + 1, Gauge + 1

Streams

Streams
- Programming model for real-time processing
- How data is collected, processed, and stored or forwarded
- DSL analogous to Unix Pipes and Filters
  - Source | Processor (0..*) | Sink
- Data is pumped through MessageBus
- Spring Integration Components
[Diagram: a Stream whose Source, Processor, and Sink modules are connected via the Message Bus]
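
The pipes-and-filters model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Spring XD code: `run_stream`, `drop_empty`, and `uppercase` are hypothetical names, and plain generator chaining stands in for the MessageBus that connects modules in XD.

```python
from typing import Callable, Iterable, Iterator, List

# A processor consumes a stream of messages and yields a new stream.
Processor = Callable[[Iterator[str]], Iterator[str]]


def run_stream(source: Iterable[str],
               processors: List[Processor],
               sink: Callable[[str], None]) -> None:
    """Pump every message from the source through 0..* processors into the sink."""
    messages: Iterator[str] = iter(source)
    for processor in processors:
        messages = processor(messages)  # chain the next filter onto the pipe
    for message in messages:
        sink(message)


def drop_empty(messages: Iterator[str]) -> Iterator[str]:
    """Hypothetical filter stage: discard empty payloads."""
    for message in messages:
        if message:
            yield message


def uppercase(messages: Iterator[str]) -> Iterator[str]:
    """Hypothetical transform stage: uppercase each payload."""
    for message in messages:
        yield message.upper()


collected: List[str] = []
run_stream(["hello", "", "spring xd"], [drop_empty, uppercase], collected.append)

# A stream with zero processors passes messages straight to the sink.
passthrough: List[str] = []
run_stream(["raw"], [], passthrough.append)
```

The processor list may be empty, which mirrors the "0..*" multiplicity on the slide: a source can be piped directly into a sink.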

Streams - Example
Transform payload incoming from HTTP to uppercase and send to log
stream create test1 --definition "http | transform --expression=payload.toUpperCase() | log" --deploy
[Diagram labels: Source (http), Processor (transform), Option (--expression), Sink (log)]

Taps

Taps
- Special type of Stream
- Consume data along the processing pipeline
- Original stream stays unaffected
- Collect metrics and perform analytics
[Diagram: a Stream (Source, Processor, Sink) on the Message Bus, with a Tap feeding a separate Processor and Sink]

Taps - Example
First create the stream:
stream create test1 --definition "http | transform --expression=payload.toUpperCase() | log" --deploy
Then create the tap onto the transform stage, add a prefix, and send to log:
stream create test1tap --definition "tap:stream:test1.transform > transform --expression='tapped: '+payload | log" --deploy
[Diagram labels: Tap Source (tap:stream:test1.transform), Redirection (>)]
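
The key property of a tap, that it observes a stage without changing what the original stream delivers, can be sketched as follows. This is a simplified in-process model, not how Spring XD implements taps; the `Stage` class and `run` helper are hypothetical, and the sketch assumes the tap receives the output of the tapped stage.

```python
from typing import Callable, List


class Stage:
    """One module of a stream; hands a copy of its output to any attached taps."""

    def __init__(self, name: str, fn: Callable[[str], str]) -> None:
        self.name = name
        self.fn = fn
        self.taps: List[Callable[[str], None]] = []

    def handle(self, payload: str) -> str:
        result = self.fn(payload)
        for tap in self.taps:
            tap(result)   # the tap sees the stage output...
        return result     # ...but the main stream continues unchanged


def run(stages: List[Stage], payload: str) -> str:
    """Pass one payload through all stages of the main stream."""
    for stage in stages:
        payload = stage.handle(payload)
    return payload


main_log: List[str] = []
tap_log: List[str] = []


def log_sink(payload: str) -> str:
    main_log.append(payload)
    return payload


transform = Stage("transform", str.upper)
log = Stage("log", log_sink)

# Roughly analogous to tapping test1.transform, adding a prefix, and logging.
transform.taps.append(lambda payload: tap_log.append("tapped: " + payload))

run([transform, log], "hello")
```

After the run, the main stream's log holds only the uppercased payload, while the tap's log holds its own prefixed copy.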

Analytics

Analytics
- Counters
  - Simple Counter - how many tweets?
  - Field Value Counter - how many for tag=#java?
  - Aggregate Counter - how many tweets for #java per time interval?
- Gauges
  - Gauge - what was the last seen value?
  - Rich Gauge - what was the last seen value/avg/min/max?
- Backed by Redis or In-Memory, via Spring Data Repositories
- Accessible via XD-Shell and REST API on XD-Admin
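
The difference between the counter and gauge types above is easiest to see on a small data set. The following Python sketch is an assumption-laden illustration, not the Spring XD analytics API: the `RichGauge` class, the sample tweets, and the one-minute bucket size are all made up for the example.

```python
from collections import Counter


class RichGauge:
    """Tracks the last value plus running average, minimum, and maximum."""

    def __init__(self) -> None:
        self.last = None
        self.minimum = None
        self.maximum = None
        self.count = 0
        self.total = 0.0

    def record(self, value: float) -> None:
        self.last = value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)
        self.count += 1
        self.total += value

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0


# Hypothetical sample data: (timestamp in seconds, hashtag) per tweet.
tweets = [(0, "java"), (30, "spring"), (61, "java"), (119, "java")]

# Simple Counter: how many tweets in total?
simple_counter = len(tweets)

# Field Value Counter: how many tweets per tag value?
field_value_counter = Counter(tag for _, tag in tweets)

# Aggregate Counter: how many #java tweets per one-minute interval?
aggregate_counter = Counter(ts // 60 for ts, tag in tweets if tag == "java")

# Rich Gauge: last seen value together with avg/min/max.
gauge = RichGauge()
for reading in [3.0, 1.0, 5.0]:
    gauge.record(reading)
```

A plain Gauge would keep only `last`; the rich variant adds the running statistics.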

Advanced Analytics
- Processor Modules
  - Python: numpy, pandas, scikit-learn, NLTK, SimpleCV
  - Shell: R-Project Rscript, OpenCV
  - Java / Groovy
- PMML Processor Module
  - Predictive Model Markup Language
  - Description of Parameterised Data Mining Models
  - Allows operationalising Predictive Models
  - Real-time evaluation and scoring

Jobs

Jobs
- Programming Model for Batch Processing
- Create, Schedule, Execute and Monitor
- Spring Batch and Spring Hadoop Components
- Jobs (#5): CSV to JDBC, FTP to HDFS, JDBC to HDFS, HDFS to JDBC, HDFS to MongoDB

Jobs - Example
Create job from existing job definition:
job create --name "helloworld-job" --definition "helloworld" --deploy
Run job once:
job launch --name "helloworld-job"
Run job periodically:
stream create --name "hw-cron" --definition "trigger --cron='0/5 * * * * *' > queue:job:helloworld-job" --deploy

Management

Spring XD - Shell
- CLI based on Spring Shell
- Manages Streams, Jobs, Analytics and Deployment
- Completion / Assist
- Many built-in Commands (try help)
- Started via xd-shell

Spring XD - Admin UI
- Management Interface
- Accessible from XD-Admin Node
- XD-ADMIN:9393/admin-ui

Spring XD - REST Interface
- Accessible from XD-Admin Node
- Used by XD-Shell and Admin-UI
- http://xd-admin:9393

Extensibility

Extensibility
- Custom Modules
  - Source, Sink, Processor, Job
  - Spring Integration, Spring Batch
  - Self-contained fat-jar
  - E.g. to wrap a Java Library
  - Upload new modules via XD-Shell / REST
- Register custom Spring Expression Language Aliases
  - from java.lang.Double.parseDouble(payload.sensorvalue)
  - to #parseDouble(payload.sensorvalue)
- Scripts
  - Collection of XD commands
  - Automation

Deployment

Deployment
- deploy or --deploy
  - stream deploy firststream
  - stream create secondstream --deploy
- Deployment Manifest
  - Customize via --properties Parameter
  - Control # of Module Instances
  - Define Target Server or Group
  - Direct Binding
  - Stream Data Partitioning

Deployment Manifest - Module Count
stream deploy --properties module.http.count=2, module.worker.count=4, module.hdfs.count=3
[Diagram: a stream of http, worker, and hdfs modules deployed as 2 http, 4 worker, and 3 hdfs instances]

Deployment Manifest - Module Placement
stream deploy --properties module.http.count=2, module.worker.count=4, module.hdfs.count=3, module.http.criteria=group.contains('WEB')
xd/bin/xd-container --groups="web"
[Diagram: the same http, worker, and hdfs instances, with the http instances placed on containers in the WEB group]

Deployment Manifest - Data Partitioning
stream deploy --properties module.worker.count=4, module.http.producer.partitionKeyExpression=payload.customerid
partition := hash(payload.customerid) % worker.count
[Diagram: http instances in the WEB group route each message to one of the worker partitions 0 to 3, followed by the hdfs instances]
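
The partitioning rule on this slide, hash of the key modulo the number of worker instances, can be tried out with a short Python sketch. Everything here is illustrative: `partition_for` is a hypothetical helper, the customer ids are made up, and CRC32 is swapped in as a stable hash for the example only (it is not claimed to be the hash Spring XD applies).

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def partition_for(customer_id: str, worker_count: int) -> int:
    """Map a partition key to a worker index: hash(key) % worker_count."""
    # zlib.crc32 is deterministic across runs, unlike Python's built-in hash()
    # for strings, which is randomized per process.
    return zlib.crc32(customer_id.encode("utf-8")) % worker_count


WORKER_COUNT = 4  # corresponds to module.worker.count=4 on the slide

# Hypothetical payloads carrying a customer id.
payloads = [
    {"customerid": "c-17", "amount": 10},
    {"customerid": "c-42", "amount": 25},
    {"customerid": "c-17", "amount": 5},
]

routed: Dict[int, List[dict]] = defaultdict(list)
for payload in payloads:
    routed[partition_for(payload["customerid"], WORKER_COUNT)].append(payload)
```

Because the partition depends only on the key, both `c-17` messages land on the same worker instance, which is what makes per-customer state on a worker possible.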

Applications

Spring XD - Measuring Live Usage for a Major Sports League
- Measuring live video usage through mobile applications

Spring XD - IoT Connected Car
- Journey and Range Prediction

Spring XD - Smartgrid
- ACM Distributed Event Based Systems 2014
- Scalable, Real-Time Analytics, High Volume Sensor Data
- Short-Term Load Forecasting in a Power Grid
- Sensor Data from Smart Plugs
- Stream Components
  - Sensor Data Ingestion
  - Data Aggregation
  - Load Prediction
- Demo Analytics via REST

What's next?

Roadmap - 1.2 and beyond
- Custom Modules in HDFS
- More OOTB Modules
- Web based Editor for Streams & Jobs
- Apache Ambari Support
- Security Enhancements
- Spring XD on Pivotal Cloud Foundry
  - Currently in Beta
  - http://docs.pivotal.io/spring-xd
  - GA Release Planned for 2015

Learn more
- Project: http://projects.spring.io/spring-xd
- GitHub: https://github.com/spring-projects/spring-xd
- Wiki: http://docs.spring.io/spring-xd/docs/current/reference/html/
- Samples: https://github.com/spring-projects/spring-xd-samples
- Modules: https://github.com/spring-projects/spring-xd-modules
- JIRA: https://jira.spring.io/browse/xd
- Stackoverflow: http://stackoverflow.com/questions/tagged/spring-xd

Spring XD - Takeaway
- Increased Productivity through out-of-the-box components
- Unified runtime for both Real-time and Batch use cases
- Scalable, Distributed and Fault Tolerant Runtime
- Closed Loop Analytics through online (stream) and offline (batch) data
- Data Ingestion, Processing, Movement, Analytics
- Swiss-army knife of data movement and data pipelines
- Repeatable turnkey solution for next generation data-centric use cases

Learn More. Stay Connected.
- Twitter: twitter.com/springcentral
- YouTube: spring.io/video
- LinkedIn: spring.io/linkedin
- Google Plus: spring.io/gplus

Backup Slides

Lambda Architecture

Lambda Architecture - Spring XD
[Diagram labels: Ingest; Data Lake; Speed Layer (Spring XD Stream Processing, Real-time Views, Gemfire); Batch Layer (Batch Processing, Workflow Orchestration, Predictive Analytics, Batch Views, HAWQ); Serving Layer (Spring Boot); Export; Analytics]

Predictive Models
- Model
  - Parameterised Algorithm
- Model Building
  - Derive a parameterised algorithm from the data
  - Slow process
  - Usually large data volume -> done offline as a batch process
- Model Scoring
  - Use the model to predict new information
  - Fast process
  - Can be done as part of stream processing
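
The split between slow, offline model building and fast, per-message scoring can be illustrated with a deliberately tiny model. The sketch below assumes a one-variable linear regression fitted by ordinary least squares; the function names `build_model` and `score` and the training data are hypothetical and unrelated to any Spring XD or PMML API.

```python
from typing import Dict, List


def build_model(xs: List[float], ys: List[float]) -> Dict[str, float]:
    """Model building: derive the parameters from historical data (batch step)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    # The "model" is just the parameterised algorithm: y = intercept + slope * x
    return {"slope": slope, "intercept": intercept}


def score(model: Dict[str, float], x: float) -> float:
    """Model scoring: apply the fixed parameters to one new value (stream step)."""
    return model["intercept"] + model["slope"] * x


# Offline: fit once on made-up historical observations.
model = build_model([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Online: score each incoming message with a constant-time computation.
incoming = [5.0, 6.0]
predictions = [score(model, x) for x in incoming]
```

Building touches the whole data set, while scoring is a single multiply-add per message, which is why scoring fits inside a stream processor.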

PMML
- Predictive Model Markup Language
- Open Standard
- Maintained by Data Mining Group (DMG)
- XML based DSL for predictive models
- Can be interpreted
- 15 Model Types (Naive Bayes, General Regression, Neural Networks, etc.)
- First Version (1999)
- Current Version 4.2.1
- Lingua Franca for Predictive Models
- Bridge the Gap between Data Scientists and Engineers

Anatomy of a PMML Model
- Predictive Model
  - Algorithm description(s)
  - Parameterisation (trained model)
- Pre Processing
- Post Processing
  - Transform model output
  - Thresholds / Business rules
Source: PMML in Action, 2nd Edition, 2012, p. 7.

Predictive Analytics with Spring XD
- XD Module analytic-pmml
- Introduced in Spring XD 1.0.0 M6 (April 2014)
- Real-time evaluation and scoring
- Based on JPMML-Evaluator
- Wide range of Model types
- spring-xd-modules/analytics-ml-pmml on GitHub