Big Data Applications with Spring XD

Thomas Darimont, Software Engineer, Pivotal Inc.
@thomasdarimont

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/

THE FASTEST PATH TO NEW BUSINESS VALUE

Journey
- Introduction
- Concepts
- Applications
- Outlook

Introduction

Spring XD - Overview
- Platform for Big Data Applications
- Ingestion, Processing, Movement, Analytics
- Stream and Batch Processing
- Scalable Distributed Runtime
- Support for Deep Analytics
- Proven Spring Technologies

Spring XD - Why yet another Big Data Platform?
- Alternative to frameworks like Flume, Oozie, Sqoop, Storm
- Just one platform instead of many
- Common things easy, complex things possible
- Complementary to many technologies
  - Big SQL / MPP Databases: Impala, HAWQ
  - Stream Processing: Apache Spark
  - NoSQL DataStores: Cassandra, MongoDB

XD = eXtreme Data
Spring XD - one-stop shop for developing and deploying Big Data Apps

Spring XD - 10,000 Foot View
[Diagram. Labels: shell (>_), REST, Spring XD Runtime, Streams, taps, Jobs, ingest, workflow, export, bidirectional, RDBMS, Redis, Compute, NoSQL, HDFS, Predictive Modelling (R, SAS)]

Spring XD - Easy to Set Up and Run
Goal: store incoming HTTP data in HDFS.
1. Install via package manager / unzip
2. Start
     $ xd-singlenode
     $ xd-shell
3. Define
     xd:> stream create ingest --definition "http | hdfs"
4. Run
     xd:> stream deploy ingest
Yes, writing HTTP data to HDFS can be that simple!

Core Concepts

Spring XD - Core Concepts
- Runtime
- Modules
- Streams
- Taps
- Analytics
- Jobs
- Extensibility
- Deployment

Spring XD - Runtime
- Hosts Stream Processing, Batch Workflows, and Analytics
- Manages Component Distribution
- Communication via MessageBus
- Additional Services
  - Configuration / Cluster State: ZooKeeper
  - Analytics: Redis, In-Memory
  - Message Bus: Redis, RabbitMQ, Kafka, Local

Spring XD - Instance Types
- XD-Admin
  - Assigns Modules to Containers
  - Manages Cluster
  - Failover & HA
- XD-Container
  - Loads / Executes Modules
  - Connects to Data Bus
- Standalone, YARN, Cloud Foundry
[Diagram. Labels: XD UI, XD Shell, XD Admin (Leader), XD Containers hosting modules, Batch Job State DB, Analytics Repository, ZK, Kafka/RabbitMQ/Redis]

Spring XD - Runtime Modes
- single-node, standalone: Development
- multi-node, distributed: Production
[Diagram. Labels: XD Admin, XD Container, Module, ZK, DB, MB, JVM]

Spring XD - Distributed Runtime
[Diagram. Labels: XDA, XDC, ZooKeeper, Message Bus, deploy, bind, time, log]
XDA = XD Admin
XDC = XD Container

Modules
- Unit of execution
- Source, Sink, Processor, Jobs
- Defined in XML or a JVM language
- Spring config file with Spring Bean Definitions
- Can have Parameters
- 50+ already included in XD
- Define new Modules via Composition

Modules - Overview
- Source (20): HTTP, SFTP, Tail, File, Mail, Syslog, TCP / TCP Client, Reactor IP, JMS, RabbitMQ, Time, MQTT, Mongo, Kafka, JDBC, Gemfire CQ, Gemfire Source, Twitter Search, Twitter Stream, Stdout Capture
- Processor (13): Filter, Transform, Splitter, Aggregator, HTTP Client, Shell Command, Script, Groovy, Python, Java, JPMML-Evaluator, JSON-to-Tuple, Object-to-JSON
- Sink (20): Log, File, JDBC, TCP, MQTT, Mongo, Mail, Null Sink, Redis, RabbitMQ, HDFS, HDFS Dataset, Shell Command, GemFire Server, Splunk Server, Dynamic Router, Counter + 1, Gauge + 1

Streams
- Programming model for real-time processing
- How data is collected, processed, and stored or forwarded
- DSL analogous to Unix Pipes and Filters
- Source | Processor (0..*) | Sink
- Data is pumped through the MessageBus
- Spring Integration Components
[Diagram. Labels: Stream, Source, Processor, Sink, Message Bus]

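Streams - Example
Transform the payload incoming from HTTP to uppercase and send it to the log:
    stream create test1 --definition "http | transform --expression=payload.toUpperCase() | log" --deploy
Parts of the definition: http is the Source, transform is the Processor, --expression is its Option, and log is the Sink.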
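The pipes-and-filters model behind the stream DSL can be sketched in plain Python. This is only an illustration of the idea, not Spring XD code: the functions http_source, transform, and log_sink are hypothetical stand-ins for modules of the same kind, and Python generators play the role that the message bus plays between modules.

```python
from typing import Callable, Iterable, Iterator, List


def http_source(request_bodies: Iterable[str]) -> Iterator[str]:
    """Hypothetical source: emits one payload per incoming request body."""
    for body in request_bodies:
        yield body


def transform(payloads: Iterable[str], expression: Callable[[str], str]) -> Iterator[str]:
    """Hypothetical processor: applies an expression to every payload."""
    for payload in payloads:
        yield expression(payload)


def log_sink(payloads: Iterable[str], log: List[str]) -> None:
    """Hypothetical sink: consumes the stream and records each payload."""
    for payload in payloads:
        log.append(payload)


# Analogue of: http | transform --expression=payload.toUpperCase() | log
logged: List[str] = []
log_sink(transform(http_source(["hello", "spring xd"]), str.upper), logged)
print(logged)  # ['HELLO', 'SPRING XD']
```

Each stage only knows its input and output, which is what makes it possible to chain zero or more processors between a source and a sink.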

Taps
- Special type of Stream
- Consume data along the processing pipeline
- Original stream stays unaffected
- Collect metrics and perform analytics
[Diagram. Labels: Stream, Source, Processor, Sink, Message Bus, Tap (with its own Processor and Sink)]

Taps - Example
First create the stream:
    stream create test1 --definition "http | transform --expression=payload.toUpperCase() | log" --deploy
Then create the tap. It attaches to the transform stage, adds a prefix, and sends the result to the log:
    stream create test1tap --definition "tap:stream:test1.transform > transform --expression='tapped: '+payload | log" --deploy
Parts of the tap definition: tap:stream:test1.transform is the Tap Source, and > is the Redirection.
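A tap can be pictured as a stage that copies every payload to a side channel while passing the original through untouched. The sketch below is a conceptual Python illustration, not Spring XD code. It assumes the tap sees the output of the transform stage, and the helper name tapped is hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def tapped(payloads: Iterable[str], tap: Callable[[str], None]) -> Iterator[str]:
    """Hand a copy of each payload to the tap, then pass the original on unchanged."""
    for payload in payloads:
        tap(payload)
        yield payload


main_log: List[str] = []  # what the original stream's log sink receives
tap_log: List[str] = []   # what the tap stream's log sink receives

source = ["hello", "world"]
transformed = (payload.upper() for payload in source)

# The tap stream adds a prefix before logging; the main stream is unaffected.
for payload in tapped(transformed, lambda p: tap_log.append("tapped: " + p)):
    main_log.append(payload)

print(main_log)  # ['HELLO', 'WORLD']
print(tap_log)   # ['tapped: HELLO', 'tapped: WORLD']
```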

Analytics
- Counters
  - Simple Counter: how many tweets?
  - Field Value Counter: how many for tag=#java?
  - Aggregate Counter: how many tweets for #java per time interval?
- Gauges
  - Gauge: what was the last seen value?
  - Rich Gauge: what was the last seen value/avg/min/max?
- Backed by Redis or In-Memory via Spring Data Repositories
- Accessible via XD-Shell and REST API on XD-Admin
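The difference between the counter and gauge types is easiest to see on a small data set. The following Python sketch is an in-memory illustration of what each metric answers. It is not the Spring XD implementation, and the RichGauge class, the sample tweets, and their field names are made up for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class RichGauge:
    """Tracks last value, average, minimum, and maximum of a series."""
    count: int = 0
    total: float = 0.0
    last: Optional[float] = None
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def record(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.last = value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0


# Hypothetical sample data: tag, arrival minute, and text length of each tweet.
tweets = [
    {"tag": "#java", "minute": 0, "length": 80},
    {"tag": "#spring", "minute": 0, "length": 120},
    {"tag": "#java", "minute": 1, "length": 100},
]

simple_counter = 0               # how many tweets?
field_value_counter = Counter()  # how many per tag?
aggregate_counter = Counter()    # how many #java tweets per minute?
gauge = None                     # last seen length
rich_gauge = RichGauge()         # last/avg/min/max length

for tweet in tweets:
    simple_counter += 1
    field_value_counter[tweet["tag"]] += 1
    if tweet["tag"] == "#java":
        aggregate_counter[tweet["minute"]] += 1
    gauge = tweet["length"]
    rich_gauge.record(tweet["length"])

print(simple_counter, field_value_counter["#java"], dict(aggregate_counter))
print(gauge, rich_gauge.average, rich_gauge.minimum, rich_gauge.maximum)
```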

Advanced Analytics
- Processor Modules
  - Python: numpy, pandas, scikit-learn, NLTK, SimpleCV
  - Shell: R-Project Rscript, OpenCV
  - Java / Groovy
- PMML Processor Module
  - Predictive Model Markup Language
  - Description of parameterised data mining models
  - Allows operationalising predictive models
  - Real-time evaluation and scoring

Jobs
- Programming Model for Batch Processing
- Create, Schedule, Execute and Monitor
- Spring Batch and Spring Hadoop Components
- Jobs (5): CSV to JDBC, FTP to HDFS, JDBC to HDFS, HDFS to JDBC, HDFS to MongoDB

Jobs - Example
Create a job from an existing job definition:
    job create --name "helloworld-job" --definition "helloworld" --deploy
Run the job once:
    job launch --name "helloworld-job"
Run the job periodically (here every 5 seconds):
    stream create --name "hw-cron" --definition "trigger --cron='0/5 * * * * *' > queue:job:helloworld-job" --deploy

Management

Spring XD - Shell
- CLI based on Spring Shell
- Manages Streams, Jobs, Analytics and Deployment
- Completion / Assist
- Many built-in commands: try help
- Started via xd-shell

Spring XD - Admin UI
- Management Interface
- Accessible from XD-Admin Node
- XD-ADMIN:9393/admin-ui

Spring XD - REST Interface
- Accessible from XD-Admin Node
- Used by XD-Shell and Admin-UI
- http://xd-admin:9393

Extensibility
- Custom Modules
  - Source, Sink, Processor, Job
  - Spring Integration, Spring Batch
  - Self-contained fat-jar, e.g. to wrap a Java library
  - Upload new modules via XD-Shell / REST
- Register custom Spring Expression Language aliases
  - from java.lang.Double.parseDouble(payload.sensorValue)
  - to #parseDouble(payload.sensorValue)
- Scripts
  - Collection of XD commands
  - Automation

Deployment
- deploy or --deploy
    stream deploy firststream
    stream create secondstream --deploy
- Deployment Manifest
  - Customize via the --properties parameter
  - Control the number of module instances
  - Define target server or group
  - Direct Binding
  - Stream Data Partitioning

Deployment Manifest - Module Count
Stream modules: http, worker, hdfs
    stream deploy --properties module.http.count=2,module.worker.count=4,module.hdfs.count=3
[Diagram: 2 http, 4 worker, and 3 hdfs module instances]

Deployment Manifest - Module Placement
Stream modules: http, worker, hdfs
    stream deploy --properties module.http.count=2,module.worker.count=4,module.hdfs.count=3,module.http.criteria=group.contains('WEB')
Containers are assigned to a group at startup:
    xd/bin/xd-container --groups="web"
[Diagram: 2 http, 4 worker, and 3 hdfs module instances, with a container group labelled WEB]

Deployment Manifest - Data Partitioning
Stream modules: http, worker, hdfs
    stream deploy --properties module.worker.count=4,module.http.producer.partitionKeyExpression=payload.customerId
Each message leaving http is routed to one of the worker partitions 0 to 3:
    partition := hash(payload.customerId) % worker.count
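The partition formula can be tried out in a few lines of Python. This is a sketch of the arithmetic only: the function partition_for and the sample payloads are hypothetical, and zlib.crc32 is used as a stand-in for whatever hash function the runtime applies, chosen here because it gives the same result on every run.

```python
import zlib
from typing import Dict, List

WORKER_COUNT = 4  # corresponds to module.worker.count=4


def partition_for(customer_id: str, worker_count: int) -> int:
    """partition := hash(customerId) % worker.count, with CRC32 as the stand-in hash."""
    return zlib.crc32(customer_id.encode("utf-8")) % worker_count


# Hypothetical payloads; only the customerId field matters for routing.
payloads = [
    {"customerId": "c-17", "amount": 10},
    {"customerId": "c-42", "amount": 25},
    {"customerId": "c-17", "amount": 5},
]

assignments: Dict[int, List[dict]] = {}
for payload in payloads:
    partition = partition_for(payload["customerId"], WORKER_COUNT)
    assignments.setdefault(partition, []).append(payload)

for partition, items in sorted(assignments.items()):
    print(partition, [item["customerId"] for item in items])
```

Because the partition depends only on the key, all messages for one customer reach the same worker instance, which is what allows a worker to keep per-customer state.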

Applications

Spring XD - Measuring Live Usage for a Major Sports League
- Measuring live video usage through mobile applications

Spring XD - IoT Connected Car
- Journey and Range Prediction

Spring XD - Smartgrid
- ACM Distributed Event Based Systems 2014
- Scalable, Real-Time Analytics, High Volume Sensor Data
- Short-Term Load Forecasting in a Power Grid
- Sensor Data from Smart Plugs
- Stream Components
  - Sensor Data Ingestion
  - Data Aggregation
  - Load Prediction
- Demo Analytics via REST

What's next?

Roadmap - 1.2 and beyond
- Custom Modules in HDFS
- More OOTB Modules
- Web-based Editor for Streams & Jobs
- Apache Ambari Support
- Security Enhancements
- Spring XD on Pivotal Cloud Foundry
  - Currently in Beta
  - http://docs.pivotal.io/spring-xd
  - GA Release Planned for 2015

Learn more
- Project: http://projects.spring.io/spring-xd
- GitHub: https://github.com/spring-projects/spring-xd
- Wiki: http://docs.spring.io/spring-xd/docs/current/reference/html/
- Samples: https://github.com/spring-projects/spring-xd-samples
- Modules: https://github.com/spring-projects/spring-xd-modules
- JIRA: https://jira.spring.io/browse/xd
- Stackoverflow: http://stackoverflow.com/questions/tagged/spring-xd

Spring XD - Takeaway
- Increased Productivity through out-of-the-box components
- Unified runtime for both Real-time and Batch use cases
- Scalable, Distributed and Fault Tolerant Runtime
- Closed Loop Analytics through online (stream) and offline (batch) data
- Data Ingestion, Processing, Movement, Analytics
- Swiss-army knife of data movement and data pipelines
- Repeatable turnkey solution for next generation data-centric use cases

Learn More. Stay Connected.
- Twitter: twitter.com/springcentral
- YouTube: spring.io/video
- LinkedIn: spring.io/linkedin
- Google Plus: spring.io/gplus

Backup Slides

Lambda Architecture

Lambda Architecture - Spring XD
[Diagram. Labels: Ingest, Data Lake, Batch Layer (Batch Processing, Workflow Orchestration, Predictive Analytics, Batch Views), Speed Layer (Stream Processing, Real-time Views), Serving Layer, Export, Analytics; technologies shown: Spring XD, Gemfire, HAWQ, Spring Boot]

Predictive Models
- Model: parameterised algorithm
- Model Building
  - Derive a parameterised algorithm from the data
  - Slow process
  - Usually large data volume -> done offline as a batch process
- Model Scoring
  - Use the model to predict new information
  - Fast process
  - Can be done as part of stream processing
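The split between building and scoring can be shown with the simplest possible model. The Python sketch below fits a straight line by ordinary least squares (the building step) and then applies the resulting parameters to a new value (the scoring step). The function names build_model and score and the training data are made up for illustration, and a real deployment would use a far richer model.

```python
from typing import Dict, Sequence


def build_model(xs: Sequence[float], ys: Sequence[float]) -> Dict[str, float]:
    """Model building: derive slope and intercept from historical data (batch step)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def score(model: Dict[str, float], x: float) -> float:
    """Model scoring: one multiplication and one addition per incoming value."""
    return model["intercept"] + model["slope"] * x


# Offline: training data that follows y = 2x + 1.
model = build_model([1, 2, 3, 4], [3, 5, 7, 9])

# Online: score a new observation as it arrives.
prediction = score(model, 10)
print(model, prediction)
```

Building touches every training record, while scoring only needs the stored parameters, which is why scoring fits inside a stream processor.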

PMML
- Predictive Model Markup Language
- Open Standard maintained by the Data Mining Group (DMG)
- XML-based DSL for predictive models
- Can be interpreted
- 15 Model Types (Naive Bayes, General Regression, Neural Networks, etc.)
- First Version (1999), Current Version 4.2.1
- Lingua Franca for Predictive Models
- Bridges the gap between Data Scientists and Engineers

Anatomy of a PMML Model
- Predictive Model
  - Algorithm description(s)
  - Parameterisation: trained model
- Pre-Processing
- Post-Processing
  - Transform model output
  - Thresholds / Business rules
Source: PMML in Action, 2nd Edition, 2012, p. 7.
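The three parts of such a model can be pictured as a chain of functions applied to each record. The Python sketch below is a hand-written analogue for illustration only: it does not read PMML, and the field names, coefficients, and threshold are invented.

```python
from typing import Dict


def pre_process(record: Dict[str, float]) -> Dict[str, float]:
    """Pre-processing: convert the raw reading from watts to kilowatts."""
    return {"load_kw": record["load_w"] / 1000.0}


def predictive_model(features: Dict[str, float]) -> float:
    """Model: a parameterised algorithm; the coefficients are hypothetical trained values."""
    return 0.5 + 0.8 * features["load_kw"]


def post_process(raw_output: float, threshold_kw: float = 2.0) -> Dict[str, object]:
    """Post-processing: apply a threshold / business rule to the model output."""
    return {"predicted_kw": raw_output, "alert": raw_output > threshold_kw}


def evaluate(record: Dict[str, float]) -> Dict[str, object]:
    return post_process(predictive_model(pre_process(record)))


high = evaluate({"load_w": 2500})
low = evaluate({"load_w": 1000})
print(high["alert"], low["alert"])
```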

Predictive Analytics with Spring XD
- XD Module analytic-pmml
- Introduced in Spring XD 1.0.0 M6 (April 2014)
- Real-time evaluation and scoring
- Based on JPMML-Evaluator
- Wide range of model types
- spring-xd-modules/analytics-ml-pmml on GitHub