Real-Time Alarm Verifica/on with Spark Streaming and Machine Learning

Size: px
Start display at page:

Download "Real-Time Alarm Verifica/on with Spark Streaming and Machine Learning"

Transcription

1 Real-Time Alarm Verifica/on with Spark Streaming and Machine Learning Ana Sima, Jan Stampfli, Kurt Stockinger Zurich University of Applied Sciences Swiss Big Data User Group Mee3ng January 23, 2017

2 About Me Prof. Dr. Kurt Stockinger Since summer 2013 at ZHAW Databases, Data Warehousing, Big Data : Data Warehouse & Business Intelligence Architect : Computer Scien3st : Computer Scien3st 2

3 What is the Problem with Alarm Systems? 3

4 A Typical Example Serious universi3es 4

5 with Serious Students 5

6 A Serious Party in a Student Home 6

7 Serious Fire Fighters 7

8 Why Machine Learning? Around 90% of reported incidents are false alarms Human operators should priori3ze true alarms Machine learning is able to discover the pa_erns buried in the alarm data separate true from false alarms with a high accuracy 8

9 DIGITAL FOOTPRINT OF A SECURED OBJECT Alarm panel Transmitter User behavior Alarm types User behavior will improve overall process User based business models are possible with machine learning tools e.g. Services payments only when an alarm panel is activated 9

10 Research Project with Sitasys 10

11 Talk Outline Machine learning for alarm verifica3on End-to-end performance analysis: Stream processing (live analysis) Batch processing (historic analysis) Online machine learning (live verifica3on) 11

12 Machine Learning with Spark 12

13 About Me Jan Stampfli The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again. ZHAW Bachelor in Computer Science Graduated in 2014 Research assistant at ZHAW Since September 2014 Working in research projects on the topics Big Data Machine Learning Informa3on Retrieval 13

14 A Typical Machine Learning Process 14

15 Spark ML Pipeline Pipeline DataFrame Transformer Transformer Pipeline Transformer Estimator Estimator Estimator loca/on label Bern 1 Chur 0 Sion 1 hash predic/on probabili/es 0 [0.51, 0.49] 0 [0.99, 0.01] 1 [0.13, 0.87] h_p://spark.apache.org/docs/latest/ml-pipeline.html 15

16 Spark ML Tuning CrossValidator DataFrame Pipeline Evaluator Random Forest Number of trees Max depth of trees Parameter Map [ 10, 20, 30 ] [ 5, 25, 50 ] h_p://spark.apache.org/docs/latest/ml-tuning.html 16

17 Spark ML Evalua3on Prediction DataFrame Evaluator label probabili/es 1 [0.51, 0.49] 0 [0.99, 0.01] 1 [0.13, 0.87] Accuracy Precision Recall h_p://spark.apache.org/docs/latest/mllib-evalua3on-metrics.html 17

18 Applied Machine Learning 18

19 Data about Real Alarms Total False alarms True alarms Total Training Test Features Loca3on, sensor type, day of week, Labels 0 = false alarm 1 = true alarm 19

20 Applied Algorithms Random Forest Support Vector Machine Logis3c Regression Deep Neural Network (DNN) 20

21 Results % % Accuracy Jan Stampfli, Kurt Stockinger (2016). Applied data Science: Using Machine Learning for Alarm Verifica3on, ERCIM News, 107,

22 Evalua3on Random Forest Confusion matrix Predic/on True Predic/on False Label True Label False Precision (correctly predicted alarms) Predic/on True Predic/on False Label True 92.07% Label False 92.62% Recall (correctly detected alarms) Predic/on True Predic/on False Label True 93.38% Label False 91.18% 22

23 Conclusion about ML Human operators are now able to priori3ze alarms predicted as true False alarms must s3ll be evaluated 23

24 Performance 24

25 About Me Ana Sima EPFL Master in SW Systems, 2015 Previously worked on a framework for unified stream & batch processing (Cyclone) Joined ZHAW since November

26 Welcome back from the holidays! 26

27 Why is alarm processing a 3mecri3cal mission? 27

28 When things go wrong It s all about response 3me! 28

29 Performance tes3ng Data Generator Alarm stream Data Consumer Streaming Historic Machine Learning 29

30 Designing the consumer Spark - A unified engine for Big Data Processing Streaming Machine Learning SQL DataFrames DataSets RDDs Graph Processing Kryo Gson Jackson 30

31 No one size fits all, so how do I choose??? 31

32 The base case Producer Write N alarms/s Alarms Consumer Project & Filter Dis3nct Compute Histogram Predict True / False Historic query 32

33 Consumer Workflow Every window of 10s Streaming maddrsinwindow = alarms.map(alarm::getmacaddress).distinct(); Historic Select MacAddress, count(*) from mongodb Where MacAddress in maddrsinwindow Group By MacAddress Machine Learning (Covered in first half of talk) 33

34 SAVE Benchmark Iden3fying Bo_lenecks ZHAW Datalab 1 34

35 Iden3fying bo_lenecks (1) 1. Started experiment with 8-core Kata Producer 8-core Spark Consumer 2. Consumer slow even for a basic streaming task Count number of elements in window => a few sec 3. Producer not outpuung expected throughput around 12K/s (record size 600B) 35

36 Common root cause? Producer Serialize, (fasterxml Jackson) Write N alarms/s Serialized Alarms Consumer Deserialize, Project & Filter Dis3nct Compute Histogram Predict True / False Historic query 36

37 Does it really ma_er? Per-object serializa3on only takes a few tens of ns Mul3ply that by 12K.. And btw, have it all sent in 1s! 37

38 Are we using the right tool? Answer: benchmarks *! If your environment primarily deals with lots of small JSON requests [ ] then GSON is your library of interest. Jackson struggles the most with small files. Jackson: up to 3x slower for small objects *The ulmmate JSON library: hrp://blog.takipi.com/the-ulmmate-json-library-json-simple-vs-gson-vs-jackson-vs-json/ 38

39 SAVE Benchmark Iden3fying Bo_lenecks ZHAW Datalab 2 39

40 Results (1) + 30,000 25,000 20,000 15,000 10,000 Producer max throughput (alarms/s) 5,000 0 Jackson Gson 30,000 25,000 20,000 15,000 10,000 Consumer max throughput (alarms/s) 5,000 0 Jackson Gson * < 30 LOC change => ~ 2x speedup in both Producer & Consumer 40

41 How much will fit in 10s? 12 Breakdown of /me / subtask Kata + Jackson (3.5 K alarms) Kata + GSON (3.5 K alarms) Kata + GSON (5.7 K alarms) Streaming Historic Machine Learning Note: the alarms RDD is used for both the streaming & the ML part => serializer change benefits both 41

42 Iden3fying bo_lenecks (2) 1. Caching? Brought some improvement, but small (~ 5%) The amount of data reused is too li_le (20 MB) 2. Simplifying the design? The KISS* principle Keep It Simple, Stupid! 42

43 Simplifica3on Producer Write N alarms/s Alarms Consumer Project & Filter Dis3nct Compute Histogram Predict True / False Alarms Capped collec3on 43

44 SAVE Benchmark Iden3fying Bo_lenecks ZHAW Datalab 3 44

45 Results (2) 30,000 25,000 20,000 15,000 10,000 5,000 0 Producer Max throughput (alarms/s) 30,000 25,000 20,000 15,000 10,000 5, Max throughput (alarms/s) Note: Producer slow, but previous max consumer throughput! Consumer: ~ 50% speedup. 45

46 How much will fit in 10s? 12 Breakdown of /me / subtask Kata + Jackson (3.5 K alarms) Kata + GSON (3.5 K alarms) Mongo + GSON (3.5 K alarms) Streaming Historic Machine Learning 46

47 Root cause? Using Kata direct stream with a single par33on => poor parallelism Most opera3ons run on a single executor Even though Spark is configured to run on 8 cores Using MongoDB => finer grained control over parallelism level for the RDD Read an array of documents from tailable cursor Parallelize array into RDD & run ML on each part 47

48 FIX IT! Configuring the Kata Direct Stream in Spark with proper seungs. (num par33ons = num cores) 48

49 Results (3) 30,000 25,000 20,000 15,000 10,000 5,000 0 Producer Max throughput (alarms/s) 30,000 25,000 20,000 15,000 10,000 5, Max throughput (alarms/s) 49

50 Designing the consumer Spark - A unified engine for Big Data Processing Streaming SQL Machine Learning Easy to use tools can some/mes be tricky to get right DataFrames DataSets RDDs Graph Processing Kryo Gson Jackson 50

51 Conclusions & Lessons Learned Machine learning algorithm of Spark can be directly applied Works well for small and big data No re-writing necessary to make algorithm scale across computing cluster Spark Streaming can t be used out of the box for stream and batch processing: Need to use a persistency layer Requires significant performance tuning Working with our industry partner Sitasys: Great collaboration Working with real-world problem is very rewarding Can make real contribution and enhance existing algorithms and technology 51

52 Appendix 52

53 Pipeline Example Code (Java) // Create string indexer Transformer. StringIndexer stringindexer = new StringIndexer().setInputCol("loca/on").setOutputCol("indexLoca/on"); // Create random forest EsMmator. RandomForestClassifier randomforestclf = new RandomForestClassifier().setFeaturesCol("indexLoca/on").setLabelCol("label"); // Create pipeline EsMmator (specifying the ML workflow). Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{stringIndexer, randomforestclf}); // Create cross validamon EsMmator (wrapping the pipeline EsMmator). ParamMap[] paramgrid = new ParamGridBuilder().addGrid(randomForestClf.numTrees()(), new int[]{10, 20, 30}).addGrid(randomForestClf.maxDepth(), new int[]{5, 25, 50}).build(); CrossValidator crossvalidator = new CrossValidator().setEs3mator(pipeline).setEs3matorParamMaps(paramGrid).setEvaluator(new BinaryClassifica3onEvaluator()).setNumFolds(10); 53

54 Training, Tes3ng and Evalua3on Example Code (Java) // Given (creamon of dataset no included in example code) Dataset<Row> trainset = ; Dataset<Row> testset = ; // Fit the cross validator to training documents. CrossValidatorModel model = crossvalidator.fit(trainset); // Make predicmons on test documents. Dataset<Row> predic3ons = model.transform(testset); // Create evaluator. Mul3classClassifica3onEvaluator evaluator = new Mul3classClassifica3onEvaluator().setLabelCol("label").setPredic3onCol("predic/on"); // Evaluate predicmons. evaluator.setmetricname("accuracy"); double accuracy = evaluator.evaluate(predic3ons); 54

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015 About Secret Word from EPAM itsubbotnik Big Data Training 3 Contacts

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

Unifying Big Data Workloads in Apache Spark

Unifying Big Data Workloads in Apache Spark Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache

More information

Processing of big data with Apache Spark

Processing of big data with Apache Spark Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT

More information

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies

Informa)on Retrieval and Map- Reduce Implementa)ons. Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies Informa)on Retrieval and Map- Reduce Implementa)ons Mohammad Amir Sharif PhD Student Center for Advanced Computer Studies mas4108@louisiana.edu Map-Reduce: Why? Need to process 100TB datasets On 1 node:

More information

Creating a Recommender System. An Elasticsearch & Apache Spark approach

Creating a Recommender System. An Elasticsearch & Apache Spark approach Creating a Recommender System An Elasticsearch & Apache Spark approach My Profile SKILLS Álvaro Santos Andrés Big Data & Analytics Solution Architect in Ericsson with more than 12 years of experience focused

More information

Apache Spark 2.0. Matei

Apache Spark 2.0. Matei Apache Spark 2.0 Matei Zaharia @matei_zaharia What is Apache Spark? Open source data processing engine for clusters Generalizes MapReduce model Rich set of APIs and libraries In Scala, Java, Python and

More information

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data

More information

ML pipelines at OK. Dmitry Bugaychenko

ML pipelines at OK. Dmitry Bugaychenko ML pipelines at OK Dmitry Bugaychenko OK is 70 000 000+ monthly unique users OK is 800 000 000+ family links in the social graph OK is A place where people share their posi9ve feelings OK is 10000+ servers

More information

Chapter 4: Apache Spark

Chapter 4: Apache Spark Chapter 4: Apache Spark Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich PD Dr. Matthias Renz 2015, Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec,

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch Nick Pentreath Nov / 14 / 16 Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch About @MLnick Principal Engineer, IBM Apache Spark PMC Focused on machine learning

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory

More information

BIG DATA COURSE CONTENT

BIG DATA COURSE CONTENT BIG DATA COURSE CONTENT [I] Get Started with Big Data Microsoft Professional Orientation: Big Data Duration: 12 hrs Course Content: Introduction Course Introduction Data Fundamentals Introduction to Data

More information

New Developments in Spark

New Developments in Spark New Developments in Spark And Rethinking APIs for Big Data Matei Zaharia and many others What is Spark? Unified computing engine for big data apps > Batch, streaming and interactive Collection of high-level

More information

Specialist ICT Learning

Specialist ICT Learning Specialist ICT Learning APPLIED DATA SCIENCE AND BIG DATA ANALYTICS GTBD7 Course Description This intensive training course provides theoretical and technical aspects of Data Science and Business Analytics.

More information

Submitted to: Dr. Sunnie Chung. Presented by: Sonal Deshmukh Jay Upadhyay

Submitted to: Dr. Sunnie Chung. Presented by: Sonal Deshmukh Jay Upadhyay Submitted to: Dr. Sunnie Chung Presented by: Sonal Deshmukh Jay Upadhyay Submitted to: Dr. Sunny Chung Presented by: Sonal Deshmukh Jay Upadhyay What is Apache Survey shows huge popularity spike for Apache

More information

Apache Ignite - Using a Memory Grid for Heterogeneous Computation Frameworks A Use Case Guided Explanation. Chris Herrera Hashmap

Apache Ignite - Using a Memory Grid for Heterogeneous Computation Frameworks A Use Case Guided Explanation. Chris Herrera Hashmap Apache Ignite - Using a Memory Grid for Heterogeneous Computation Frameworks A Use Case Guided Explanation Chris Herrera Hashmap Topics Who - Key Hashmap Team Members The Use Case - Our Need for a Memory

More information

About Codefrux While the current trends around the world are based on the internet, mobile and its applications, we try to make the most out of it. As for us, we are a well established IT professionals

More information

MLeap: Release Spark ML Pipelines

MLeap: Release Spark ML Pipelines MLeap: Release Spark ML Pipelines Hollin Wilkins & Mikhail Semeniuk SATURDAY Web Dev @ Cornell Studied some General Biology Rails Consulting for TrueCar and other companies Implement ML model for ClearBook

More information

Fluentd + MongoDB + Spark = Awesome Sauce

Fluentd + MongoDB + Spark = Awesome Sauce Fluentd + MongoDB + Spark = Awesome Sauce Nishant Sahay, Sr. Architect, Wipro Limited Bhavani Ananth, Tech Manager, Wipro Limited Your company logo here Wipro Open Source Practice: Vision & Mission Vision

More information

Spark Overview. Professor Sasu Tarkoma.

Spark Overview. Professor Sasu Tarkoma. Spark Overview 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Apache Spark Spark is a general-purpose computing framework for iterative tasks API is provided for Java, Scala and Python The model is based

More information

Resilient Distributed Datasets

Resilient Distributed Datasets Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,

More information

Benchmarking Spark ML using BigBench. Sweta Singh TPCTC 2016

Benchmarking Spark ML using BigBench. Sweta Singh TPCTC 2016 Benchmarking Spark ML using BigBench Sweta Singh singhswe@us.ibm.com TPCTC 2016 Motivation Study the performance of Machine Learning use cases on large data warehouses in context of assessing Alternate

More information

A Tutorial on Apache Spark

A Tutorial on Apache Spark A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:

More information

Databricks, an Introduction

Databricks, an Introduction Databricks, an Introduction Chuck Connell, Insight Digital Innovation Insight Presentation Speaker Bio Senior Data Architect at Insight Digital Innovation Focus on Azure big data services HDInsight/Hadoop,

More information

Stages of (Batch) Machine Learning

Stages of (Batch) Machine Learning Evalua&on Stages of (Batch) Machine Learning Given: labeled training data X, Y = {hx i,y i i} n i=1 Assumes each x i D(X ) with y i = f target (x i ) Train the model: model ß classifier.train(x, Y ) x

More information

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.

Activator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success. ACTIVATORS Designed to give your team assistance when you need it most without

More information

WHITE PAPER. Apache Spark: RDD, DataFrame and Dataset. API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3

WHITE PAPER. Apache Spark: RDD, DataFrame and Dataset. API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3 WHITE PAPER Apache Spark: RDD, DataFrame and Dataset API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3 Prepared by: Eyal Edelman, Big Data Practice Lead Michael Birch, Big Data and

More information

1 Big Data Hadoop. 1. Introduction About this Course About Big Data Course Logistics Introductions

1 Big Data Hadoop. 1. Introduction About this Course About Big Data Course Logistics Introductions Big Data Hadoop Architect Online Training (Big Data Hadoop + Apache Spark & Scala+ MongoDB Developer And Administrator + Apache Cassandra + Impala Training + Apache Kafka + Apache Storm) 1 Big Data Hadoop

More information

Agenda. Spark Platform Spark Core Spark Extensions Using Apache Spark

Agenda. Spark Platform Spark Core Spark Extensions Using Apache Spark Agenda Spark Platform Spark Core Spark Extensions Using Apache Spark About me Vitalii Bondarenko Data Platform Competency Manager Eleks www.eleks.com 20 years in software development 9+ years of developing

More information

Accelerating Spark Workloads using GPUs

Accelerating Spark Workloads using GPUs Accelerating Spark Workloads using GPUs Rajesh Bordawekar, Minsik Cho, Wei Tan, Benjamin Herta, Vladimir Zolotov, Alexei Lvov, Liana Fong, and David Kung IBM T. J. Watson Research Center 1 Outline Spark

More information

Best Practices and Performance Tuning on Amazon Elastic MapReduce

Best Practices and Performance Tuning on Amazon Elastic MapReduce Best Practices and Performance Tuning on Amazon Elastic MapReduce Michael Hanisch Solutions Architect Amo Abeyaratne Big Data and Analytics Consultant ANZ 12.04.2016 2016, Amazon Web Services, Inc. or

More information

Data Science Bootcamp Curriculum. NYC Data Science Academy

Data Science Bootcamp Curriculum. NYC Data Science Academy Data Science Bootcamp Curriculum NYC Data Science Academy 100+ hours free, self-paced online course. Access to part-time in-person courses hosted at NYC campus Machine Learning with R and Python Foundations

More information

Fast Big Data Analytics with Spark on Tachyon

Fast Big Data Analytics with Spark on Tachyon 1 Fast Big Data Analytics with Spark on Tachyon Shaoshan Liu http://www.meetup.com/tachyon/ 2 Fun Facts Tachyon A tachyon is a particle that always moves faster than light. The word comes from the Greek:

More information

KNIME for the life sciences Cambridge Meetup

KNIME for the life sciences Cambridge Meetup KNIME for the life sciences Cambridge Meetup Greg Landrum, Ph.D. KNIME.com AG 12 July 2016 What is KNIME? A bit of motivation: tool blending, data blending, documentation, automation, reproducibility More

More information

Data Analytics and Machine Learning: From Node to Cluster

Data Analytics and Machine Learning: From Node to Cluster Data Analytics and Machine Learning: From Node to Cluster Presented by Viswanath Puttagunta Ganesh Raju Understanding use cases to optimize on ARM Ecosystem Date BKK16-404B March 10th, 2016 Event Linaro

More information

Turning Relational Database Tables into Spark Data Sources

Turning Relational Database Tables into Spark Data Sources Turning Relational Database Tables into Spark Data Sources Kuassi Mensah Jean de Lavarene Director Product Mgmt Director Development Server Technologies October 04, 2017 3 Safe Harbor Statement The following

More information

Big Data processing: a framework suitable for Economists and Statisticians

Big Data processing: a framework suitable for Economists and Statisticians Big Data processing: a framework suitable for Economists and Statisticians Giuseppe Bruno 1, D. Condello 1 and A. Luciani 1 1 Economics and statistics Directorate, Bank of Italy; Economic Research in High

More information

ODL based AI/ML for Networks Prem Sankar Gopannan, Ericsson YuLing Chen, Cisco

ODL based AI/ML for Networks Prem Sankar Gopannan, Ericsson YuLing Chen, Cisco ODL based AI/ML for Networks Prem Sankar Gopannan, Ericsson YuLing Chen, Cisco ODL based AI/ML for Networks Agenda Role of AI/ML in networks and usecase ODL Overview in 30 seconds ODL TSDR Algorithms History

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

ID2223 Lecture 3: Gradient Descent and SparkML

ID2223 Lecture 3: Gradient Descent and SparkML ID2223 Lecture 3: Gradient Descent and SparkML Optimization Theory Review 2017-11-08 ID2223, Large Scale Machine Learning and Deep Learning, Jim Dowling 2/91 Unconstrained Optimization Unconstrained optimization

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

Lambda Architecture for Batch and Stream Processing. October 2018

Lambda Architecture for Batch and Stream Processing. October 2018 Lambda Architecture for Batch and Stream Processing October 2018 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document is provided for informational purposes only.

More information

RV: A Run'me Verifica'on Framework for Monitoring, Predic'on and Mining

RV: A Run'me Verifica'on Framework for Monitoring, Predic'on and Mining RV: A Run'me Verifica'on Framework for Monitoring, Predic'on and Mining Patrick Meredith Grigore Rosu University of Illinois at Urbana Champaign (UIUC) Run'me Verifica'on, Inc. (joint work with Dongyun

More information

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics using Apache Hadoop and Spark with Scala Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important

More information

Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Suren Byna, Mike F. Ringenburg, Prabhat

Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Suren Byna, Mike F. Ringenburg, Prabhat H5Spark H5Spark: Bridging the I/O Gap between Spark and Scien9fic Data Formats on HPC Systems Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Suren Byna, Mike

More information

Scaling MongoDB: Avoiding Common Pitfalls. Jon Tobin Senior Systems

Scaling MongoDB: Avoiding Common Pitfalls. Jon Tobin Senior Systems Scaling MongoDB: Avoiding Common Pitfalls Jon Tobin Senior Systems Engineer Jon.Tobin@percona.com @jontobs www.linkedin.com/in/jonathanetobin Agenda Document Design Data Management Replica3on & Failover

More information

Higher level data processing in Apache Spark

Higher level data processing in Apache Spark Higher level data processing in Apache Spark Pelle Jakovits 12 October, 2016, Tartu Outline Recall Apache Spark Spark DataFrames Introduction Creating and storing DataFrames DataFrame API functions SQL

More information

Evolution of an Apache Spark Architecture for Processing Game Data

Evolution of an Apache Spark Architecture for Processing Game Data Evolution of an Apache Spark Architecture for Processing Game Data Nick Afshartous WB Analytics Platform May 17 th 2017 May 17 th, 2017 About Me nafshartous@wbgames.com WB Analytics Core Platform Lead

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

Deep Learning Frameworks with Spark and GPUs

Deep Learning Frameworks with Spark and GPUs Deep Learning Frameworks with Spark and GPUs Abstract Spark is a powerful, scalable, real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel,

More information

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda 1 Motivation And Intro Programming Model Spark Data Transformation Model Construction Model Training Model Inference Execution Model Data Parallel Training

More information

Document Databases: MongoDB

Document Databases: MongoDB NDBI040: Big Data Management and NoSQL Databases hp://www.ksi.mff.cuni.cz/~svoboda/courses/171-ndbi040/ Lecture 9 Document Databases: MongoDB Marn Svoboda svoboda@ksi.mff.cuni.cz 28. 11. 2017 Charles University

More information

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Research challenges in data-intensive computing The Stratosphere Project Apache Flink Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive

More information

Copyright 2018, Oracle and/or its affiliates. All rights reserved.

Copyright 2018, Oracle and/or its affiliates. All rights reserved. Beyond SQL Tuning: Insider's Guide to Maximizing SQL Performance Monday, Oct 22 10:30 a.m. - 11:15 a.m. Marriott Marquis (Golden Gate Level) - Golden Gate A Ashish Agrawal Group Product Manager Oracle

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 22, 2016 Course Information Website: http://www.stat.ucdavis.edu/~chohsieh/teaching/ ECS289G_Fall2016/main.html My office: Mathematical Sciences

More information

CS535 Big Data Fall 2017 Colorado State University 10/10/2017 Sangmi Lee Pallickara Week 8- A.

CS535 Big Data Fall 2017 Colorado State University   10/10/2017 Sangmi Lee Pallickara Week 8- A. CS535 Big Data - Fall 2017 Week 8-A-1 CS535 BIG DATA FAQs Term project proposal New deadline: Tomorrow PA1 demo PART 1. BATCH COMPUTING MODELS FOR BIG DATA ANALYTICS 5. ADVANCED DATA ANALYTICS WITH APACHE

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

IBM Db2 Event Store Simplifying and Accelerating Storage and Analysis of Fast Data. IBM Db2 Event Store

IBM Db2 Event Store Simplifying and Accelerating Storage and Analysis of Fast Data. IBM Db2 Event Store IBM Db2 Event Store Simplifying and Accelerating Storage and Analysis of Fast Data IBM Db2 Event Store Disclaimer The information contained in this presentation is provided for informational purposes only.

More information

Building High Performance Apps using NoSQL. Swami Sivasubramanian General Manager, AWS NoSQL

Building High Performance Apps using NoSQL. Swami Sivasubramanian General Manager, AWS NoSQL Building High Performance Apps using NoSQL Swami Sivasubramanian General Manager, AWS NoSQL Building high performance apps There is a lot to building high performance apps Scalability Performance at high

More information

Apache Spark and Scala Certification Training

Apache Spark and Scala Certification Training About Intellipaat Intellipaat is a fast-growing professional training provider that is offering training in over 150 most sought-after tools and technologies. We have a learner base of 600,000 in over

More information

Approaching the Petabyte Analytic Database: What I learned

Approaching the Petabyte Analytic Database: What I learned Disclaimer This document is for informational purposes only and is subject to change at any time without notice. The information in this document is proprietary to Actian and no part of this document may

More information

Programming Systems for Big Data

Programming Systems for Big Data Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There

More information

Stream Processing on IoT Devices using Calvin Framework

Stream Processing on IoT Devices using Calvin Framework Stream Processing on IoT Devices using Calvin Framework by Ameya Nayak A Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science Supervised

More information

Big Data Architect.

Big Data Architect. Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional

More information

Crescando: Predictable Performance for Unpredictable Workloads

Crescando: Predictable Performance for Unpredictable Workloads Crescando: Predictable Performance for Unpredictable Workloads G. Alonso, D. Fauser, G. Giannikis, D. Kossmann, J. Meyer, P. Unterbrunner Amadeus S.A. ETH Zurich, Systems Group (Funded by Enterprise Computing

More information

Decision models for the Digital Economy

Decision models for the Digital Economy Decision Camp 2017 Birbeck, University of London Decision models for the Digital Economy Vijay Bandekar InteliOps Inc. Agenda Problem Statement Proposed Solution Case studies and results Key takeaways

More information

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu NoSQL Databases MongoDB vs Cassandra Kenny Huynh, Andre Chik, Kevin Vu Introduction - Relational database model - Concept developed in 1970 - Inefficient - NoSQL - Concept introduced in 1980 - Related

More information

ICALEPS 2013 Exploring No-SQL Alternatives for ALMA Monitoring System ADC

ICALEPS 2013 Exploring No-SQL Alternatives for ALMA Monitoring System ADC ICALEPS 2013 Exploring No-SQL Alternatives for ALMA Monitoring System Overview The current paradigm (CCL and Relational DataBase) Propose of a new monitor data system using NoSQL Monitoring Storage Requirements

More information

09/05/2018. Spark MLlib is the Spark component providing the machine learning/data mining algorithms

09/05/2018. Spark MLlib is the Spark component providing the machine learning/data mining algorithms Spark MLlib is the Spark component providing the machine learning/data mining algorithms Pre-processing techniques Classification (supervised learning) Clustering (unsupervised learning) Itemset mining

More information

Apache Hive for Oracle DBAs. Luís Marques

Apache Hive for Oracle DBAs. Luís Marques Apache Hive for Oracle DBAs Luís Marques About me Oracle ACE Alumnus Long time open source supporter Founder of Redglue (www.redglue.eu) works for @redgluept as Lead Data Architect @drune After this talk,

More information

MapReduce. Tom Anderson

MapReduce. Tom Anderson MapReduce Tom Anderson Last Time Difference between local state and knowledge about other node s local state Failures are endemic Communica?on costs ma@er Why Is DS So Hard? System design Par??oning of

More information

Introduc)on to Apache Ka1a. Jun Rao Co- founder of Confluent

Introduc)on to Apache Ka1a. Jun Rao Co- founder of Confluent Introduc)on to Apache Ka1a Jun Rao Co- founder of Confluent Agenda Why people use Ka1a Technical overview of Ka1a What s coming What s Apache Ka1a Distributed, high throughput pub/sub system Ka1a Usage

More information

Qunar Performs Real-Time Data Analytics up to 300x Faster with Alluxio

Qunar Performs Real-Time Data Analytics up to 300x Faster with Alluxio CASE STUDY Qunar Performs Real-Time Data Analytics up to 300x Faster with Alluxio Xueyan Li, Lei Xu, and Xiaoxu Lv Software Engineers at Qunar At Qunar, we have been running Alluxio in production for over

More information

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F.

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F. Applied Spark From Concepts to Bitcoin Analytics Andrew F. Hart ahart@apache.org @andrewfhart My Day Job CTO, Pogoseat Upgrade technology for live events 3/28/16 QCON-SP Andrew Hart 2 Additionally Member,

More information

Incrementally Parallelizing. Twofold Speedup on a Quad-Core. Thread-Level Speculation. A Case Study with BerkeleyDB. What Am I Working on Now?

Incrementally Parallelizing. Twofold Speedup on a Quad-Core. Thread-Level Speculation. A Case Study with BerkeleyDB. What Am I Working on Now? Incrementally Parallelizing Database Transactions with Thread-Level Speculation Todd C. Mowry Carnegie Mellon University (in collaboration with Chris Colohan, J. Gregory Steffan, and Anastasia Ailamaki)

More information

Real-Time Decisions Using ML on the Google Cloud Platform. Przemysław Pastuszka & Carlos Garcia QCon London 7th March 2018

Real-Time Decisions Using ML on the Google Cloud Platform. Przemysław Pastuszka & Carlos Garcia QCon London 7th March 2018 Real-Time Decisions Using ML on the Google Cloud Platform Przemysław Pastuszka & Carlos Garcia QCon London 7th March 2018 How many of you are interested in machine learning? but how many of you are running

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

Gabriel Villa. Architecting an Analytics Solution on AWS

Gabriel Villa. Architecting an Analytics Solution on AWS Gabriel Villa Architecting an Analytics Solution on AWS Cloud and Data Architect Skilled leader, solution architect, and technical expert focusing primarily on Microsoft technologies and AWS. Passionate

More information

Amol Deshpande, University of Maryland Lisa Hellerstein, Polytechnic University, Brooklyn

Amol Deshpande, University of Maryland Lisa Hellerstein, Polytechnic University, Brooklyn Amol Deshpande, University of Maryland Lisa Hellerstein, Polytechnic University, Brooklyn Mo>va>on: Parallel Query Processing Increasing parallelism in compu>ng Shared nothing clusters, mul> core technology,

More information

MLlib. Distributed Machine Learning on. Evan Sparks. UC Berkeley

MLlib. Distributed Machine Learning on. Evan Sparks.  UC Berkeley MLlib & ML base Distributed Machine Learning on Evan Sparks UC Berkeley January 31st, 2014 Collaborators: Ameet Talwalkar, Xiangrui Meng, Virginia Smith, Xinghao Pan, Shivaram Venkataraman, Matei Zaharia,

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

Distributed Machine Learning" on Spark

Distributed Machine Learning on Spark Distributed Machine Learning" on Spark Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Outline Data flow vs. traditional network programming Spark computing engine Optimization Example Matrix Computations

More information

TPCX-BB (BigBench) Big Data Analytics Benchmark

TPCX-BB (BigBench) Big Data Analytics Benchmark TPCX-BB (BigBench) Big Data Analytics Benchmark Bhaskar D Gowda Senior Staff Engineer Analytics & AI Solutions Group Intel Corporation bhaskar.gowda@intel.com 1 Agenda Big Data Analytics & Benchmarks Industry

More information

Main Points. File systems. Storage hardware characteris7cs. File system usage Useful abstrac7ons on top of physical devices

Main Points. File systems. Storage hardware characteris7cs. File system usage Useful abstrac7ons on top of physical devices Storage Systems Main Points File systems Useful abstrac7ons on top of physical devices Storage hardware characteris7cs Disks and flash memory File system usage pa@erns File Systems Abstrac7on on top of

More information

SparkBench: A Comprehensive Spark Benchmarking Suite Characterizing In-memory Data Analytics

SparkBench: A Comprehensive Spark Benchmarking Suite Characterizing In-memory Data Analytics SparkBench: A Comprehensive Spark Benchmarking Suite Characterizing In-memory Data Analytics Min LI,, Jian Tan, Yandong Wang, Li Zhang, Valentina Salapura, Alan Bivens IBM TJ Watson Research Center * A

More information

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत Department of Computational and Data Sciences L3: Spark & RDD Department of Computational and Data Science, IISc, 2016 This

More information

Distributed Systems. 22. Spark. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 22. Spark. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 22. Spark Paul Krzyzanowski Rutgers University Fall 2016 November 26, 2016 2015-2016 Paul Krzyzanowski 1 Apache Spark Goal: generalize MapReduce Similar shard-and-gather approach to

More information

Flash Reliability in Produc4on: The Importance of Measurement and Analysis in Improving System Reliability

Flash Reliability in Produc4on: The Importance of Measurement and Analysis in Improving System Reliability Flash Reliability in Produc4on: The Importance of Measurement and Analysis in Improving System Reliability Bianca Schroeder University of Toronto (Currently on sabbatical at Microsoft Research Redmond)

More information

Data Analytics at Logitech Snowflake + Tableau = #Winning

Data Analytics at Logitech Snowflake + Tableau = #Winning Welcome # T C 1 8 Data Analytics at Logitech Snowflake + Tableau = #Winning Avinash Deshpande I am a futurist, scientist, engineer, designer, data evangelist at heart Find me at Avinash Deshpande Chief

More information

Druid Data Ingest. Wayne M Adams Data Science and Business Analy7cs Meetup 26 February 2014

Druid Data Ingest. Wayne M Adams Data Science and Business Analy7cs Meetup 26 February 2014 Druid Data Ingest Wayne M Adams Data Science and Business Analy7cs Meetup 26 February 2014 By Druid, we mean The column- oriented, distributed, real- 7me analy7c datastore (hjp://druid.io/ and hjps://github.com/metamx/druid)

More information

MapReduce review. Spark and distributed data processing. Who am I? Today s Talk. Reynold Xin

MapReduce review. Spark and distributed data processing. Who am I? Today s Talk. Reynold Xin Who am I? Reynold Xin Stanford CS347 Guest Lecture Spark and distributed data processing PMC member, Apache Spark Cofounder & Chief Architect, Databricks PhD on leave (ABD), UC Berkeley AMPLab Reynold

More information

Machine learning algorithms for datasets popularity prediction

Machine learning algorithms for datasets popularity prediction Machine learning algorithms for datasets popularity prediction Kipras Kančys and Valentin Kuznetsov Abstract. This report represents continued study where ML algorithms were used to predict databases popularity.

More information

Data Analytics Job Guarantee Program

Data Analytics Job Guarantee Program Data Analytics Job Guarantee Program 1. INSTALLATION OF VMWARE 2. MYSQL DATABASE 3. CORE JAVA 1.1 Types of Variable 1.2 Types of Datatype 1.3 Types of Modifiers 1.4 Types of constructors 1.5 Introduction

More information

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015 Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document

More information

Data Science Training

Data Science Training Data Science Training R, Predictive Modeling, Machine Learning, Python, Bigdata & Spark 9886760678 Introduction: This is a comprehensive course which builds on the knowledge and experience a business analyst

More information

Big Streaming Data Processing. How to Process Big Streaming Data 2016/10/11. Fraud detection in bank transactions. Anomalies in sensor data

Big Streaming Data Processing. How to Process Big Streaming Data 2016/10/11. Fraud detection in bank transactions. Anomalies in sensor data Big Data Big Streaming Data Big Streaming Data Processing Fraud detection in bank transactions Anomalies in sensor data Cat videos in tweets How to Process Big Streaming Data Raw Data Streams Distributed

More information