Real-Time Alarm Verifica/on with Spark Streaming and Machine Learning

Size: px

Start display at page:

Download "Real-Time Alarm Verifica/on with Spark Streaming and Machine Learning"

Kerrie Fletcher
5 years ago
Views:

1 Real-Time Alarm Verifica/on with Spark Streaming and Machine Learning Ana Sima, Jan Stampfli, Kurt Stockinger Zurich University of Applied Sciences Swiss Big Data User Group Mee3ng January 23, 2017

2 About Me Prof. Dr. Kurt Stockinger Since summer 2013 at ZHAW Databases, Data Warehousing, Big Data : Data Warehouse & Business Intelligence Architect : Computer Scien3st : Computer Scien3st 2

3 What is the Problem with Alarm Systems? 3

4 A Typical Example Serious universi3es 4

5 with Serious Students 5

6 A Serious Party in a Student Home 6

7 Serious Fire Fighters 7

8 Why Machine Learning? Around 90% of reported incidents are false alarms Human operators should priori3ze true alarms Machine learning is able to discover the pa_erns buried in the alarm data separate true from false alarms with a high accuracy 8

9 DIGITAL FOOTPRINT OF A SECURED OBJECT Alarm panel Transmitter User behavior Alarm types User behavior will improve overall process User based business models are possible with machine learning tools e.g. Services payments only when an alarm panel is activated 9

10 Research Project with Sitasys 10

11 Talk Outline Machine learning for alarm verifica3on End-to-end performance analysis: Stream processing (live analysis) Batch processing (historic analysis) Online machine learning (live verifica3on) 11

12 Machine Learning with Spark 12

13 About Me Jan Stampfli The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again. ZHAW Bachelor in Computer Science Graduated in 2014 Research assistant at ZHAW Since September 2014 Working in research projects on the topics Big Data Machine Learning Informa3on Retrieval 13

14 A Typical Machine Learning Process 14

15 Spark ML Pipeline Pipeline DataFrame Transformer Transformer Pipeline Transformer Estimator Estimator Estimator loca/on label Bern 1 Chur 0 Sion 1 hash predic/on probabili/es 0 [0.51, 0.49] 0 [0.99, 0.01] 1 [0.13, 0.87] h_p://spark.apache.org/docs/latest/ml-pipeline.html 15

16 Spark ML Tuning CrossValidator DataFrame Pipeline Evaluator Random Forest Number of trees Max depth of trees Parameter Map [ 10, 20, 30 ] [ 5, 25, 50 ] h_p://spark.apache.org/docs/latest/ml-tuning.html 16

17 Spark ML Evalua3on Prediction DataFrame Evaluator label probabili/es 1 [0.51, 0.49] 0 [0.99, 0.01] 1 [0.13, 0.87] Accuracy Precision Recall h_p://spark.apache.org/docs/latest/mllib-evalua3on-metrics.html 17

18 Applied Machine Learning 18

19 Data about Real Alarms Total False alarms True alarms Total Training Test Features Loca3on, sensor type, day of week, Labels 0 = false alarm 1 = true alarm 19

20 Applied Algorithms Random Forest Support Vector Machine Logis3c Regression Deep Neural Network (DNN) 20

21 Results % % Accuracy Jan Stampfli, Kurt Stockinger (2016). Applied data Science: Using Machine Learning for Alarm Verifica3on, ERCIM News, 107,

22 Evalua3on Random Forest Confusion matrix Predic/on True Predic/on False Label True Label False Precision (correctly predicted alarms) Predic/on True Predic/on False Label True 92.07% Label False 92.62% Recall (correctly detected alarms) Predic/on True Predic/on False Label True 93.38% Label False 91.18% 22

23 Conclusion about ML Human operators are now able to priori3ze alarms predicted as true False alarms must s3ll be evaluated 23

24 Performance 24

25 About Me Ana Sima EPFL Master in SW Systems, 2015 Previously worked on a framework for unified stream & batch processing (Cyclone) Joined ZHAW since November

26 Welcome back from the holidays! 26

27 Why is alarm processing a 3mecri3cal mission? 27

28 When things go wrong It s all about response 3me! 28

29 Performance tes3ng Data Generator Alarm stream Data Consumer Streaming Historic Machine Learning 29

30 Designing the consumer Spark - A unified engine for Big Data Processing Streaming Machine Learning SQL DataFrames DataSets RDDs Graph Processing Kryo Gson Jackson 30

31 No one size fits all, so how do I choose??? 31

32 The base case Producer Write N alarms/s Alarms Consumer Project & Filter Dis3nct Compute Histogram Predict True / False Historic query 32

33 Consumer Workflow Every window of 10s Streaming maddrsinwindow = alarms.map(alarm::getmacaddress).distinct(); Historic Select MacAddress, count(*) from mongodb Where MacAddress in maddrsinwindow Group By MacAddress Machine Learning (Covered in first half of talk) 33

34 SAVE Benchmark Iden3fying Bo_lenecks ZHAW Datalab 1 34

35 Iden3fying bo_lenecks (1) 1. Started experiment with 8-core Kata Producer 8-core Spark Consumer 2. Consumer slow even for a basic streaming task Count number of elements in window => a few sec 3. Producer not outpuung expected throughput around 12K/s (record size 600B) 35

36 Common root cause? Producer Serialize, (fasterxml Jackson) Write N alarms/s Serialized Alarms Consumer Deserialize, Project & Filter Dis3nct Compute Histogram Predict True / False Historic query 36

37 Does it really ma_er? Per-object serializa3on only takes a few tens of ns Mul3ply that by 12K.. And btw, have it all sent in 1s! 37

38 Are we using the right tool? Answer: benchmarks *! If your environment primarily deals with lots of small JSON requests [ ] then GSON is your library of interest. Jackson struggles the most with small files. Jackson: up to 3x slower for small objects *The ulmmate JSON library: hrp://blog.takipi.com/the-ulmmate-json-library-json-simple-vs-gson-vs-jackson-vs-json/ 38

39 SAVE Benchmark Iden3fying Bo_lenecks ZHAW Datalab 2 39

40 Results (1) + 30,000 25,000 20,000 15,000 10,000 Producer max throughput (alarms/s) 5,000 0 Jackson Gson 30,000 25,000 20,000 15,000 10,000 Consumer max throughput (alarms/s) 5,000 0 Jackson Gson * < 30 LOC change => ~ 2x speedup in both Producer & Consumer 40

41 How much will fit in 10s? 12 Breakdown of /me / subtask Kata + Jackson (3.5 K alarms) Kata + GSON (3.5 K alarms) Kata + GSON (5.7 K alarms) Streaming Historic Machine Learning Note: the alarms RDD is used for both the streaming & the ML part => serializer change benefits both 41

42 Iden3fying bo_lenecks (2) 1. Caching? Brought some improvement, but small (~ 5%) The amount of data reused is too li_le (20 MB) 2. Simplifying the design? The KISS* principle Keep It Simple, Stupid! 42

43 Simplifica3on Producer Write N alarms/s Alarms Consumer Project & Filter Dis3nct Compute Histogram Predict True / False Alarms Capped collec3on 43

44 SAVE Benchmark Iden3fying Bo_lenecks ZHAW Datalab 3 44

45 Results (2) 30,000 25,000 20,000 15,000 10,000 5,000 0 Producer Max throughput (alarms/s) 30,000 25,000 20,000 15,000 10,000 5, Max throughput (alarms/s) Note: Producer slow, but previous max consumer throughput! Consumer: ~ 50% speedup. 45

How much will fit in 10s? 12 Breakdown of /me / subtask 10 8 6 7 4 2 0 0.02 2.9 Kata + Jackson (3.

46 How much will fit in 10s? 12 Breakdown of /me / subtask Kata + Jackson (3.5 K alarms) Kata + GSON (3.5 K alarms) Mongo + GSON (3.5 K alarms) Streaming Historic Machine Learning 46

47 Root cause? Using Kata direct stream with a single par33on => poor parallelism Most opera3ons run on a single executor Even though Spark is configured to run on 8 cores Using MongoDB => finer grained control over parallelism level for the RDD Read an array of documents from tailable cursor Parallelize array into RDD & run ML on each part 47

48 FIX IT! Configuring the Kata Direct Stream in Spark with proper seungs. (num par33ons = num cores) 48

49 Results (3) 30,000 25,000 20,000 15,000 10,000 5,000 0 Producer Max throughput (alarms/s) 30,000 25,000 20,000 15,000 10,000 5, Max throughput (alarms/s) 49

50 Designing the consumer Spark - A uniﬁed engine for Big Data Processing Streaming SQL Machine Learning Easy to use tools can some/mes be tricky to get right DataFrames DataSets RDDs Graph Processing Kryo Gson Jackson 50

Conclusions & Lessons Learned Machine learning algorithm of Spark can be directly applied Works well for small and big data No re-writing necessary to make algorithm scale across computing cluster

51 Conclusions & Lessons Learned Machine learning algorithm of Spark can be directly applied Works well for small and big data No re-writing necessary to make algorithm scale across computing cluster Spark Streaming can t be used out of the box for stream and batch processing: Need to use a persistency layer Requires significant performance tuning Working with our industry partner Sitasys: Great collaboration Working with real-world problem is very rewarding Can make real contribution and enhance existing algorithms and technology 51

52 Appendix 52

53 Pipeline Example Code (Java) // Create string indexer Transformer. StringIndexer stringindexer = new StringIndexer().setInputCol("loca/on").setOutputCol("indexLoca/on"); // Create random forest EsMmator. RandomForestClassifier randomforestclf = new RandomForestClassifier().setFeaturesCol("indexLoca/on").setLabelCol("label"); // Create pipeline EsMmator (specifying the ML workflow). Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{stringIndexer, randomforestclf}); // Create cross validamon EsMmator (wrapping the pipeline EsMmator). ParamMap[] paramgrid = new ParamGridBuilder().addGrid(randomForestClf.numTrees()(), new int[]{10, 20, 30}).addGrid(randomForestClf.maxDepth(), new int[]{5, 25, 50}).build(); CrossValidator crossvalidator = new CrossValidator().setEs3mator(pipeline).setEs3matorParamMaps(paramGrid).setEvaluator(new BinaryClassifica3onEvaluator()).setNumFolds(10); 53

54 Training, Tes3ng and Evalua3on Example Code (Java) // Given (creamon of dataset no included in example code) Dataset<Row> trainset = ; Dataset<Row> testset = ; // Fit the cross validator to training documents. CrossValidatorModel model = crossvalidator.fit(trainset); // Make predicmons on test documents. Dataset<Row> predic3ons = model.transform(testset); // Create evaluator. Mul3classClassifica3onEvaluator evaluator = new Mul3classClassifica3onEvaluator().setLabelCol("label").setPredic3onCol("predic/on"); // Evaluate predicmons. evaluator.setmetricname("accuracy"); double accuracy = evaluator.evaluate(predic3ons); 54

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015 About Secret Word from EPAM itsubbotnik Big Data Training 3 Contacts