Practical Machine Learning Agenda

Similar documents
Thales PunchPlatform Agenda

Chapter 1 - The Spark Machine Learning Library

MLI - An API for Distributed Machine Learning. Sarang Dev

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

Dr. SubraMANI Paramasivam. Think & Work like a Data Scientist with SQL 2016 & R

Using Existing Numerical Libraries on Spark

software.sci.utah.edu (Select Visitors)

Using Numerical Libraries on Spark

Proven video conference management software for Cisco Meeting Server

SCI - software.sci.utah.edu (Select Visitors)

Streaming iphone sensor data to SAS Event Stream Processing

COURSE LISTING. Courses Listed. Training for Database & Technology with Modeling in SAP HANA. 20 November 2017 (12:10 GMT) Beginner.

Tracking the Internet s BGP Table

Python With Data Science

Docker and HPE Accelerate Digital Transformation to Enable Hybrid IT. Steven Follis Solutions Engineer Docker Inc.

Outrun Your Competition With SAS In-Memory Analytics Sascha Schubert Global Technology Practice, SAS

This report is based on sampled data. Jun 1 Jul 6 Aug 10 Sep 14 Oct 19 Nov 23 Dec 28 Feb 1 Mar 8 Apr 12 May 17 Ju

ADVANCED ANALYTICS USING SAS ENTERPRISE MINER RENS FEENSTRA

Using Machine Learning to Identify Security Issues in Open-Source Libraries. Asankhaya Sharma Yaqin Zhou SourceClear

Scalable Data Science in R and Apache Spark 2.0. Felix Cheung, Principal Engineer, Microsoft

Personalizing Netflix with Streaming datasets

COMP6237 Data Mining Introduction to Data Mining

Event: PASS SQL Saturday - DC 2018 Presenter: Jon Tupitza, CTO Architect

2016 Market Update. Gary Keller and Jay Papasan Keller Williams Realty, Inc.

COURSE LISTING. Courses Listed. Training for Database & Technology with Modeling in SAP HANA. Last updated on: 30 Nov 2018.

Jordan Levesque - Keeping your Business Secure

BUILT FOR THE SPEED OF BUSINESS

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

COPYRIGHT DATASHEET

Creating a Recommender System. An Elasticsearch & Apache Spark approach

Oracle Big Data Connectors

COURSE LISTING. Courses Listed. with HANA Programming. 13 February 2018 (04:51 GMT) HA100 - SAP HANA

Contents. Preface to the Second Edition

COURSE LISTING. Courses Listed. with SAP Hybris Marketing Cloud. 24 January 2018 (23:53 GMT) HY760 - SAP Hybris Marketing Cloud

Specialist ICT Learning

Decision Making Information from Your Mobile Device with Today's Rockwell Software

CPSC 340: Machine Learning and Data Mining. Principal Component Analysis Fall 2016

Dan Lobb CRISC Lisa Gable CISM Katie Friebus

Introducing Microsoft SQL Server 2016 R Services. Julian Lee Advanced Analytics Lead Global Black Belt Asia Timezone

Deploying, Managing and Reusing R Models in an Enterprise Environment

Scaling on one node Hybrid engines with Multi-GPU on In-Memory database queries

Advanced Data Modeling: Be Happier, Add More Value and Be More Valued

VxRail: Level Up with New Capabilities and Powers GLOBAL SPONSORS

Control Center Release Notes

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Deep Learning & Accelerating the NLP Journey in the Unstructured World

DISPLAYPORT TECHNOLOGY UPDATE. Jim Choate VESA Compliance Program Manager June 15, 2017

Data Science Bootcamp Curriculum. NYC Data Science Academy

Imputation for missing observation through Artificial Intelligence. A Heuristic & Machine Learning approach

SIEM Solutions from McAfee

KNIME for the life sciences Cambridge Meetup

Oracle Machine Learning Notebook

Data Sources for Cyber Security Research

Inter-Domain Routing Trends

Distributed Machine Learning" on Spark

Overview and Practical Application of Machine Learning in Pricing

Distributed Computing with Spark and MapReduce

Transforming Transport Infrastructure with GPU- Accelerated Machine Learning Yang Lu and Shaun Howell

Instructor : Dr. Sunnie Chung. Independent Study Spring Pentaho. 1 P a g e

Faster, Better, and Cheaper? Building the SD-WAN Business Case

AI & Machine Learning at Amazon

Instructor training course schedule v3 Confirmed courses due completion by 31 st July 2019

No domain left behind

Fluentd + MongoDB + Spark = Awesome Sauce

Introduction to Data Science. Introduction to Data Science with Python. Python Basics: Basic Syntax, Data Structures. Python Concepts (Core)

Taking Your Application Design to the Next Level with Data Mining

Scalable Machine Learning in R. with H2O

microsoft

e-sens Nordic & Baltic Area Meeting Stockholm April 23rd 2013

Exam Questions

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Gradient Descent. Wed Sept 20th, James McInenrey Adapted from slides by Francisco J. R. Ruiz

Excel Functions & Tables

Putting it all together: Creating a Big Data Analytic Workflow with Spotfire

Machine Learning in WAN Research

Analytics and Visualization

Monitoring Your Operations David Jacob

ACTIVE MICROSOFT CERTIFICATIONS:

Machine Learning in WAN Research

10 Key Things Your VoIP Firewall Should Do. When voice joins applications and data on your network

Opportunities and challenges in personalization of online hotel search

COURSE LISTING. Courses Listed. Training for Cloud with SAP Cloud Platform in Development. 23 November 2017 (08:12 GMT) Beginner.

Distributed Computing with Spark

COURSE LISTING. Courses Listed. Training for Database & Technology with Development in SAP Cloud Platform. 1 December 2017 (22:41 GMT) Beginner

VESA Display Standards Updates

Broadband Rate Design for Public Benefit

Apache Spark and Scala Certification Training

The Power of Prediction: Cloud Bandwidth and Cost Reduction

Computational Databases: Inspirations from Statistical Software. Linnea Passing, Technical University of Munich

SCIENCE. An Introduction to Python Brief History Why Python Where to use

Mahout in Action MANNING ROBIN ANIL SEAN OWEN TED DUNNING ELLEN FRIEDMAN. Shelter Island

Flash Storage Complementing a Data Lake for Real-Time Insight

Think & Work like a Data Scientist with SQL 2016 & R DR. SUBRAMANI PARAMASIVAM (MANI)

EVM & Project Controls Applied to Manufacturing

Why Most IoT Projects Fail And how to ensure success with OSIsoft and Cisco Kinetic

Scaled Machine Learning at Matroid

A Brief Introduction to Data Mining

Templates. for scalable data analysis. 2 Synchronous Templates. Amr Ahmed, Alexander J Smola, Markus Weimer. Yahoo! Research & UC Berkeley & ANU

Introducing Oracle Machine Learning

Scoring Outside the Box Nascif Abousalh-Neto, JMP Principal Software Developer, SAS

Transcription:

Practical Machine Learning Agenda Starting From Log Management Moving To Machine Learning PunchPlatform team Thales Challenges Thanks 1

Starting From Log Management 2

Starting From Log Management Data logs, documents, events, Alerting Batch Processing Y Y Y Y Y Searching Visualising Reporting Data logs, documents, events, Speed Processing - Solve all the operational issues to collect, transport and transform the data - Make it easy for the user to compose its functional pipeline - Provide end-to-end deployment, supervision and monitoring Y 3

Log Management Technical Pipeline Data Processing Raw Data Parsing Normalising Forwarding Indexing Kafka Storm Kafka Storm ElasticSearch Searching Visualising Reporting Alerting - Powered by well-known COTS, user adoption is key - Make it possible to plug in arbitrary processing - No loss of data ever Ceph Long Term Storage 4

End User Experience - Forensics : searching, reporting, aggregating - Real-time, dynamic - Already a data scientist starter tool 5

CyberSecurity Platforms Architecture Boston Paris Sydney Singapore - automatically deployed - yearly updates, with no service interruption - No loss of data ever - 16 production platforms dns/ldap/etc storm zookeeper kafka elasticsearch ceph 6

Data is at hand - 1 year of Indexed and normalised logs in Elasticsearch - 1 year of compressed raw logs in distributed object storage - days (up to a month) of normalised data in Kafka The Punchplatform architecture and connectors make it simple and safe to deploy arbitrary processing on the data. Either batch or real-time (streaming). jan feb mar apr may jun jul aug sep oct nov dec 1-3 months are replicated : Each log is stored on 2 (or more) Elasticsearch servers. 4-12 months are not replicated : but online and indexed in ElasticSearch. 1-12 months are replicated. Each log is stored on 2 (or more) servers. 7

Moving To Machine Learning 8

Live Demo 9

Live Demo Explained Data Processing Raw Data Parsing Normalising Alerting Indexing Kafka Storm Kafka Storm ElasticSearch Searching Visualising Reporting Alerting Data Analytics: Machine Learning Anomaly Detection 10

Challenges 11

Our Process - solve the data access and ownership issue - small, agile, integrated team - well defined process and something like a - clear and shared vision of achievable steps : MVPs Raw Data Data Scientist CyberSecurity Analyst Feature Engineering clean/normalise Train/Test feature Predict model 12

Making (Spark) ML simpler Input Data Scored Data Plan design Fit Transform View Evaluate Alert Fit Every Night on last day of data Transform last hour of Data every half an hour - By configuration, not by code - Leverage data normalisation, off the shelves libraries - Quick to setup and test Model Detection Data 13

Fit Job Example Punch Stage Field selection, enrichment,.. Data Input ElasticSearch last day of firewall logs Spark Stage SparkQL Spark Stage K-Means, Regression, Spark Stage Filter Model Output Save the computed Model - This is defined in a plain Json configuration file 14

Transform Job Example Model Input from Fit Job Punch Stage Field selection, filtering, Live Input Real Time Firewall Logs Spark Stage SparkQL Spark Stage (say) K-Means/Regression/ Spark Stage select fields Data Output Save the scored data to ElasticSearch - This is defined in a plain Json configuration file 15

Leveraging Spark MLib https://spark.apache.org/mllib/ ML Algorithms Classification: logistic regression, naive Bayes,... Regression: generalized linear regression, survival regression,... Decision trees, random forests, gradient-boosted trees Recommendation: alternating least squares (ALS) Clustering: K-means, Gaussian mixtures (GMMs),... Topic modeling: latent Dirichlet allocation (LDA) Frequent itemsets, association rules, sequential pattern mining ML Workflow Feature transformations: standardization, normalization, hashing,... ML Pipeline construction Model evaluation and hyper-parameter tuning ML persistence: saving and loading models and Pipelines Utilities Distributed linear algebra: SVD, PCA,... Statistics: summary statistics, hypothesis testing,... 16

Conclusions When embarking on AI projects you dramatically improve your chances of producing value by : Operating in a build now, learn as you go fashion. Truly sophisticated products are arrived at via iteration and variation; not naive designs steeped in theory; Using nascent discoveries only in the context of a working product; Encouraging Agility from your Data Scientists as much as your developers and product managers; Closing the gap between lab and factory wherever possible, favoring quick and lean solutions that grow more valid with time; Leveraging the machine learning already available in open source tools, only coding from the ground up when absolutely necessary; Passing user feedback into your data pipelines by exposing imperfect models to end users early. (Sean McClure) 17

Thanks! http://doc.punchplatform.com http://punchplatfom.io http://kibana.punchplatform.com