Data science for the datacenter: processing logs with Apache Spark. William Benton Red Hat Emerging Technology

Size: px
Start display at page:

Download "Data science for the datacenter: processing logs with Apache Spark. William Benton Red Hat Emerging Technology"

Transcription

1 Data science for the datacenter: processing logs with Apache Spark William Benton Red Hat Emerging Technology

2 BACKGROUND

3 Challenges of log data

4 Challenges of log data SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='crit' AND msg LIKE '%failure%' GROUP BY hostname, hour

5 Challenges of log data SELECT hostname, DATEPART(HH, timestamp) AS hour, COUNT(msg) FROM LOGS WHERE level='crit' AND msg LIKE '%failure%' GROUP BY hostname, hour 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00

6 Challenges of log data postgres WARN CRIT DEBUG httpd GET GET GET POST GET (404) syslog WARN WARN (ca. 2000)

7 Challenges of log data postgres WARN CRIT Cassandra CRIT httpd GET GET GET POST nginx GET POST PUT POST syslog WARN Rails WARN CouchDB CRIT redis CRIT httpd GET GET (404) POST httpd PUT (500) GET PUT 2016) Django WARN syslog (ca. WARN haproxy k8s WARN DEBUG CRIT WARN WARN

8 Challenges of log data postgres WARN CRIT Cassandra CRIT postgres WARN CRIT Cassandra CRIT httpd GET GET GET POST nginx GET POST PUT POST httpd GET GET GET POST nginx GET POST PUT POST syslog WARN Rails WARN syslog WARN Rails WARN CouchDB CRIT redis CRIT CouchDB CRIT redis CRIT httpd GET GET (404) POST httpd PUT (500) GET PUT httpd GET GET (404) POST httpd PUT (500) GET PUT Django WARN syslog WARN Django WARN syslog WARN haproxy postgres httpd syslog k8s WARN CRIT GET GET GET POST WARN WARN DEBUG CRIT WARN WARN Cassandra nginx Rails CRIT GET POST PUT POST WARN haproxy postgres httpd syslog k8s WARN CRIT GET GET GET POST WARN WARN DEBUG CRIT WARN WARN Cassandra nginx Rails CRIT GET POST PUT POST WARN How many services are CouchDB httpd GET CRIT GET (404) POST redis CRIT httpd PUT (500) GET PUT CouchDB httpd GET CRIT GET (404) POST redis CRIT httpd PUT (500) GET PUT generating logs in your Django WARN syslog WARN Django WARN syslog WARN haproxy WARN DEBUG CRIT haproxy WARN DEBUG CRIT k8s WARN WARN k8s WARN WARN datacenter today? postgres WARN CRIT Cassandra CRIT postgres WARN CRIT Cassandra CRIT httpd GET GET GET POST nginx GET POST PUT POST httpd GET GET GET POST nginx GET POST PUT POST syslog WARN Rails WARN syslog WARN Rails WARN CouchDB CRIT redis CRIT CouchDB CRIT redis CRIT httpd GET GET (404) POST httpd PUT (500) GET PUT httpd GET GET (404) POST httpd PUT (500) GET PUT Django WARN syslog WARN Django WARN syslog WARN haproxy WARN DEBUG CRIT haproxy WARN DEBUG CRIT k8s WARN WARN k8s WARN WARN

9 Apache Spark Graph Query ML Streaming Spark core (distributed collections, scheduler) ad hoc Mesos YARN

10 DATA INGEST

11 Collecting log data collecting Ingesting live log data via rsyslog, logstash, fluentd

12 Collecting log data collecting normalizing Ingesting live log data via rsyslog, logstash, fluentd Reconciling log record metadata across sources

13 Collecting log data collecting normalizing warehousing Ingesting live log data via rsyslog, logstash, fluentd Reconciling log record metadata across sources Storing normalized records in ES indices

14 Collecting log data warehousing Storing normalized records in ES indices

15 Collecting log data warehousing analysis Storing normalized records in ES indices cache warehoused data as Parquet files on Gluster volume local to Spark cluster

16 Schema mediation

17 Schema mediation

18 Schema mediation

19 Schema mediation timestamp, level, host, IP addresses, message, &c. rsyslog-style metadata, like app name, facility, &c.

20 Structured queries at scale logs.select("level").distinct.as[string].collect

21 Structured queries at scale logs.select("level").distinct.as[string].collect debug, notice, emerg, err, warning, crit, info, severe, alert

22 Structured queries at scale logs.select("level").distinct.as[string].collect debug, notice, emerg, err, warning, crit, info, severe, alert logs.groupby($"level", $"rsyslog.app_name").agg(count("level").as("total")).orderby($"total".desc)

23 Structured queries at scale logs.select("level").distinct.as[string].collect debug, notice, emerg, err, warning, crit, info, severe, alert logs.groupby($"level", $"rsyslog.app_name").agg(count("level").as("total")).orderby($"total".desc) kubelet kube-proxy ERR journal systemd

24 FEATURE ENGINEERING

25 From log records to vectors What does it mean for two log records to be similar?

26 From log records to vectors What does it mean for two log records to be similar? red green blue orange

27 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001

28 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns

29 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> > > > > > 00010

30 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> > > > > > red pancakes orange waffles

31 From log records to vectors What does it mean for two log records to be similar? red green blue orange -> 000 -> 010 -> 100 -> 001 pancakes waffles aebliskiver omelets bacon hash browns -> > > > > > red pancakes orange waffles -> >

32 From log records to vectors What does it mean for two log records to be similar?

33 From log records to vectors What does it mean for two log records to be similar? {level:, hostname: fred, timestamp: " :00:01",...} {level: DEBUG, hostname: barney, timestamp: " :00:02",...} {level:, hostname: fred, timestamp: " :00:03",...} {level: ERR, hostname: barney, timestamp: " :00:03",...} {level: ALERT, hostname: wilma, timestamp: " :00:03",...}

34 From log records to vectors What does it mean for two log records to be similar? {level:, hostname: fred, timestamp: " :00:01",...} {level: DEBUG, hostname: barney, timestamp: " :00:02",...} {level:, hostname: fred, timestamp: " :00:03",...} {level: ERR, hostname: barney, timestamp: " :00:03",...} {level: ALERT, hostname: wilma, timestamp: " :00:03",...} -> > > > >

35 Similarity and distance

36 Similarity and distance

37 Similarity and distance (q - p) (q - p)

38 Similarity and distance n p i - q i i=1

39 Similarity and distance n p i - q i i=1

40 Similarity and distance p q p q

41 Similarity and distance

42 Similarity and distance

43 Similarity and distance

44 Similarity and distance =.7

45 Other interesting features host01 WARN WAR host02 WARN DEBUG host03 WARN WARN

46 Other interesting features host01 WARN host02 DEBUG WARN DEBUG host03 FO WARN

47 Other interesting features host01 DEBUG WARN BUG host02 WARN host03 WARN WARN WARN WARN

48 Other interesting features : Everything is great! Just checking in to let you know I m OK.

49 Other interesting features : Everything is great! Just checking in to let you know I m OK. CRIT: No requests in last hour; suspending running app containers.

50 Other interesting features : Everything is great! Just checking in to let you know I m OK. CRIT: No requests in last hour; suspending running app containers. : Phoenix datacenter is on fire; may not rise from ashes.

51 NATURAL LANGUAGE and LOG DATA

52 Introducing word2vec Madrid Spain France

53 Introducing word2vec Madrid Spain France

54 Introducing word2vec Madrid Spain France

55 Introducing word2vec Madrid Spain France

56 Introducing word2vec Madrid Spain France

57 Introducing word2vec Madrid Spain Paris France v("madrid") - v("spain") + v("france") v("paris")

58 Preprocessing text What is a word? How are log words different from words found in conventional prose?

59 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot

60 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot systemd etcd ZooKeeper

61 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot systemd /dev/null etcd ZooKeeper

62 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot systemd /dev/null OutOfMemoryError Kernel::exec etcd ZooKeeper

63 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot ENOENT systemd OPEN_MAX /dev/null OutOfMemoryError Kernel::exec etcd ZooKeeper

64 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot ENOENT systemd /dev/null OPEN_MAX OutOfMemoryError Kernel::exec etcd ZooKeeper

65 Preprocessing text What is a word? How are log words different from words found in conventional prose? Apache Oozie Java HotSpot ENOENT QEMU systemd /dev/null OPEN_MAX OutOfMemoryError Kernel::exec etcd ZooKeeper

66 // assume messages is an RDD of log message texts val oneletter = new scala.util.matching.regex(".*([a-za-z]).*") def tokens(s: String, post: String=>String): Seq[String] = collapsewhitespace(s).split(" ").map(s => post(strippunctuation(s))).collect { case oneletter(_) => token } val tokenseqs = messages.map(line => tokens(line), identity[string]) val w2v = new org.apache.spark.mllib.feature.word2vec val model = w2v.fit(tokenseqs)

67 Insights from log messages

68 Insights from log messages nova

69 Insights from log messages containers glance nova vm images instances

70 Insights from log messages password containers glance nova vm images instances

71 Insights from log messages opened password publickey accepted containers glance nova IPMI instances vm images

72 Insights from log messages opened password publickey accepted containers glance nova IPMI instances vm images sh

73 Insights from log messages opened password publickey accepted containers glance nova IPMI vm images instances AUDIT_SESSION /usr/bin/bash sh

74 VISUALIZING STRUCTURE and FINDING OUTLIERS

75 Multidimensional data

76 Multidimensional data [4,7]

77 Multidimensional data [4,7]

78 Multidimensional data [4,7] [2,3,5]

79 Multidimensional data [4,7] [2,3,5]

80 Multidimensional data [7,1,6,5,12, [4,7] [2,3,5] 8,9,2,2,4, 7,11,6,1,5]

81 Multidimensional data [7,1,6,5,12, [4,7] [2,3,5] 8,9,2,2,4, 7,11,6,1,5]

82 A linear approach: PCA

83 A linear approach: PCA

84

85 A nonlinear approach: t-sne

86 A nonlinear approach: t-sne p( )

87 A nonlinear approach: t-sne p( ) p( )

88

89 Tree-based approaches

90 Tree-based approaches if red yes if orange if!red no if!gray yes if!orange if!gray no

91 Tree-based approaches yes yes yes yes no no no no no yes no yes if red yes yes no yes no yes yes yes yes if orange if!red no no no no yes no no no yes yes no yes no if!gray yes yes no yes no yes no yes no if!orange no yes no yes yes no yes no no if!gray

92 Tree-based approaches yes yes yes yes no no no no no yes no yes if red yes yes no yes no yes yes yes yes if orange if!red no no no no yes no no no yes yes no yes no if!gray yes yes no yes no yes no yes no if!orange no yes no yes yes no yes no no if!gray

93 Self-organizing maps

94 Self-organizing maps

95 Training SOMs

96 Training SOMs

97 Training SOMs

98 Training SOMs

99 Self-organizing maps

100 Self-organizing maps

101 Self-organizing maps

102 Self-organizing maps

103 Self-organizing maps

104 Self-organizing maps

105 Self-organizing maps

106 Self-organizing maps

107 Self-organizing maps

108 Finding outliers with SOMs

109 Finding outliers with SOMs

110 Finding outliers with SOMs

111 Finding outliers with SOMs

112 Finding outliers with SOMs

113

114 Out of 310 million log records, we identified % as outliers.

115

116

117 Thirty most extreme outliers 10 Can not communicate with power supply 2. 9 Power supply 2 failed. 8 Power supply redundancy is lost. 1 Drive A is removed. 1 Can not communicate with power supply 1. 1 Power supply 1 failed.

118 LESSONS LEARNED and KEY TAKEAWAYS

119 Spark and ElasticSearch Data locality is an issue and caching is even more important than when running from local storage. The best practice is to warehouse ES indices to Parquet files and write Spark jobs against them.

120 Structured queries in Spark Always program defensively: mediate schemas, explicitly convert null values, etc. Use the Dataset API whenever possible to minimize boilerplate and benefit from query planning without (entirely) forsaking type safety.

121 Natural language tips Conventional preprocessing pipelines probably won t get you the best results on log message texts. Consider the sorts of words and word variants you ll want to recognize with your preprocessing pipeline. Use an approximate whitelist for unusual words.

122 Memory and partitioning Large JVM heaps can lead to appalling GC pauses. You probably won t be able to use all of your memory in a single JVM without executor timeouts. Consider memory budget at the driver when choosing a partitioning strategy.

123 Feature engineering Design your feature engineering pipeline so you can translate feature vectors back to factor values. Favor feature engineering effort over complex or novel learning algorithms. Prefer approaches that train interpretable models.

124

Some things you learn running Apache Spark in production for three years. William Benton Red Hat, Inc.

Some things you learn running Apache Spark in production for three years. William Benton Red Hat, Inc. Some things you learn running Apache Spark in production for three years William Benton (@willb) Red Hat, Inc. Some things you learn running Apache Spark in production for three years William Benton (@willb)

More information

Principal Software Engineer Red Hat Emerging Technology June 24, 2015

Principal Software Engineer Red Hat Emerging Technology June 24, 2015 USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton Principal Software Engineer Red Hat Emerging Technology June 24, 2015 ABOUT ME Distributed systems and data science in Red Hat's Emerging

More information

CONTAINERIZED SPARK ON KUBERNETES. William Benton Red Hat,

CONTAINERIZED SPARK ON KUBERNETES. William Benton Red Hat, CONTAINERIZED SPARK ON KUBERNETES William Benton Red Hat, Inc. @willb willb@redhat.com BACKGROUND BACKGROUND BACKGROUND BACKGROUND BACKGROUND BACKGROUND BACKGROUND BACKGROUND WHAT OUR SPARK CLUSTER LOOKED

More information

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development:: Title Duration : Apache Spark Development : 4 days Overview Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized

More information

An Introduction to Apache Spark Big Data Madison: 29 July William Red Hat, Inc.

An Introduction to Apache Spark Big Data Madison: 29 July William Red Hat, Inc. An Introduction to Apache Spark Big Data Madison: 29 July 2014 William Benton @willb Red Hat, Inc. About me At Red Hat for almost 6 years, working on distributed computing Currently contributing to Spark,

More information

Processing of big data with Apache Spark

Processing of big data with Apache Spark Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT

More information

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423

More information

Big Data Hadoop Stack

Big Data Hadoop Stack Big Data Hadoop Stack Lecture #1 Hadoop Beginnings What is Hadoop? Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware

More information

Accelerating Spark Workloads using GPUs

Accelerating Spark Workloads using GPUs Accelerating Spark Workloads using GPUs Rajesh Bordawekar, Minsik Cho, Wei Tan, Benjamin Herta, Vladimir Zolotov, Alexei Lvov, Liana Fong, and David Kung IBM T. J. Watson Research Center 1 Outline Spark

More information

The Art of Container Monitoring. Derek Chen

The Art of Container Monitoring. Derek Chen The Art of Container Monitoring Derek Chen 2016.9.22 About me DevOps Engineer at Trend Micro Agile transformation Micro service and cloud service Docker integration Monitoring system development Automate

More information

Spark Overview. Professor Sasu Tarkoma.

Spark Overview. Professor Sasu Tarkoma. Spark Overview 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Apache Spark Spark is a general-purpose computing framework for iterative tasks API is provided for Java, Scala and Python The model is based

More information

Stages of Data Processing

Stages of Data Processing Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,

More information

Big data systems 12/8/17

Big data systems 12/8/17 Big data systems 12/8/17 Today Basic architecture Two levels of scheduling Spark overview Basic architecture Cluster Manager Cluster Cluster Manager 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores

More information

Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source

Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source Apache Ignite TM - In- Memory Data Fabric Fast Data Meets Open Source DMITRIY SETRAKYAN Founder, PPMC https://ignite.apache.org @apacheignite @dsetrakyan Agenda About In- Memory Computing Apache Ignite

More information

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals

More information

Turning Relational Database Tables into Spark Data Sources

Turning Relational Database Tables into Spark Data Sources Turning Relational Database Tables into Spark Data Sources Kuassi Mensah Jean de Lavarene Director Product Mgmt Director Development Server Technologies October 04, 2017 3 Safe Harbor Statement The following

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

BUILDING CLOUD NATIVE APACHE SPARK APPLICATIONS WITH OPENSHIFT. Michael McCune 11 January 2017

BUILDING CLOUD NATIVE APACHE SPARK APPLICATIONS WITH OPENSHIFT. Michael McCune 11 January 2017 BUILDING CLOUD NATIVE APACHE SPARK APPLICATIONS WITH OPENSHIFT Michael McCune 11 January 2017 1 INTRODUCTION A little about me Embedded to Orchestration Red Hat emerging technologies OpenStack Sahara Oshinko

More information

Big Data Analytics. Description:

Big Data Analytics. Description: Big Data Analytics Description: With the advance of IT storage, pcoressing, computation, and sensing technologies, Big Data has become a novel norm of life. Only until recently, computers are able to capture

More information

Apache Spark 2.0. Matei

Apache Spark 2.0. Matei Apache Spark 2.0 Matei Zaharia @matei_zaharia What is Apache Spark? Open source data processing engine for clusters Generalizes MapReduce model Rich set of APIs and libraries In Scala, Java, Python and

More information

Application monitoring with BELK. Nishant Sahay, Sr. Architect Bhavani Ananth, Architect

Application monitoring with BELK. Nishant Sahay, Sr. Architect Bhavani Ananth, Architect Application monitoring with BELK Nishant Sahay, Sr. Architect Bhavani Ananth, Architect Why logs Business PoV Input Data Analytics User Interactions /Behavior End user Experience/ Improvements 2017 Wipro

More information

Big Data Architect.

Big Data Architect. Big Data Architect www.austech.edu.au WHAT IS BIG DATA ARCHITECT? A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional

More information

@joerg_schad Nightmares of a Container Orchestration System

@joerg_schad Nightmares of a Container Orchestration System @joerg_schad Nightmares of a Container Orchestration System 2017 Mesosphere, Inc. All Rights Reserved. 1 Jörg Schad Distributed Systems Engineer @joerg_schad Jan Repnak Support Engineer/ Solution Architect

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

Databases and Big Data Today. CS634 Class 22

Databases and Big Data Today. CS634 Class 22 Databases and Big Data Today CS634 Class 22 Current types of Databases SQL using relational tables: still very important! NoSQL, i.e., not using relational tables: term NoSQL popular since about 2007.

More information

BIG DATA COURSE CONTENT

BIG DATA COURSE CONTENT BIG DATA COURSE CONTENT [I] Get Started with Big Data Microsoft Professional Orientation: Big Data Duration: 12 hrs Course Content: Introduction Course Introduction Data Fundamentals Introduction to Data

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

microsoft

microsoft 70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series

More information

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015 About Secret Word from EPAM itsubbotnik Big Data Training 3 Contacts

More information

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Big data analytics / machine learning 6+ years

More information

Ingest. David Pilato, Developer Evangelist Paris, 31 Janvier 2017

Ingest. David Pilato, Developer Evangelist Paris, 31 Janvier 2017 Ingest David Pilato, Developer Evangelist Paris, 31 Janvier 2017 Data Ingestion The process of collecting and importing data for immediate use in a datastore 2 ? Simple things should be simple. Shay Banon

More information

A day in the life of a log message Kyle Liberti, Josef

A day in the life of a log message Kyle Liberti, Josef A day in the life of a log message Kyle Liberti, Josef Karasek @Pepe_CZ Order is vital for scale Abstractions make systems manageable Problems of Distributed Systems Reliability Data throughput Latency

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory

More information

TRASH DAY: COORDINATING GARBAGE COLLECTION IN DISTRIBUTED SYSTEMS

TRASH DAY: COORDINATING GARBAGE COLLECTION IN DISTRIBUTED SYSTEMS TRASH DAY: COORDINATING GARBAGE COLLECTION IN DISTRIBUTED SYSTEMS Martin Maas* Tim Harris KrsteAsanovic* John Kubiatowicz* *University of California, Berkeley Oracle Labs, Cambridge Why you should care

More information

An Introduction to Apache Spark

An Introduction to Apache Spark An Introduction to Apache Spark 1 History Developed in 2009 at UC Berkeley AMPLab. Open sourced in 2010. Spark becomes one of the largest big-data projects with more 400 contributors in 50+ organizations

More information

Hadoop Development Introduction

Hadoop Development Introduction Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand

More information

Parallelism with the PDI AEL Spark Engine. Jonathan Jarvis Pentaho Senior Solution Engineer, Hitachi Vantara

Parallelism with the PDI AEL Spark Engine. Jonathan Jarvis Pentaho Senior Solution Engineer, Hitachi Vantara Parallelism with the PDI AEL Spark Engine Jonathan Jarvis Pentaho Senior Solution Engineer, Hitachi Vantara Agenda Why AEL Spark? Kettle Engine Spark 101 PDI on Spark Takeaways Why AEL Spark? Spark has

More information

Kafka pours and Spark resolves! Alexey Zinovyev, Java/BigData Trainer in EPAM

Kafka pours and Spark resolves! Alexey Zinovyev, Java/BigData Trainer in EPAM Kafka pours and Spark resolves! Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With Spark since 2014 With EPAM since 2015 About Contacts E-mail

More information

Kafka Connect the Dots

Kafka Connect the Dots Kafka Connect the Dots Building Oracle Change Data Capture Pipelines With Kafka Mike Donovan CTO Dbvisit Software Mike Donovan Chief Technology Officer, Dbvisit Software Multi-platform DBA, (Oracle, MSSQL..)

More information

Rethinking monitoring with Prometheus

Rethinking monitoring with Prometheus Rethinking monitoring with Prometheus Martín Ferrari Štefan Šafár http://tincho.org @som_zlo Who is Prometheus? A dude who stole fire from Mt. Olympus and gave it to humanity http://prometheus.io/ What

More information

COSC 6339 Big Data Analytics. Introduction to Spark. Edgar Gabriel Fall What is SPARK?

COSC 6339 Big Data Analytics. Introduction to Spark. Edgar Gabriel Fall What is SPARK? COSC 6339 Big Data Analytics Introduction to Spark Edgar Gabriel Fall 2018 What is SPARK? In-Memory Cluster Computing for Big Data Applications Fixes the weaknesses of MapReduce Iterative applications

More information

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours)

Hadoop. Course Duration: 25 days (60 hours duration). Bigdata Fundamentals. Day1: (2hours) Bigdata Fundamentals Day1: (2hours) 1. Understanding BigData. a. What is Big Data? b. Big-Data characteristics. c. Challenges with the traditional Data Base Systems and Distributed Systems. 2. Distributions:

More information

Apache Ignite - Using a Memory Grid for Heterogeneous Computation Frameworks A Use Case Guided Explanation. Chris Herrera Hashmap

Apache Ignite - Using a Memory Grid for Heterogeneous Computation Frameworks A Use Case Guided Explanation. Chris Herrera Hashmap Apache Ignite - Using a Memory Grid for Heterogeneous Computation Frameworks A Use Case Guided Explanation Chris Herrera Hashmap Topics Who - Key Hashmap Team Members The Use Case - Our Need for a Memory

More information

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data

More information

Container Orchestration on Amazon Web Services. Arun

Container Orchestration on Amazon Web Services. Arun Container Orchestration on Amazon Web Services Arun Gupta, @arungupta Docker Workflow Development using Docker Docker Community Edition Docker for Mac/Windows/Linux Monthly edge and quarterly stable

More information

WHITE PAPER. Apache Spark: RDD, DataFrame and Dataset. API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3

WHITE PAPER. Apache Spark: RDD, DataFrame and Dataset. API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3 WHITE PAPER Apache Spark: RDD, DataFrame and Dataset API comparison and Performance Benchmark in Spark 2.1 and Spark 1.6.3 Prepared by: Eyal Edelman, Big Data Practice Lead Michael Birch, Big Data and

More information

Hortonworks Data Platform

Hortonworks Data Platform Hortonworks Data Platform Workflow Management (August 31, 2017) docs.hortonworks.com Hortonworks Data Platform: Workflow Management Copyright 2012-2017 Hortonworks, Inc. Some rights reserved. The Hortonworks

More information

Technical Sheet NITRODB Time-Series Database

Technical Sheet NITRODB Time-Series Database Technical Sheet NITRODB Time-Series Database 10X Performance, 1/10th the Cost INTRODUCTION "#$#!%&''$!! NITRODB is an Apache Spark Based Time Series Database built to store and analyze 100s of terabytes

More information

Using Apache Spark for generating ElasticSearch indices offline

Using Apache Spark for generating ElasticSearch indices offline Using Apache Spark for generating ElasticSearch indices offline Andrej Babolčai ESET Database systems engineer Apache: Big Data Europe 2016 Who am I Software engineer in database systems team Responsible

More information

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Using the SDACK Architecture to Build a Big Data Product Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Outline A Threat Analytic Big Data product The SDACK Architecture Akka Streams and data

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

CSE 444: Database Internals. Lecture 23 Spark

CSE 444: Database Internals. Lecture 23 Spark CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei

More information

Ingest. Aaron Mildenstein, Consulting Architect Tokyo Dec 14, 2017

Ingest. Aaron Mildenstein, Consulting Architect Tokyo Dec 14, 2017 Ingest Aaron Mildenstein, Consulting Architect Tokyo Dec 14, 2017 Data Ingestion The process of collecting and importing data for immediate use 2 ? Simple things should be simple. Shay Banon Elastic{ON}

More information

DATA SCIENCE USING SPARK: AN INTRODUCTION

DATA SCIENCE USING SPARK: AN INTRODUCTION DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data

More information

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training:: Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional

More information

E l a s t i c s e a r c h F e a t u r e s. Contents

E l a s t i c s e a r c h F e a t u r e s. Contents Elasticsearch Features A n Overview Contents Introduction... 2 Location Based Search... 2 Search Social Media(Twitter) data from Elasticsearch... 4 Query Boosting in Elasticsearch... 4 Machine Learning

More information

An Introduction to Apache Spark

An Introduction to Apache Spark An Introduction to Apache Spark Anastasios Skarlatidis @anskarl Software Engineer/Researcher IIT, NCSR "Demokritos" Outline Part I: Getting to know Spark Part II: Basic programming Part III: Spark under

More information

Hadoop course content

Hadoop course content course content COURSE DETAILS 1. In-detail explanation on the concepts of HDFS & MapReduce frameworks 2. What is 2.X Architecture & How to set up Cluster 3. How to write complex MapReduce Programs 4. In-detail

More information

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale

More information

Agenda. Spark Platform Spark Core Spark Extensions Using Apache Spark

Agenda. Spark Platform Spark Core Spark Extensions Using Apache Spark Agenda Spark Platform Spark Core Spark Extensions Using Apache Spark About me Vitalii Bondarenko Data Platform Competency Manager Eleks www.eleks.com 20 years in software development 9+ years of developing

More information

100% Containers Powered Carpooling

100% Containers Powered Carpooling 100% Containers Powered Carpooling Maxime Fouilleul Database Reliability Engineer BlaBlaCar - Facts & Figures Today s agenda Infrastructure Ecosystem - 100% containers powered carpooling Stateful Services

More information

Cloud Computing & Visualization

Cloud Computing & Visualization Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International

More information

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data Introduction to Hadoop High Availability Scaling Advantages and Challenges Introduction to Big Data What is Big data Big Data opportunities Big Data Challenges Characteristics of Big data Introduction

More information

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX

SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX THE MISSING PIECE IN COMPLEX ANALYTICS: SCALABLE, LOW LATENCY MODEL SERVING AND MANAGEMENT WITH VELOX Daniel Crankshaw, Peter Bailis, Joseph Gonzalez, Haoyuan Li, Zhao Zhang, Ali Ghodsi, Michael Franklin,

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

The Many Faces Of Apache Ignite. David Robinson, Software Engineer May 13, 2016

The Many Faces Of Apache Ignite. David Robinson, Software Engineer May 13, 2016 The Many Faces Of Apache Ignite David Robinson, Software Engineer May 13, 2016 A Face In elementary geometry, a face is a two-dimensional polygon on the boundary of a polyhedron. 2 Attribution:Robert Webb's

More information

Seagull: A distributed, fault tolerant, concurrent task runner. Sagar Patwardhan

Seagull: A distributed, fault tolerant, concurrent task runner. Sagar Patwardhan Seagull: A distributed, fault tolerant, concurrent task runner Sagar Patwardhan sagarp@yelp.com Yelp s Mission Connecting people with great local businesses. Yelp scale Outline What is Seagull? Why did

More information

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch Nick Pentreath Nov / 14 / 16 Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch About @MLnick Principal Engineer, IBM Apache Spark PMC Focused on machine learning

More information

Big Data Analytics using Apache Hadoop and Spark with Scala

Big Data Analytics using Apache Hadoop and Spark with Scala Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data

More information

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem Zohar Elkayam www.realdbamagic.com Twitter: @realmgic Who am I? Zohar Elkayam, CTO at Brillix Programmer, DBA, team leader, database trainer,

More information

SCALING LIKE TWITTER WITH APACHE MESOS

SCALING LIKE TWITTER WITH APACHE MESOS Philip Norman & Sunil Shah SCALING LIKE TWITTER WITH APACHE MESOS 1 MODERN INFRASTRUCTURE Dan the Datacenter Operator Alice the Application Developer Doesn t sleep very well Loves automation Wants to control

More information

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation Analytic Cloud with Shelly Garion IBM Research -- Haifa 2014 IBM Corporation Why Spark? Apache Spark is a fast and general open-source cluster computing engine for big data processing Speed: Spark is capable

More information

Hadoop. Introduction / Overview

Hadoop. Introduction / Overview Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures

More information

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin,

More information

Distributed Systems. 22. Spark. Paul Krzyzanowski. Rutgers University. Fall 2016

Distributed Systems. 22. Spark. Paul Krzyzanowski. Rutgers University. Fall 2016 Distributed Systems 22. Spark Paul Krzyzanowski Rutgers University Fall 2016 November 26, 2016 2015-2016 Paul Krzyzanowski 1 Apache Spark Goal: generalize MapReduce Similar shard-and-gather approach to

More information

Evolving & Supporting Stateful, Multi-Tenant Decisioning Applications in Production. B. Frazier, K. Gasser & G. Mead, Software Engineers, Capital One

Evolving & Supporting Stateful, Multi-Tenant Decisioning Applications in Production. B. Frazier, K. Gasser & G. Mead, Software Engineers, Capital One Evolving & Supporting Stateful, Multi-Tenant Decisioning Applications in Production B. Frazier, K. Gasser & G. Mead, Software Engineers, Capital One Agenda Intro (Keith) Cluster Installation and Operations:

More information

Scaling Instagram. AirBnB Tech Talk 2012 Mike Krieger Instagram

Scaling Instagram. AirBnB Tech Talk 2012 Mike Krieger Instagram Scaling Instagram AirBnB Tech Talk 2012 Mike Krieger Instagram me - Co-founder, Instagram - Previously: UX & Front-end @ Meebo - Stanford HCI BS/MS - @mikeyk on everything communicating and sharing

More information

Red Hat OpenStack Platform 12

Red Hat OpenStack Platform 12 Red Hat OpenStack Platform 12 Monitoring Tools Configuration Guide A guide to OpenStack logging and monitoring tools Last Updated: 2018-05-24 Red Hat OpenStack Platform 12 Monitoring Tools Configuration

More information

(incubating) Introduction. Swapnil Bawaskar.

(incubating) Introduction. Swapnil Bawaskar. (incubating) Introduction William Markito @william_markito Swapnil Bawaskar @sbawaskar Agenda Introduction What? Who? Why? How? DEBS Roadmap Q&A 2 3 Introduction Introduction A distributed, memory-based

More information

Performance and Scalability Overview

Performance and Scalability Overview Performance and Scalability Overview This guide provides an overview of some of the performance and scalability capabilities of the Pentaho Business Anlytics platform PENTAHO PERFORMANCE ENGINEERING TEAM

More information

Cloudera Kudu Introduction

Cloudera Kudu Introduction Cloudera Kudu Introduction Zbigniew Baranowski Based on: http://slideshare.net/cloudera/kudu-new-hadoop-storage-for-fast-analytics-onfast-data What is KUDU? New storage engine for structured data (tables)

More information

Spark, Shark and Spark Streaming Introduction

Spark, Shark and Spark Streaming Introduction Spark, Shark and Spark Streaming Introduction Tushar Kale tusharkale@in.ibm.com June 2015 This Talk Introduction to Shark, Spark and Spark Streaming Architecture Deployment Methodology Performance References

More information

Red Hat OpenStack Platform 13

Red Hat OpenStack Platform 13 Red Hat OpenStack Platform 13 Monitoring Tools Configuration Guide A guide to OpenStack logging and monitoring tools Last Updated: 2019-01-29 Red Hat OpenStack Platform 13 Monitoring Tools Configuration

More information

Building an on premise Kubernetes cluster DANNY TURNER

Building an on premise Kubernetes cluster DANNY TURNER Building an on premise Kubernetes cluster DANNY TURNER Outline What is K8s? Why (not) run k8s? Why run our own cluster? Building what the public cloud provides 2 Kubernetes Open-Source Container Management

More information

Workload Experience Manager

Workload Experience Manager Workload Experience Manager Important Notice 2010-2018 Cloudera, Inc. All rights reserved. Cloudera, the Cloudera logo, and any other product or service names or slogans contained in this document are

More information

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite. Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite. Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017 Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017 About the Presentation Problems Existing Solutions Denis Magda

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

Apache Storm. Hortonworks Inc Page 1

Apache Storm. Hortonworks Inc Page 1 Apache Storm Page 1 What is Storm? Real time stream processing framework Scalable Up to 1 million tuples per second per node Fault Tolerant Tasks reassigned on failure Guaranteed Processing At least once

More information

Apache Flink Big Data Stream Processing

Apache Flink Big Data Stream Processing Apache Flink Big Data Stream Processing Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de XLDB 11.10.2017 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2017

More information

Monitor your containers with the Elastic Stack. Monica Sarbu

Monitor your containers with the Elastic Stack. Monica Sarbu Monitor your containers with the Elastic Stack Monica Sarbu Monica Sarbu Team lead, Beats team monica@elastic.co 3 Monitor your containers with the Elastic Stack Elastic Stack 5 Beats are lightweight shippers

More information

Creating a Recommender System. An Elasticsearch & Apache Spark approach

Creating a Recommender System. An Elasticsearch & Apache Spark approach Creating a Recommender System An Elasticsearch & Apache Spark approach My Profile SKILLS Álvaro Santos Andrés Big Data & Analytics Solution Architect in Ericsson with more than 12 years of experience focused

More information

Tuning Enterprise Information Catalog Performance

Tuning Enterprise Information Catalog Performance Tuning Enterprise Information Catalog Performance Copyright Informatica LLC 2015, 2018. Informatica and the Informatica logo are trademarks or registered trademarks of Informatica LLC in the United States

More information

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत Department of Computational and Data Sciences L3: Spark & RDD Department of Computational and Data Science, IISc, 2016 This

More information

Think Small to Scale Big

Think Small to Scale Big Think Small to Scale Big Intro to Containers for the Datacenter Admin Pete Zerger Principal Program Manager, MVP pete.zerger@cireson.com Cireson Lee Berg Blog, e-mail address, title Company Pete Zerger

More information

A Tutorial on Apache Spark

A Tutorial on Apache Spark A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:

More information

The Evolution of a Data Project

The Evolution of a Data Project The Evolution of a Data Project The Evolution of a Data Project Python script The Evolution of a Data Project Python script SQL on live DB The Evolution of a Data Project Python script SQL on live DB SQL

More information

Jupyter and Spark on Mesos: Best Practices. June 21 st, 2017

Jupyter and Spark on Mesos: Best Practices. June 21 st, 2017 Jupyter and Spark on Mesos: Best Practices June 2 st, 207 Agenda About me What is Spark & Jupyter Demo How Spark+Mesos+Jupyter work together Experience Q & A About me Graduated from EE @ Tsinghua Univ.

More information