BIG DATA REVOLUTION IN JOBRAPIDO


1 BIG DATA REVOLUTION IN JOBRAPIDO Michele Pinto Big Data Technical Team Jobrapido Big Data Tech 2016 Firenze - October 20, 2016

2 ABOUT ME NAME Michele Pinto LINKEDIN COMPANY WEBSITE

3 WHO WE ARE
Jobrapido is the world's leading job search engine: it analyses and collects all job posts on the web and gives jobseekers all available offers, ordered by relevance to the search they've done.
- VISITORS: 1.0 Bn visits / year
- UNIQUE VISITORS: 35 Mio UVs / month
- SUBSCRIBERS: 70+ Mio subscribed users (current stock)
- PAGEVIEWS / CLICKS*: 280 Mio PVs / month and 130 Mio clicks / month
- JOBS: 20+ Mio jobs at any given time
- WEBSITES IN 58 COUNTRIES
- Head office in Milan + office in Amsterdam
- PEOPLE: 100+
Response / Aggregation / Analysis
* Clicks on job listings (organic + sponsored) and clicks on contextual ads

4

5 MOBILE APP SIGN IN SIGN UP CNT SELECTION MY SEARCHES MY JOBS MENU

6 WHERE WE ARE

7 THE NEED FOR A BIG DATA ARCHITECTURE (1/2)

8 THE NEED FOR A BIG DATA ARCHITECTURE (2/2)
MAIN FEATURES:
- SCALE in terms of throughput and computational power, in line with the data growth rate
- Unify the tracking layer in a single TRACKING PLATFORM
- Place and extract data for analytics into a single DATA LAKE
- REAL-TIME DATA INGESTION in our Data Warehouse
- Drastically REDUCE COMPLEXITY and MAINTENANCE

9 TRACKING PLATFORM

10 WHY A NEW TRACKING PLATFORM (TP)?
- Obtain a unique, simple and scalable Tracking Layer
- Everyone in Jobrapido should be able to design, track and query their own events
- Tracking phase and data processing phase totally decoupled
- Incoming events queryable and processable in real time
- Remove any bottleneck during the event tracking process

11 TP: ARCHITECTURAL OVERVIEW

12 TP TECHNOLOGIES: AVRO (1/3)
Data serialization system that provides a compact, fast, binary data format (avro.apache.org)
MAIN FEATURES:
- Serialization into Avro/Binary or Avro/JSON
- Support for schema evolution: the schema used to read a file does not need to match the schema used to write the file
- Self-documenting: stores schema in file header
- Rich schema language defined in JSON
- Compressible and splittable (good for Spark and MapReduce)
- Can generate Java objects from schemas
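
The schema-evolution point can be illustrated with a small sketch. This is plain Python that only mimics the idea (a reader schema with a default value decoding a record written under an older schema); it is not the Avro library API, and the `Click` event and its field names are hypothetical. Real Avro schema resolution covers more cases, such as type promotion and aliases.

```python
# Conceptual sketch only: mimics how a reader schema with defaults can
# decode records written with an older schema. Not the Avro library API.

WRITER_SCHEMA = {
    "type": "record",
    "name": "Click",  # hypothetical event name
    "fields": [
        {"name": "job_id", "type": "string"},
    ],
}

READER_SCHEMA = {
    "type": "record",
    "name": "Click",
    "fields": [
        {"name": "job_id", "type": "string"},
        # Field added later; the default lets old records still be read.
        {"name": "position", "type": "int", "default": -1},
    ],
}


def resolve(record, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    written = {field["name"] for field in writer_schema["fields"]}
    resolved_record = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            resolved_record[name] = record[name]
        elif "default" in field:
            resolved_record[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved_record


old_record = {"job_id": "abc123"}
resolved = resolve(old_record, WRITER_SCHEMA, READER_SCHEMA)
print(resolved)
```

The reader never needs the writer to be redeployed: records written before `position` existed are still readable, which is what lets event definitions change without breaking consumers.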

13 TP TECHNOLOGIES: AVRO (2/3)
EVERYTHING IS AN EVENT = HEADER + BODY
Each event has the same header, containing some technical fields (shown on the slide). What differs between event types is the body; the tracker fills in only the body attributes.

14 TP TECHNOLOGIES: AVRO (3/3)
BODY: EVERYONE CAN BUILD THEIR OWN EVENT (E.G. THE CLICK EVENT)
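
The "event = header + body" convention could look roughly like the sketch below. The header fields (`event_id`, `event_type`, `timestamp_ms`) and the click body fields are assumptions made for illustration, since the actual field list is not part of this transcription. The schema is built as an Avro-style JSON document using only the Python standard library.

```python
import json
import time
import uuid

# Hypothetical header fields; the real header is not in the transcript.
HEADER_FIELDS = [
    {"name": "event_id", "type": "string"},
    {"name": "event_type", "type": "string"},
    {"name": "timestamp_ms", "type": "long"},
]


def event_schema(event_type, body_fields):
    """Build an Avro-style record schema: shared header + event-specific body."""
    return {
        "type": "record",
        "name": event_type,
        "fields": [
            {"name": "header", "type": {
                "type": "record", "name": "Header", "fields": HEADER_FIELDS}},
            {"name": "body", "type": {
                "type": "record", "name": event_type + "Body",
                "fields": body_fields}},
        ],
    }


def make_event(event_type, body):
    """The tracker supplies only the body; the header is filled in for it."""
    header = {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,
        "timestamp_ms": int(time.time() * 1000),
    }
    return {"header": header, "body": body}


# A team defining its own event only has to describe the body.
click_schema = event_schema("Click", [
    {"name": "job_id", "type": "string"},
    {"name": "position", "type": "int"},
])
click = make_event("Click", {"job_id": "abc123", "position": 3})
print(json.dumps(click_schema, indent=2))
```

Keeping the header identical across event types means generic tooling can route, timestamp and deduplicate events without knowing anything about their bodies.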

15 TP TECHNOLOGIES: KAFKA
Kafka enables the capture, movement, processing and storage of data streams in a distributed, fault-tolerant fashion (kafka.apache.org)
- Events are sent directly to Kafka
- One topic per event type
- Retention policy is set to 15 days
- High throughput:
  - More than 2,000 messages / second (AVG)
  - More than 1.5 MB / second (AVG)
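
As a rough back-of-the-envelope check, the figures above imply an average message size and a lower bound on the data held within the retention window. The sketch assumes decimal megabytes and treats the "more than" averages as exact, so the results are approximate; replication and compression are ignored.

```python
# Figures taken from the slide; treated as exact averages for the estimate.
MESSAGES_PER_SECOND = 2_000
BYTES_PER_SECOND = 1.5e6        # 1.5 MB/s, assuming decimal megabytes
RETENTION_DAYS = 15
SECONDS_PER_DAY = 86_400

retention_seconds = RETENTION_DAYS * SECONDS_PER_DAY

# Average size of a single message.
avg_message_bytes = BYTES_PER_SECOND / MESSAGES_PER_SECOND

# Volume held across all topics during the retention window (per replica).
retained_messages = MESSAGES_PER_SECOND * retention_seconds
retained_bytes = BYTES_PER_SECOND * retention_seconds

print(f"average message size: {avg_message_bytes:.0f} bytes")
print(f"messages retained:    {retained_messages:,}")
print(f"data retained:        {retained_bytes / 1e12:.3f} TB")
```

Under these assumptions a message averages about 750 bytes, and 15 days of retention corresponds to roughly 2.6 billion messages, or a little under 2 TB per replica.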

16 DATA LAKE

17 WHY A DATA LAKE?
"If you think of a data mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples." (James Dixon, CTO of Pentaho)
MAIN GOALS:
- Implement a massive storage platform of RAW DATA
- Immutable MASTER DATA: information is never deleted
- Store as much data as we want at a very CHEAP PRICE
- Data must be available for various tasks, including reporting, visualization, analytics and machine learning

18 DATA LAKE: ARCHITECTURAL OVERVIEW

19 DATA LAKE TECHNOLOGIES: FLUME (1/2)
Distributed data collection service for efficiently collecting and moving large amounts of log data (flume.apache.org)
MAIN FEATURES:
- Distributed, scalable and reliable
- Contextual and dynamic event routing
- Fully extensible (plugin architecture)
- Fully integrated in the Big Data ecosystem
- Easy to install and configure

20 DATA LAKE TECHNOLOGIES: FLUME (2/2)
FLUME AGENT = SOURCE + [INTERCEPTORS] + CHANNEL + SINK
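
The agent formula can be read as a small pipeline. The sketch below is a conceptual model in plain Python, not Flume's API or configuration: the source is any iterable of events, interceptors are functions that modify or drop an event, the channel is an in-memory buffer, and the sink is a callable. The interceptor names and the `tracker-01` host value are hypothetical.

```python
from collections import deque


def drain(channel, sink):
    """Deliver every buffered event to the sink, oldest first."""
    while channel:
        sink(channel.popleft())


def run_agent(source, interceptors, sink, batch_size=2):
    """Move events from source to sink through interceptors and a channel."""
    channel = deque()  # buffer that decouples the source from the sink
    for event in source:
        for intercept in interceptors:
            event = intercept(event)
            if event is None:  # an interceptor dropped the event
                break
        if event is None:
            continue
        channel.append(event)
        if len(channel) >= batch_size:
            drain(channel, sink)
    drain(channel, sink)  # flush whatever is left


def add_host(event):
    """Interceptor that enriches the event (hypothetical host name)."""
    return {**event, "host": "tracker-01"}


def drop_debug(event):
    """Interceptor that filters out debug-level events."""
    return None if event.get("level") == "debug" else event


delivered = []
run_agent(
    source=[
        {"msg": "a", "level": "info"},
        {"msg": "b", "level": "debug"},
        {"msg": "c", "level": "info"},
    ],
    interceptors=[add_host, drop_debug],
    sink=delivered.append,
)
print(delivered)
```

The brackets around INTERCEPTORS in the formula correspond to the interceptor list being optional: passing an empty list makes the agent a plain source-to-sink relay.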

21 REAL-TIME DATA WAREHOUSE INGESTION

22 REAL-TIME DATA WAREHOUSE INGESTION (1/2)
MAIN GOALS:
- Data Lake decoupled from Data Warehouse
- Staging area automatically ingested in real time
- Data marts can be refreshed faster
- No data pipeline to implement or maintain
- Ingestion automatically scheduled, filtered and parsed
- JSON events automatically loaded into target tables
- Events are queryable in real time with the best performance on the market

23 REAL-TIME DATA WAREHOUSE INGESTION (2/2)
KAFKA AND VERTICA WORK TOGETHER:
- Vertica acts as a consumer for Kafka (microbatch)
- Scheduling, filtering, parsing (JSON, Avro, custom)
- Vertica -> Kafka: Vertica is able to send query results to Kafka
- Monitoring data load activities via Web UI:
  - Streams, rates, schedulers, rejections and errors
- In-database monitoring
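
The microbatch idea can be sketched generically: messages are pulled from a topic in small groups, parsed, and loaded into a target table, with unparseable messages set aside as rejections. This is an illustration in plain Python, not Vertica's scheduler or its Kafka integration. Batches here are cut by message count for simplicity, whereas a real scheduler may cut them differently, and all names and sample data are hypothetical.

```python
import json


def microbatches(messages, max_batch):
    """Yield consecutive groups of at most max_batch messages."""
    batch = []
    for message in messages:
        batch.append(message)
        if len(batch) == max_batch:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


def load_batch(raw_batch, table, rejected):
    """Parse JSON messages; good rows go to table, bad ones to rejected."""
    for raw in raw_batch:
        try:
            row = json.loads(raw)
        except json.JSONDecodeError:
            rejected.append(raw)
            continue
        table.append(row)


# Stand-in for messages read from one Kafka topic.
topic = [
    '{"job_id": "a"}',
    'not json',
    '{"job_id": "b"}',
    '{"job_id": "c"}',
    '{"job_id": "d"}',
]

table, rejected, batch_sizes = [], [], []
for batch in microbatches(topic, max_batch=2):
    batch_sizes.append(len(batch))
    load_batch(batch, table, rejected)

print(batch_sizes, len(table), rejected)
```

Tracking rejections separately from loaded rows mirrors the monitoring items listed on the slide, where rejections and errors are reported alongside load rates.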

24 JOBRAPIDO BIG DATA ARCHITECTURE

25 WHAT'S NEXT
- Kafka Connect vs Flafka evaluation
- Enrichment of event streams with Kafka Streams
- Unleash the power of Spark
- Integrate KNIME with the Data Lake
- Implement a lot of Data Marts

26 THANK YOU


More information

The OLX data theory of everything

The OLX data theory of everything The OLX data theory of everything Caspar Schönau Head of Global BI Jakub Orłowski Data engineering manager The biggest internet company that you have never heard of Founded 1915 South-Africa Market cap:

More information

MariaDB MaxScale 2.0, basis for a Two-speed IT architecture

MariaDB MaxScale 2.0, basis for a Two-speed IT architecture MariaDB MaxScale 2.0, basis for a Two-speed IT architecture Harry Timm, Business Development Manager harry.timm@mariadb.com Telef: +49-176-2177 0497 MariaDB FASTEST GROWING OPEN SOURCE DATABASE * Innovation

More information

VOLTDB + HP VERTICA. page

VOLTDB + HP VERTICA. page VOLTDB + HP VERTICA ARCHITECTURE FOR FAST AND BIG DATA ARCHITECTURE FOR FAST + BIG DATA FAST DATA Fast Serve Analytics BIG DATA BI Reporting Fast Operational Database Streaming Analytics Columnar Analytics

More information

WHITEPAPER. The Lambda Architecture Simplified

WHITEPAPER. The Lambda Architecture Simplified WHITEPAPER The Lambda Architecture Simplified DATE: April 2016 A Brief History of the Lambda Architecture The surest sign you have invented something worthwhile is when several other people invent it too.

More information

Innovatus Technologies

Innovatus Technologies HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String

More information

Microsoft Azure Stream Analytics

Microsoft Azure Stream Analytics Microsoft Azure Stream Analytics Marcos Roriz and Markus Endler Laboratory for Advanced Collaboration (LAC) Departamento de Informática (DI) Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio)

More information

Cloud Analytics and Business Intelligence on AWS

Cloud Analytics and Business Intelligence on AWS Cloud Analytics and Business Intelligence on AWS Enterprise Applications Virtual Desktops Sharing & Collaboration Platform Services Analytics Hadoop Real-time Streaming Data Machine Learning Data Warehouse

More information

Building a Data Strategy for a Digital World

Building a Data Strategy for a Digital World Building a Data Strategy for a Digital World Jason Hunter, CTO, APAC Data Challenge: Pushing the Limits of What's Possible The Art of the Possible Multiple Government Agencies Data Hub 100 s of Service

More information

Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's

Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's Building Agile and Resilient Schema Transformations using Apache Kafka and ESB's Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's Ricardo Ferreira

More information

Fast Big Data Analytics with Spark on Tachyon

Fast Big Data Analytics with Spark on Tachyon 1 Fast Big Data Analytics with Spark on Tachyon Shaoshan Liu http://www.meetup.com/tachyon/ 2 Fun Facts Tachyon A tachyon is a particle that always moves faster than light. The word comes from the Greek:

More information

Big Data Hadoop Course Content

Big Data Hadoop Course Content Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux

More information

Streaming Log Analytics with Kafka

Streaming Log Analytics with Kafka Streaming Log Analytics with Kafka Kresten Krab Thorup, Humio CTO Log Everything, Answer Anything, In Real-Time. Why this talk? Humio is a Log Analytics system Designed to run on-prem High volume, real

More information

The Stream Processor as a Database. Ufuk

The Stream Processor as a Database. Ufuk The Stream Processor as a Database Ufuk Celebi @iamuce Realtime Counts and Aggregates The (Classic) Use Case 2 (Real-)Time Series Statistics Stream of Events Real-time Statistics 3 The Architecture collect

More information

MapR Enterprise Hadoop

MapR Enterprise Hadoop 2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS

More information

Un'introduzione a Kafka Streams e KSQL and why they matter! ITOUG Tech Day Roma 1 Febbraio 2018

Un'introduzione a Kafka Streams e KSQL and why they matter! ITOUG Tech Day Roma 1 Febbraio 2018 Un'introduzione a Kafka Streams e KSQL and why they matter! ITOUG Tech Day Roma 1 Febbraio 2018 R E T H I N K I N G Stream Processing with Apache Kafka Kafka the Streaming Data Platform 1.0 Enterprise

More information

Lecture 21 11/27/2017 Next Lecture: Quiz review & project meetings Streaming & Apache Kafka

Lecture 21 11/27/2017 Next Lecture: Quiz review & project meetings Streaming & Apache Kafka Lecture 21 11/27/2017 Next Lecture: Quiz review & project meetings Streaming & Apache Kafka What problem does Kafka solve? Provides a way to deliver updates about changes in state from one service to another

More information

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch Nick Pentreath Nov / 14 / 16 Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch About @MLnick Principal Engineer, IBM Apache Spark PMC Focused on machine learning

More information

Approaching the Petabyte Analytic Database: What I learned

Approaching the Petabyte Analytic Database: What I learned Disclaimer This document is for informational purposes only and is subject to change at any time without notice. The information in this document is proprietary to Actian and no part of this document may

More information

Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect

Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect Igor Roiter Big Data Cloud Solution Architect Working as a Data Specialist for the last 11 years 9 of them as a Consultant specializing

More information