BIG DATA REVOLUTION IN JOBRAPIDO


BIG DATA REVOLUTION IN JOBRAPIDO
Michele Pinto, Big Data Technical Team Leader @ Jobrapido
Big Data Tech 2016, Firenze, October 20, 2016

ABOUT ME
NAME: Michele Pinto
LINKEDIN: https://www.linkedin.com/in/pintomichele
COMPANY WEBSITE: www.jobrapido.com

WHO WE ARE
Jobrapido is the world's leading job-search engine: it analyses and collects all job posts on the web, giving jobseekers all available offers, ranked by relevance for the search they've done.
VISITORS: 1.0 BN visits / year
UNIQUE VISITORS: 35 Mio UVs / month
SUBSCRIBERS: 70+ Mio subscribed users (current stock)
PAGEVIEWS / CLICKS*: 280 Mio PVs / month & 130 Mio clicks / month
JOBS: 20+ Mio jobs at any given time
WEBSITES IN 58 COUNTRIES: head office in Milan + office in Amsterdam
PEOPLE: 100+
* Clicks on job listings (organic + sponsored) and clicks on contextual ads

MOBILE APP (screens: Sign In, Sign Up, CNT Selection, My Searches, My Jobs, Menu)

WHERE WE ARE

THE NEED FOR A BIG DATA ARCHITECTURE (1/2)

THE NEED FOR A BIG DATA ARCHITECTURE (2/2)
MAIN FEATURES:
- SCALE in terms of throughput and computational power, correlated to the data growth rate
- Unify the tracking layer in a single TRACKING PLATFORM
- Place and extract data for analytics into a single DATA LAKE
- REAL-TIME DATA INGESTION in our Data Warehouse
- Drastically REDUCE COMPLEXITY and MAINTENANCE

TRACKING PLATFORM

WHY A NEW TRACKING PLATFORM (TP)?
- Obtain a unique, simple and scalable tracking layer
- Everyone in Jobrapido should design, track and query their own events
- Tracking phase and data-processing phase totally decoupled
- Incoming events queryable and processable in real time
- Remove any bottleneck during the event-tracking process

TP: ARCHITECTURAL OVERVIEW

TP TECHNOLOGIES AVRO (1/3)
Data serialization system that provides a compact, fast, binary data format (avro.apache.org)
MAIN FEATURES:
- Serialization into Avro/Binary or Avro/JSON
- Support for schema evolution: the schema used to read a file does not need to match the schema used to write it
- Self-documenting: stores the schema in the file header
- Rich schema language defined in JSON
- Compressible and splittable (good for Spark and MapReduce)
- Can generate Java objects from schemas

TP TECHNOLOGIES AVRO (2/3)
EVERYTHING IS AN EVENT = HEADER + BODY
Each event has the same identical header, containing some technical fields. What differs between event types is the body: the tracker fills only the body attributes.
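As a sketch of the header + body shape, a hypothetical Avro schema for such an event could look like the following (the field names are illustrative, not Jobrapido's actual schema):

```json
{
  "type": "record",
  "name": "Click",
  "namespace": "com.jobrapido.tracking",
  "fields": [
    {"name": "header", "type": {
      "type": "record",
      "name": "Header",
      "fields": [
        {"name": "eventId",   "type": "string"},
        {"name": "eventType", "type": "string"},
        {"name": "timestamp", "type": "long"}
      ]
    }},
    {"name": "body", "type": {
      "type": "record",
      "name": "ClickBody",
      "fields": [
        {"name": "jobId",       "type": "long"},
        {"name": "searchQuery", "type": ["null", "string"], "default": null}
      ]
    }}
  ]
}
```

Every event type would reuse the same `Header` record, while the body record (`ClickBody` here) varies; nullable fields with defaults are what makes Avro's schema evolution work.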

TP TECHNOLOGIES AVRO (3/3)
BODY: EVERYONE CAN BUILD THEIR OWN EVENT (E.G. THE CLICK EVENT)
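A minimal Python sketch of the idea, assuming hypothetical header fields (`eventId`, `eventType`, `timestamp`): the header is filled automatically, and a team only supplies the body for its own event type. Plain JSON stands in here for the Avro/JSON encoding.

```python
import json
import time
import uuid

def build_event(event_type, body):
    """Wrap an event-specific body with the shared technical header."""
    header = {
        "eventId": str(uuid.uuid4()),   # hypothetical header field
        "eventType": event_type,        # hypothetical header field
        "timestamp": int(time.time() * 1000),
    }
    return {"header": header, "body": body}

# A hypothetical Click event: only the body differs per event type.
click = build_event("Click", {"jobId": 12345, "country": "IT"})
encoded = json.dumps(click)  # Avro/JSON-style textual representation
```

In the real platform the body attributes would be constrained by the event's Avro schema rather than a free-form dict.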

TP TECHNOLOGIES KAFKA
Kafka enables the capture, movement, processing and storage of data streams in a distributed, fault-tolerant fashion (kafka.apache.org)
- Events are sent directly to Kafka
- One topic per event type
- Retention policy is set to 15 days
- High throughput: more than 2,000 messages / second (avg), more than 1.5 MB / second (avg)
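As an illustration of the "one topic per event type, 15-day retention" setup, a topic could be created like this with a recent Kafka CLI (broker address, topic name, partition and replication counts are all hypothetical; 1,296,000,000 ms = 15 days):

```
kafka-topics.sh --create \
  --bootstrap-server kafka01:9092 \
  --topic tp.click \
  --partitions 8 \
  --replication-factor 3 \
  --config retention.ms=1296000000
```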

DATA LAKE

WHY A DATA LAKE?
"If you think of a data mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples." (James Dixon, CTO of Pentaho)
MAIN GOALS:
- Implement a massive storage platform of RAW DATA
- An immutable MASTER DATA set: information is never deleted
- Store as much data as we want at a very CHEAP PRICE
- Data must be available for various tasks, including reporting, visualization, analytics and machine learning

DATA LAKE: ARCHITECTURAL OVERVIEW

DATA LAKE TECHNOLOGIES FLUME (1/2)
Distributed data collection service for efficiently collecting and moving large amounts of log data (flume.apache.org)
MAIN FEATURES:
- Distributed, scalable and reliable
- Contextual and dynamic event routing
- Fully extensible (plugin architecture)
- Fully integrated in the Big Data ecosystem
- Easy to install and configure

DATA LAKE TECHNOLOGIES FLUME (2/2)
FLUME AGENT = SOURCE + [INTERCEPTORS] + CHANNEL + SINK
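A minimal agent of this shape, consuming a tracking topic from Kafka and landing raw events in HDFS, might be configured as below. This is a sketch, not Jobrapido's actual configuration: the agent name, topic, broker address and HDFS path are all illustrative.

```properties
# agent = source + channel + sink
tp-agent.sources  = kafka-src
tp-agent.channels = mem-ch
tp-agent.sinks    = hdfs-snk

# Kafka source: consume the tracking topic
tp-agent.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
tp-agent.sources.kafka-src.kafka.bootstrap.servers = kafka01:9092
tp-agent.sources.kafka-src.kafka.topics = tp.click
tp-agent.sources.kafka-src.channels = mem-ch

# In-memory channel between source and sink
tp-agent.channels.mem-ch.type = memory
tp-agent.channels.mem-ch.capacity = 10000

# HDFS sink: append raw events into the Data Lake, partitioned by day
tp-agent.sinks.hdfs-snk.type = hdfs
tp-agent.sinks.hdfs-snk.hdfs.path = hdfs://namenode/datalake/raw/click/%Y-%m-%d
tp-agent.sinks.hdfs-snk.hdfs.fileType = DataStream
tp-agent.sinks.hdfs-snk.hdfs.useLocalTimeStamp = true
tp-agent.sinks.hdfs-snk.channel = mem-ch
```

Interceptors, omitted here, would slot in between source and channel for routing or enrichment.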

REAL-TIME DATA WAREHOUSE INGESTION

REAL-TIME DATA WAREHOUSE INGESTION (1/2)
MAIN GOALS:
- Data Lake decoupled from the Data Warehouse
- Staging area automatically ingested in real time
- Data marts can be refreshed faster
- No data pipeline to implement or maintain
- Ingestion automatically scheduled, filtered and parsed
- JSON events automatically filled into target tables
- Events are queryable in real time with the best performance on the market

REAL-TIME DATA WAREHOUSE INGESTION (2/2)
KAFKA AND VERTICA WORK TOGETHER:
- Vertica acts as a consumer for Kafka (microbatch)
- Scheduling, filtering, parsing (JSON, Avro, custom)
- Vertica -> Kafka: Vertica is able to send query results back to Kafka
- Monitoring of data-load activities via Web UI: streams, rates, schedulers, rejections and errors
- In-database monitoring
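As a sketch of one such microbatch, Vertica's Kafka integration lets a COPY statement pull directly from a topic and parse each JSON event into a staging table (schema, table, topic and broker names below are hypothetical):

```sql
-- One microbatch: read partition 0 of the click topic from the beginning
-- and parse each JSON event into the staging table.
COPY staging.click_events
SOURCE KafkaSource(stream='tp.click|0|-2',
                   brokers='kafka01:9092',
                   stop_on_eof=true)
PARSER KafkaJSONParser();
```

In production these microbatches are not run by hand: Vertica's scheduler component automates them, which is what makes the staging area appear to be ingested continuously.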

JOBRAPIDO BIG DATA ARCHITECTURE

WHAT'S NEXT
- Kafka Connect vs Flafka evaluation
- Enrichment of event streams with Kafka Streams
- Unleash the power of Spark
- Integrate KNIME with the Data Lake
- Implement a lot of Data Marts

THANK YOU