BIG DATA REVOLUTION AT JOBRAPIDO
Michele Pinto, Big Data Technical Team Leader @ Jobrapido
Big Data Tech 2016, Firenze, October 20, 2016
ABOUT ME
NAME: Michele Pinto
LINKEDIN: https://www.linkedin.com/in/pintomichele
COMPANY WEBSITE: www.jobrapido.com
WHO WE ARE
Jobrapido is the world's leading job-search engine: it analyses and collects all job posts on the web, giving jobseekers all available offers, ordered by relevance based on the search they've done.
VISITORS: 1.0 Bn visits / year
UNIQUE VISITORS: 35 Mio UVs / month
SUBSCRIBERS: 70+ Mio subscribed users (current stock)
PAGEVIEWS / CLICKS*: 280 Mio PVs / month & 130 Mio clicks / month
JOBS: 20+ Mio jobs at any given time
WEBSITES IN 58 COUNTRIES: head office in Milan + office in Amsterdam
PEOPLE: 100+
* Clicks on job listings (organic + sponsored) and clicks on contextual ads
MOBILE APP
Screens: Sign In, Sign Up, CNT Selection, My Searches, My Jobs, Menu
WHERE WE ARE
THE NEED FOR A BIG DATA ARCHITECTURE (1/2)
THE NEED FOR A BIG DATA ARCHITECTURE (2/2)
MAIN FEATURES:
- SCALE in terms of throughput and computational power, in line with the data growth rate
- Unify the tracking layer into a single TRACKING PLATFORM
- Place and extract data for analytics in a single DATA LAKE
- REAL-TIME DATA INGESTION into our Data Warehouse
- Drastically REDUCE COMPLEXITY and MAINTENANCE
TRACKING PLATFORM
WHY A NEW TRACKING PLATFORM (TP)?
- Obtain a unique, simple and scalable Tracking Layer
- Everyone at Jobrapido should be able to design, track and query their own events
- Tracking phase and data-processing phase totally decoupled
- Incoming events queryable and processable in real time
- Remove any bottleneck from the event-tracking process
TP: ARCHITECTURAL OVERVIEW
TP TECHNOLOGIES: AVRO (1/3)
A data serialization system that provides a compact, fast, binary data format (avro.apache.org)
MAIN FEATURES:
- Serialization into Avro/Binary or Avro/JSON
- Support for schema evolution: the schema used to read a file does not need to match the schema used to write it
- Self-documenting: stores the schema in the file header
- Rich schema language defined in JSON
- Compressible and splittable (good for Spark and MapReduce)
- Can generate Java objects from schemas
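To illustrate the JSON schema language and schema evolution, here is a minimal sketch of what one of these schemas could look like (record and field names are hypothetical, not taken from the talk). The `default` on the last field is what lets a reader using this schema consume files written before the field existed:

```json
{
  "type": "record",
  "name": "Click",
  "namespace": "com.jobrapido.tracking",
  "fields": [
    {"name": "jobId", "type": "long"},
    {"name": "timestamp", "type": "long"},
    {"name": "country", "type": "string", "default": "unknown"}
  ]
}
```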
TP TECHNOLOGIES: AVRO (2/3)
EVERYTHING IS AN EVENT = HEADER + BODY
Every event has the same identical header, containing some technical fields; what differs between event types is the body. The tracker fills only the body attributes.
TP TECHNOLOGIES: AVRO (3/3)
BODY: EVERYONE CAN BUILD THEIR OWN EVENT (E.G. THE CLICK EVENT)
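The header + body split can be sketched as follows. This is a plain-Python illustration, not the real tracker code: the header field names are hypothetical, and `json.dumps` stands in for the Avro/JSON encoding mentioned above.

```python
import json
import time
import uuid

def make_event(event_type, body):
    """Wrap a body in the common event header.

    Header field names (eventId, eventType, timestamp) are hypothetical;
    the actual schema is not shown in the slides.
    """
    return {
        "header": {
            "eventId": str(uuid.uuid4()),
            "eventType": event_type,
            "timestamp": int(time.time() * 1000),
        },
        "body": body,
    }

# A hypothetical Click event: the tracker fills only the body.
click = make_event("Click", {"jobId": 123, "country": "IT"})
payload = json.dumps(click)  # stand-in for the Avro/JSON encoding
```

The point of the split is that every consumer can rely on the header being there, while each team is free to define its own body.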
TP TECHNOLOGIES: KAFKA
Kafka enables the capture, movement, processing and storage of data streams in a distributed, fault-tolerant fashion (kafka.apache.org)
- Events are sent directly to Kafka
- One topic per event type
- Retention policy set to 15 days
- High throughput: more than 2,000 messages / second (avg), more than 1.5 MB / second (avg)
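The per-event-type topics with 15-day retention could be created with something like the following (topic name, partition and replication counts, and the ZooKeeper address are illustrative, not from the talk; in Kafka versions current in 2016, topic creation went through ZooKeeper):

```shell
# One topic per event type; retention 15 days = 15 * 24 * 3600 * 1000 ms
kafka-topics.sh --create --zookeeper zk1:2181 \
  --topic click --partitions 6 --replication-factor 3 \
  --config retention.ms=1296000000
```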
DATA LAKE
WHY A DATA LAKE?
"If you think of a data mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples." (James Dixon, CTO of Pentaho)
MAIN GOALS:
- Implement a massive storage platform for RAW DATA
- An immutable MASTER DATA store: information is never deleted
- Store as much data as we want at a very CHEAP PRICE
- Make data available for various tasks, including reporting, visualization, analytics and machine learning
DATA LAKE: ARCHITECTURAL OVERVIEW
DATA LAKE TECHNOLOGIES: FLUME (1/2)
A distributed data collection service for efficiently collecting and moving large amounts of log data (flume.apache.org)
MAIN FEATURES:
- Distributed, scalable and reliable
- Contextual and dynamic event routing
- Fully extensible (plugin architecture)
- Fully integrated into the Big Data ecosystem
- Easy to install and configure
DATA LAKE TECHNOLOGIES: FLUME (2/2)
FLUME AGENT = SOURCE + [INTERCEPTORS] + CHANNEL + SINK
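A sketch of what such an agent could look like in Flume's properties format, reading a per-event-type Kafka topic and landing raw files in the Data Lake on HDFS. Agent name, topic, paths and the timestamp interceptor are illustrative assumptions, and property names vary somewhat across Flume versions:

```properties
# SOURCE + [INTERCEPTORS] + CHANNEL + SINK, as above
agent.sources  = kafka-src
agent.channels = mem-ch
agent.sinks    = hdfs-sink

# Source: consume a per-event-type Kafka topic
agent.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.kafka-src.kafka.bootstrap.servers = broker1:9092
agent.sources.kafka-src.kafka.topics = click
agent.sources.kafka-src.channels = mem-ch

# Interceptor: stamp each event so the sink can bucket by date
agent.sources.kafka-src.interceptors = i1
agent.sources.kafka-src.interceptors.i1.type = timestamp

# Channel: in-memory buffer between source and sink
agent.channels.mem-ch.type = memory
agent.channels.mem-ch.capacity = 10000

# Sink: append raw events into the Data Lake, partitioned by day
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = mem-ch
agent.sinks.hdfs-sink.hdfs.path = /datalake/raw/click/%Y/%m/%d
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
```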
REAL-TIME DATA WAREHOUSE INGESTION
REAL-TIME DATA WAREHOUSE INGESTION (1/2)
MAIN GOALS:
- Data Lake decoupled from the Data Warehouse
- Staging area automatically ingested in real time
- Data marts can be refreshed faster
- No data pipeline to implement or maintain: ingestion is automatically scheduled, filtered and parsed
- JSON events automatically loaded into target tables
- Events queryable in real time with the best performance on the market
REAL-TIME DATA WAREHOUSE INGESTION (2/2)
KAFKA AND VERTICA WORK TOGETHER:
- Vertica acts as a consumer for Kafka (microbatch)
- Scheduling, filtering, parsing (JSON, Avro, custom)
- Vertica -> Kafka: Vertica can also send query results to Kafka
- Monitoring of data-load activities via Web UI: streams, rates, schedulers, rejections and errors
- In-database monitoring
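Vertica's Kafka integration is driven by its vkconfig utility. Very roughly, wiring a topic into a staging table looks like the following sketch; the schema, table and topic names are illustrative, and exact subcommands and flags vary by Vertica version, so treat this as an outline rather than working commands:

```shell
# Create a scheduler, point it at the Kafka cluster, and define a
# microbatch that keeps a staging table in sync with a topic
vkconfig scheduler  --create --config-schema tracking_sched
vkconfig cluster    --create --config-schema tracking_sched \
                    --cluster kafka --hosts broker1:9092
vkconfig source     --create --config-schema tracking_sched \
                    --source click --cluster kafka
vkconfig target     --create --config-schema tracking_sched \
                    --target-schema staging --target-table click
vkconfig microbatch --create --config-schema tracking_sched \
                    --microbatch click_mb --add-source click \
                    --add-source-cluster kafka \
                    --target-schema staging --target-table click
vkconfig launch     --config-schema tracking_sched
```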
JOBRAPIDO BIG DATA ARCHITECTURE
WHAT'S NEXT
- Kafka Connect vs Flafka evaluation
- Enrichment of event streams with Kafka Streams
- Unleash the power of Spark
- Integrate KNIME with the Data Lake
- Implement a lot of Data Marts
GRAZIE (THANK YOU)