Scaling the Yelp s logging pipeline with Apache Kafka. Enrico
|
|
- Harvey Page
- 5 years ago
- Views:
Transcription
1 Scaling the Yelp s logging pipeline with Apache Kafka Enrico Canzonieri
2 Yelp s Mission Connecting people with great local businesses.
3 Yelp Stats As of Q M 102M 70% 32
4 What to expect High-level architecture overview Kafka best-practices No code
5 Logging? What is that?
6 A log is a stream of events
7 Logs provide valuable operational and business visibility.
8 Producing and consuming logs should be easy
9 Logs at scale
10 Consuming logs becomes hard Solution: Centralized logging!
11 Simple aggregation strategy: upload all logs to a centralized datastore Pros: Easy to implement (cron, logrotate, etc) Unique place to access logs Cons: No real time streaming Colocation not aggregation Amazon S3
12 Log aggregation systems
13 Scribe Push based architecture
14 Yelp s logging pipeline V1.0 Yelp Server Web app Log aggregator Scribe Leaf Yelp Server S3 uploader Scribe aggregator xinetd tail -follow=name /nail/scribe/<category>/current Web app Scribe Leaf Amazon S3 consumer consumer
15 Why use Scribe? Simple to configure Multilang support (via Thrift) Stable (most of the times) and fast Support arbitrary message size
16 Why move away from Scribe? No support for log consumption Doesn t really scale No replication Abandoned project Lack of plugins
17 Stream processing: Statmonster
18 Stream processing: Statmonster Statmonster is a real-time metrics pipeline The first stage extract multiple metrics from each line Consumes most of the high volume logs
19 Stream processing: Statmonster In 2013 the metrics from high volume logs started falling behind The first stage of the pipeline was too slow to keep up with the increased volume of logs
20 Stream processing: Statmonster Short term solution: sample logs to save timeliness Long term solution: run Statmonster on Apache Storm.
21 Stream processing: Statmonster Short term solution: sample logs to save timeliness Long term solution: run Statmonster on Apache Storm.
22 The real bottleneck Can t scale the consumer without increasing the number of physical log aggregators Aggregators tail -f /nail/scribe/access tail -f /nail/scribe/access tail -f /nail/scribe/access Statmonster Statmonster Statmonster
23 Need to scale the source!
24 Kafka into the mix High-throughput Publish-subscribe messaging system Distributed, replicated commit log Client library available for a variety of languages Configurable retention
25 Kafka Message (topic, partition, offset) Message Topic Partition Partition Partition Old New
26 Kafka Producer Java producer is async by default Messages are produced in batch Support either at-least-once or atmost-once semantic Support compression out of the box
27 Kafka Consumer Identified by a group id Consumers coordinate to consume from different partitions Ability to re-play messages Native support for offset commit
28 Architecture overview
29 Yelp s logging pipeline V1.5 Amazon S3 Scribe Leaf Sekretar Scribe aggregator Scribe Leaf Kafka consumer Legacy consumer Legacy consumer Kafka consumer
30 Sekretar Act as a Scribe aggregator speaking the Scribe protocol Produce messages to Kafka Map a log category into a topic Act as an intermediate buffer Handle big messages
31 Statmonster 2.0 access partition 0 Statmonster Kafka consumer access partition 1 Statmonster Kafka consumer access partition N Statmonster Kafka consumer
32 Yelp s logging pipeline V2.0 Amazon S3 Scribe Leaf Sekretar Scribe Leaf Legacy consumer Legacy consumer Scribe Kafka Reader Kafka consumer
33 Multi-regional architecture Data center 1 Data center 2 Aggregate logs consumer Sekretar Kafka consumer Kafka consumer Sekretar Amazon S3 Leaf Leaf
34 Configuration
35 Availability vs Consistency Cluster side: Unclean leader election Replication factor Disabled unclean.leader.election.enable 3: 1 leader + 2 followers default.replication.factor Minimum in-sync replicas (ISR) Quorum: 3/2 + 1 = 2 min. insync.replicas Producer side: All (or -1): all replicas in the ISR need to ack acks acks
36 Provisioning partitions How many partitions for a topic? Producer Topic 1 Partition 1 Consumer Producer Partition 2 Consumer Group 1 Partition 3 Consumer Producer Producer Partition 1 Partition 2 Consumer Consumer Group 2 Topic 2
37 Criteria 1: consumer Consumer speed Number of different consumer groups
38 Criteria 2: Kafka Physical Limits of a single broker egress/ingress Log retention Recovery speed upon broker failures
39 Criteria 3: producer Ingress rate into Kafka in msg/s and bytes/s
40 Accounting for spikes Message rate (msg/s) percentiles: 50, 75, 90, 95, 99, 99.5
41 Partitions auto-creation Regular traffic increase Automatically increase the number of partitions based on 99th percentile of bytes/s and msg/s over a time window of 3 days. Use a combined threshold of 700 msg/s and 500 KiB/s.
42 Tooling
43 Kafka Manager
44 Kafka Utils Contains several command line tools to help operating and maintaining a Kafka cluster. Cluster rolling restart Consumer offset management Cluster rebalance and broker decommission Healthchecks
45 Monitoring
46 Monitoring Sekretar Ingress: msg/s, bytes/s Egress: msg/s, (compressed) bytes/s Message delay Producer latency Producer memory buffer Timed-out message count
47 Monitoring Sekretar Unhealthy Kafka cluster
48 Kafka monitoring Kafka exports metrics via jmx ( io/1.0/kafka/monitoring.html): Offline partitions Under replicated partitions Controller count Kafka-Utils includes a check for min.isr
49 Consumer monitoring Consumer speed (msg/s) Kafka speed Consumer lag (msg)
50 Consumer monitoring Consumer slow down or logs volume increase
51 Apache Yelp 18 clusters ~ 100 brokers > 20K topics ~ 45K partitions > 5TB of data per day ~ 25 Billion messages per day
52 engineeringblog.yelp.com github.com/yelp
Intra-cluster Replication for Apache Kafka. Jun Rao
Intra-cluster Replication for Apache Kafka Jun Rao About myself Engineer at LinkedIn since 2010 Worked on Apache Kafka and Cassandra Database researcher at IBM Outline Overview of Kafka Kafka architecture
More informationKafka-Utils Documentation
Kafka-Utils Documentation Release 1.2.0 Yelp Inc. Jun 20, 2017 Contents 1 Description 1 2 How to install 3 2.1 Configuration............................................... 3 2.2 Cluster Manager.............................................
More informationA Distributed System Case Study: Apache Kafka. High throughput messaging for diverse consumers
A Distributed System Case Study: Apache Kafka High throughput messaging for diverse consumers As always, this is not a tutorial Some of the concepts may no longer be part of the current system or implemented
More informationIntroduction to Kafka (and why you care)
Introduction to Kafka (and why you care) Richard Nikula VP, Product Development and Support Nastel Technologies, Inc. 2 Introduction Richard Nikula VP of Product Development and Support Involved in MQ
More informationrkafka rkafka is a package created to expose functionalities provided by Apache Kafka in the R layer. Version 1.1
rkafka rkafka is a package created to expose functionalities provided by Apache Kafka in the R layer. Version 1.1 Wednesday 28 th June, 2017 rkafka Shruti Gupta Wednesday 28 th June, 2017 Contents 1 Introduction
More informationFluentd + MongoDB + Spark = Awesome Sauce
Fluentd + MongoDB + Spark = Awesome Sauce Nishant Sahay, Sr. Architect, Wipro Limited Bhavani Ananth, Tech Manager, Wipro Limited Your company logo here Wipro Open Source Practice: Vision & Mission Vision
More informationData Acquisition. The reference Big Data stack
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Data Acquisition Corso di Sistemi e Architetture per Big Data A.A. 2017/18 Valeria Cardellini The reference
More informationDeep Dive Amazon Kinesis. Ian Meyers, Principal Solution Architect - Amazon Web Services
Deep Dive Amazon Kinesis Ian Meyers, Principal Solution Architect - Amazon Web Services Analytics Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure
More informationBuilding Durable Real-time Data Pipeline
Building Durable Real-time Data Pipeline Apache BookKeeper at Twitter @sijieg Twitter Background Layered Architecture Agenda Design Details Performance Scale @Twitter Q & A Publish-Subscribe Online services
More informationInstalling and configuring Apache Kafka
3 Installing and configuring Date of Publish: 2018-08-13 http://docs.hortonworks.com Contents Installing Kafka...3 Prerequisites... 3 Installing Kafka Using Ambari... 3... 9 Preparing the Environment...9
More informationData Acquisition. The reference Big Data stack
Università degli Studi di Roma Tor Vergata Dipartimento di Ingegneria Civile e Ingegneria Informatica Data Acquisition Corso di Sistemi e Architetture per Big Data A.A. 2016/17 Valeria Cardellini The reference
More informationWHITE PAPER. Reference Guide for Deploying and Configuring Apache Kafka
WHITE PAPER Reference Guide for Deploying and Configuring Apache Kafka Revised: 02/2015 Table of Content 1. Introduction 3 2. Apache Kafka Technology Overview 3 3. Common Use Cases for Kafka 4 4. Deploying
More informationKafka Integration User Guide. Functional Area: Kafka Integration. Geneos Release: v4.7. Document Version: v1.0.0
Kafka Integration User Guide Functional Area: Kafka Integration Geneos Release: v4.7 Document Version: v1.0.0 Date Published: 23 July 2018 Copyright 2018. ITRS Group Ltd. All rights reserved. Information
More informationBuilding LinkedIn s Real-time Data Pipeline. Jay Kreps
Building LinkedIn s Real-time Data Pipeline Jay Kreps What is a data pipeline? What data is there? Database data Activity data Page Views, Ad Impressions, etc Messaging JMS, AMQP, etc Application and
More informationSeagull: A distributed, fault tolerant, concurrent task runner. Sagar Patwardhan
Seagull: A distributed, fault tolerant, concurrent task runner Sagar Patwardhan sagarp@yelp.com Yelp s Mission Connecting people with great local businesses. Yelp scale Outline What is Seagull? Why did
More informationTools for Social Networking Infrastructures
Tools for Social Networking Infrastructures 1 Cassandra - a decentralised structured storage system Problem : Facebook Inbox Search hundreds of millions of users distributed infrastructure inbox changes
More information@joerg_schad Nightmares of a Container Orchestration System
@joerg_schad Nightmares of a Container Orchestration System 2017 Mesosphere, Inc. All Rights Reserved. 1 Jörg Schad Distributed Systems Engineer @joerg_schad Jan Repnak Support Engineer/ Solution Architect
More informationUsing the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver
Using the SDACK Architecture to Build a Big Data Product Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver Outline A Threat Analytic Big Data product The SDACK Architecture Akka Streams and data
More informationAuto Management for Apache Kafka and Distributed Stateful System in General
Auto Management for Apache Kafka and Distributed Stateful System in General Jiangjie (Becket) Qin Data Infrastructure @LinkedIn GIAC 2017, 12/23/17@Shanghai Agenda Kafka introduction and terminologies
More informationTyphoon: An SDN Enhanced Real-Time Big Data Streaming Framework
Typhoon: An SDN Enhanced Real-Time Big Data Streaming Framework Junguk Cho, Hyunseok Chang, Sarit Mukherjee, T.V. Lakshman, and Jacobus Van der Merwe 1 Big Data Era Big data analysis is increasingly common
More informationIntroduction to Apache Kafka
Introduction to Apache Kafka Chris Curtin Head of Technical Research Atlanta Java Users Group March 2013 About Me 20+ years in technology Head of Technical Research at Silverpop (12 + years at Silverpop)
More informationCisco Tetration Analytics
Cisco Tetration Analytics Enhanced security and operations with real time analytics Christopher Say (CCIE RS SP) Consulting System Engineer csaychoh@cisco.com Challenges in operating a hybrid data center
More informationKonstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu } Introduction } Architecture } File
More informationMigrating to Cassandra in the Cloud, the Netflix Way
Migrating to Cassandra in the Cloud, the Netflix Way Jason Brown - @jasobrown Senior Software Engineer, Netflix Tech History, 1998-2008 In the beginning, there was the webapp and a single database in a
More informationData pipelines with PostgreSQL & Kafka
Data pipelines with PostgreSQL & Kafka Oskari Saarenmaa PostgresConf US 2018 - Jersey City Agenda 1. Introduction 2. Data pipelines, old and new 3. Apache Kafka 4. Sample data pipeline with Kafka & PostgreSQL
More informationHow Eventual is Eventual Consistency?
Probabilistically Bounded Staleness How Eventual is Eventual Consistency? Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, Joseph M. Hellerstein, Ion Stoica (UC Berkeley) BashoChats 002, 28 February
More informationPostgreSQL to MySQL A DBA's Perspective. Patrick
PostgreSQL to MySQL A DBA's Perspective Patrick King @mr_mustash Yelp s Mission Connecting people with great local businesses. My Database Experience Started using Postgres 7 years ago Postgres 8.4 (released
More informationmicrosoft
70-775.microsoft Number: 70-775 Passing Score: 800 Time Limit: 120 min Exam A QUESTION 1 Note: This question is part of a series of questions that present the same scenario. Each question in the series
More informationApache Kafka a system optimized for writing. Bernhard Hopfenmüller. 23. Oktober 2018
Apache Kafka...... a system optimized for writing Bernhard Hopfenmüller 23. Oktober 2018 whoami Bernhard Hopfenmüller IT Consultant @ ATIX AG IRC: Fobhep github.com/fobhep whoarewe The Linux & Open Source
More informationIntroduc)on to Apache Ka1a. Jun Rao Co- founder of Confluent
Introduc)on to Apache Ka1a Jun Rao Co- founder of Confluent Agenda Why people use Ka1a Technical overview of Ka1a What s coming What s Apache Ka1a Distributed, high throughput pub/sub system Ka1a Usage
More informationWhich technology to choose in AWS?
Which technology to choose in AWS? RDS / Aurora / Roll-your-own April 17, 2018 Daniel Kowalewski Senior Technical Operations Engineer Percona 1 2017 Percona AWS MySQL options RDS for MySQL Aurora MySQL
More informationREAL-TIME ANALYTICS WITH APACHE STORM
REAL-TIME ANALYTICS WITH APACHE STORM Mevlut Demir PhD Student IN TODAY S TALK 1- Problem Formulation 2- A Real-Time Framework and Its Components with an existing applications 3- Proposed Framework 4-
More informationDistributed ETL. A lightweight, pluggable, and scalable ingestion service for real-time data. Joe Wang
A lightweight, pluggable, and scalable ingestion service for real-time data ABSTRACT This paper provides the motivation, implementation details, and evaluation of a lightweight distributed extract-transform-load
More informationQunar Performs Real-Time Data Analytics up to 300x Faster with Alluxio
CASE STUDY Qunar Performs Real-Time Data Analytics up to 300x Faster with Alluxio Xueyan Li, Lei Xu, and Xiaoxu Lv Software Engineers at Qunar At Qunar, we have been running Alluxio in production for over
More informationMicroservices, Messaging and Science Gateways. Review microservices for science gateways and then discuss messaging systems.
Microservices, Messaging and Science Gateways Review microservices for science gateways and then discuss messaging systems. Micro- Services Distributed Systems DevOps The Gateway Octopus Diagram Browser
More informationData Infrastructure at LinkedIn. Shirshanka Das XLDB 2011
Data Infrastructure at LinkedIn Shirshanka Das XLDB 2011 1 Me UCLA Ph.D. 2005 (Distributed protocols in content delivery networks) PayPal (Web frameworks and Session Stores) Yahoo! (Serving Infrastructure,
More informationScalable Streaming Analytics
Scalable Streaming Analytics KARTHIK RAMASAMY @karthikz TALK OUTLINE BEGIN I! II ( III b Overview Storm Overview Storm Internals IV Z V K Heron Operational Experiences END WHAT IS ANALYTICS? according
More informationEvolution of Big Data Facebook. Architecture Summit, Shenzhen, August 2012 Ashish Thusoo
Evolution of Big Data Architectures@ Facebook Architecture Summit, Shenzhen, August 2012 Ashish Thusoo About Me Currently Co-founder/CEO of Qubole Ran the Data Infrastructure Team at Facebook till 2011
More informationToward a Reliable Data Transport Architecture for Optical Burst-Switched Networks
Toward a Reliable Data Transport Architecture for Optical Burst-Switched Networks Dr. Vinod Vokkarane Assistant Professor, Computer and Information Science Co-Director, Advanced Computer Networks Lab University
More informationApache Storm. Hortonworks Inc Page 1
Apache Storm Page 1 What is Storm? Real time stream processing framework Scalable Up to 1 million tuples per second per node Fault Tolerant Tasks reassigned on failure Guaranteed Processing At least once
More information18-hdfs-gfs.txt Thu Oct 27 10:05: Notes on Parallel File Systems: HDFS & GFS , Fall 2011 Carnegie Mellon University Randal E.
18-hdfs-gfs.txt Thu Oct 27 10:05:07 2011 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2011 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File
More informationStreaming Log Analytics with Kafka
Streaming Log Analytics with Kafka Kresten Krab Thorup, Humio CTO Log Everything, Answer Anything, In Real-Time. Why this talk? Humio is a Log Analytics system Designed to run on-prem High volume, real
More informationFROM LEGACY, TO BATCH, TO NEAR REAL-TIME. Marc Sturlese, Dani Solà
FROM LEGACY, TO BATCH, TO NEAR REAL-TIME Marc Sturlese, Dani Solà WHO ARE WE? Marc Sturlese - @sturlese Backend engineer, focused on R&D Interests: search, scalability Dani Solà - @dani_sola Backend engineer
More informationDistributed File Systems II
Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation
More informationLecture 21 11/27/2017 Next Lecture: Quiz review & project meetings Streaming & Apache Kafka
Lecture 21 11/27/2017 Next Lecture: Quiz review & project meetings Streaming & Apache Kafka What problem does Kafka solve? Provides a way to deliver updates about changes in state from one service to another
More informationEvent Streams using Apache Kafka
Event Streams using Apache Kafka And how it relates to IBM MQ Andrew Schofield Chief Architect, Event Streams STSM, IBM Messaging, Hursley Park Event-driven systems deliver more engaging customer experiences
More informationECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 1: Distributed File Systems GFS (The Google File System) 1 Filesystems
More informationMODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS
MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale
More information4/9/2018 Week 13-A Sangmi Lee Pallickara. CS435 Introduction to Big Data Spring 2018 Colorado State University. FAQs. Architecture of GFS
W13.A.0.0 CS435 Introduction to Big Data W13.A.1 FAQs Programming Assignment 3 has been posted PART 2. LARGE SCALE DATA STORAGE SYSTEMS DISTRIBUTED FILE SYSTEMS Recitations Apache Spark tutorial 1 and
More informationPaxos Replicated State Machines as the Basis of a High- Performance Data Store
Paxos Replicated State Machines as the Basis of a High- Performance Data Store William J. Bolosky, Dexter Bradshaw, Randolph B. Haagens, Norbert P. Kusters and Peng Li March 30, 2011 Q: How to build a
More informationDeveloping Microsoft Azure Solutions (70-532) Syllabus
Developing Microsoft Azure Solutions (70-532) Syllabus Cloud Computing Introduction What is Cloud Computing Cloud Characteristics Cloud Computing Service Models Deployment Models in Cloud Computing Advantages
More informationBig Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara
Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case
More informationBig Data Infrastructure at Spotify
Big Data Infrastructure at Spotify Wouter de Bie Team Lead Data Infrastructure September 26, 2013 2 Who am I? According to ZDNet: "The work they have done to improve the Apache Hive data warehouse system
More informationTransformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's
Building Agile and Resilient Schema Transformations using Apache Kafka and ESB's Transformation-free Data Pipelines by combining the Power of Apache Kafka and the Flexibility of the ESB's Ricardo Ferreira
More informationHDFS Architecture. Gregory Kesden, CSE-291 (Storage Systems) Fall 2017
HDFS Architecture Gregory Kesden, CSE-291 (Storage Systems) Fall 2017 Based Upon: http://hadoop.apache.org/docs/r3.0.0-alpha1/hadoopproject-dist/hadoop-hdfs/hdfsdesign.html Assumptions At scale, hardware
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung SOSP 2003 presented by Kun Suo Outline GFS Background, Concepts and Key words Example of GFS Operations Some optimizations in
More informationApache Hadoop Goes Realtime at Facebook. Himanshu Sharma
Apache Hadoop Goes Realtime at Facebook Guide - Dr. Sunny S. Chung Presented By- Anand K Singh Himanshu Sharma Index Problem with Current Stack Apache Hadoop and Hbase Zookeeper Applications of HBase at
More informationHow we scaled push messaging for millions of Netflix devices. Susheel Aroskar Cloud Gateway
How we scaled push messaging for millions of Netflix devices Susheel Aroskar Cloud Gateway Why do we need push? How I spend my time in Netflix application... What is push? What is push? How you can build
More informationIntroduction to Hadoop. Owen O Malley Yahoo!, Grid Team
Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since
More informationCSE 124: Networked Services Lecture-16
Fall 2010 CSE 124: Networked Services Lecture-16 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa10/cse124 11/23/2010 CSE 124 Networked Services Fall 2010 1 Updates PlanetLab experiments
More informationDeveloping Microsoft Azure Solutions (70-532) Syllabus
Developing Microsoft Azure Solutions (70-532) Syllabus Cloud Computing Introduction What is Cloud Computing Cloud Characteristics Cloud Computing Service Models Deployment Models in Cloud Computing Advantages
More informationVoldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation
Voldemort Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/29 Outline 1 2 3 Smruti R. Sarangi Leader Election 2/29 Data
More informationCockroachDB on DC/OS. Ben Darnell, CTO, Cockroach Labs
CockroachDB on DC/OS Ben Darnell, CTO, Cockroach Labs Agenda A cloud-native database CockroachDB on DC/OS Why CockroachDB Demo! Cloud-Native Database What is Cloud-Native? Horizontally scalable Individual
More informationThe Google File System. Alexandru Costan
1 The Google File System Alexandru Costan Actions on Big Data 2 Storage Analysis Acquisition Handling the data stream Data structured unstructured semi-structured Results Transactions Outline File systems
More informationRoad to Auto Scaling
Road to Auto Scaling Varun Thacker Lucidworks Apache Lucene/Solr Committer, and PMC member Agenda APIs Metrics Recipes Auto-Scale Triggers SolrCloud Overview ZooKee per Lots Shard 1 Leader Shard 3 Replica
More informationMesos at Yelp: Building a production ready PaaS. Rob Johnson
Mesos at Yelp: Building a production ready PaaS Rob Johnson robj@yelp.com/@rob_johnson_ Who Am I: - Rob Johnson - Operations Team at Yelp - Spend most of my time working on PaaSTA Yelp s Mission: Connecting
More informationHighway to Hell or Stairway to Cloud?
Highway to Hell or Stairway to Cloud? Percona Live 2018, Frankfurt ALEXANDER KUKUSHKIN 06-11-2018 ABOUT ME Alexander Kukushkin Database Engineer @ZalandoTech The Patroni guy alexander.kukushkin@zalando.de
More informationThe Google File System (GFS)
1 The Google File System (GFS) CS60002: Distributed Systems Antonio Bruto da Costa Ph.D. Student, Formal Methods Lab, Dept. of Computer Sc. & Engg., Indian Institute of Technology Kharagpur 2 Design constraints
More informationCSE 124: Networked Services Fall 2009 Lecture-19
CSE 124: Networked Services Fall 2009 Lecture-19 Instructor: B. S. Manoj, Ph.D http://cseweb.ucsd.edu/classes/fa09/cse124 Some of these slides are adapted from various sources/individuals including but
More informationIf Only I Could Find My Databases-Service Discovery with SmartStack and MySQL. Susanne Lehmann, Yelp
If Only I Could Find My Databases-Service Discovery with SmartStack and MySQL Susanne Lehmann, Yelp susanne@yelp.com Me I ve been a DBA for 16 years I ve been working with all kinds of relational and NoSQL
More informationEvaluation of Apache Kafka in Real-Time Big Data Pipeline Architecture
Evaluation of Apache Kafka in Real-Time Big Data Pipeline Architecture Thandar Aung, Hla Yin Min, Aung Htein Maw University of Information Technology Yangon, Myanmar thandaraung@uit.edu.mm, hlayinmin@uit.edu.mm,
More informationSQL, NoSQL, MongoDB. CSE-291 (Cloud Computing) Fall 2016 Gregory Kesden
SQL, NoSQL, MongoDB CSE-291 (Cloud Computing) Fall 2016 Gregory Kesden SQL Databases Really better called Relational Databases Key construct is the Relation, a.k.a. the table Rows represent records Columns
More informationLet the data flow! Data Streaming & Messaging with Apache Kafka Frank Pientka. Materna GmbH
Let the data flow! Data Streaming & Messaging with Apache Kafka Frank Pientka Wer ist Frank Pientka? Dipl.-Informatiker (TH Karlsruhe) Verheiratet, 2 Töchter Principal Software Architect in Dortmund Fast
More informationAmazon Aurora Deep Dive
Amazon Aurora Deep Dive Enterprise-class database for the cloud Damián Arregui, Solutions Architect, AWS October 27 th, 2016 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Enterprise
More informationReliable Distributed Messaging with HornetQ
Reliable Distributed Messaging with HornetQ Lin Zhao Software Engineer, Groupon lin@groupon.com Agenda Introduction MessageBus Design Client API Monitoring Comparison with HornetQ Cluster Future Work Introduction
More informationMegastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database.
Megastore: Providing Scalable, Highly Available Storage for Interactive Services & Spanner: Google s Globally- Distributed Database. Presented by Kewei Li The Problem db nosql complex legacy tuning expensive
More informationEVCache: Lowering Costs for a Low Latency Cache with RocksDB. Scott Mansfield Vu Nguyen EVCache
EVCache: Lowering Costs for a Low Latency Cache with RocksDB Scott Mansfield Vu Nguyen EVCache 90 seconds What do caches touch? Signing up* Logging in Choosing a profile Picking liked videos
More informationMigrating to Aurora MySQL and Monitoring with PMM. Percona Technical Webinars August 1, 2018
Migrating to Aurora MySQL and Monitoring with PMM Percona Technical Webinars August 1, 2018 Introductions Introduction Vineet Khanna (Autodesk) Senior Database Engineer vineet.khanna@autodesk.com Tate
More informationBuilding Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned. Yaroslav Tkachenko Senior Data Engineer at Activision
Building Scalable and Extendable Data Pipeline for Call of Duty Games: Lessons Learned Yaroslav Tkachenko Senior Data Engineer at Activision 1+ PB Data lake size (AWS S3) Number of topics in the biggest
More informationHDFS Architecture Guide
by Dhruba Borthakur Table of contents 1 Introduction...3 2 Assumptions and Goals...3 2.1 Hardware Failure... 3 2.2 Streaming Data Access...3 2.3 Large Data Sets...3 2.4 Simple Coherency Model... 4 2.5
More informationGFS: The Google File System. Dr. Yingwu Zhu
GFS: The Google File System Dr. Yingwu Zhu Motivating Application: Google Crawl the whole web Store it all on one big disk Process users searches on one big CPU More storage, CPU required than one PC can
More informationDistributed Data Analytics Stream Processing
G-3.1.09, Campus III Hasso Plattner Institut Types of Systems Services (online systems) Accept requests and send responses Performance measure: response time and availability Expected runtime: milliseconds
More informationVOLTDB + HP VERTICA. page
VOLTDB + HP VERTICA ARCHITECTURE FOR FAST AND BIG DATA ARCHITECTURE FOR FAST + BIG DATA FAST DATA Fast Serve Analytics BIG DATA BI Reporting Fast Operational Database Streaming Analytics Columnar Analytics
More information18-hdfs-gfs.txt Thu Nov 01 09:53: Notes on Parallel File Systems: HDFS & GFS , Fall 2012 Carnegie Mellon University Randal E.
18-hdfs-gfs.txt Thu Nov 01 09:53:32 2012 1 Notes on Parallel File Systems: HDFS & GFS 15-440, Fall 2012 Carnegie Mellon University Randal E. Bryant References: Ghemawat, Gobioff, Leung, "The Google File
More informationThe Google File System
The Google File System Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google SOSP 03, October 19 22, 2003, New York, USA Hyeon-Gyu Lee, and Yeong-Jae Woo Memory & Storage Architecture Lab. School
More informationTail Latency in ZooKeeper and a Simple Reimplementation
Tail Latency in ZooKeeper and a Simple Reimplementation Michael Graczyk Abstract ZooKeeper [1] is a commonly used service for coordinating distributed applications. ZooKeeper uses leader-based atomic broadcast
More informationPerformance and Forgiveness. June 23, 2008 Margo Seltzer Harvard University School of Engineering and Applied Sciences
Performance and Forgiveness June 23, 2008 Margo Seltzer Harvard University School of Engineering and Applied Sciences Margo Seltzer Architect Outline A consistency primer Techniques and costs of consistency
More informationFlash Storage Complementing a Data Lake for Real-Time Insight
Flash Storage Complementing a Data Lake for Real-Time Insight Dr. Sanhita Sarkar Global Director, Analytics Software Development August 7, 2018 Agenda 1 2 3 4 5 Delivering insight along the entire spectrum
More informationExam C IBM Cloud Platform Application Development v2 Sample Test
Exam C5050 384 IBM Cloud Platform Application Development v2 Sample Test 1. What is an advantage of using managed services in IBM Bluemix Platform as a Service (PaaS)? A. The Bluemix cloud determines the
More informationMonitor your containers with the Elastic Stack. Monica Sarbu
Monitor your containers with the Elastic Stack Monica Sarbu Monica Sarbu Team lead, Beats team monica@elastic.co 3 Monitor your containers with the Elastic Stack Elastic Stack 5 Beats are lightweight shippers
More informationServers fail, who cares? (Answer: I do, sort of) Gregg Ulrich, #netflixcloud #cassandra12
Servers fail, who cares? (Answer: I do, sort of) Gregg Ulrich, Netflix @eatupmartha #netflixcloud #cassandra12 1 June 29, 2012 2 3 4 [1] 5 From the Netflix tech blog: Cassandra, our distributed cloud persistence
More informationAmazon Aurora Deep Dive
Amazon Aurora Deep Dive Kevin Jernigan, Sr. Product Manager Amazon Aurora PostgreSQL Amazon RDS for PostgreSQL May 18, 2017 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda
More informationFast Data apps with Alpakka Kafka connector. Sean Glover,
Fast Data apps with Alpakka Kafka connector Sean Glover, Lightbend @seg1o Who am I? I m Sean Glover Senior Software Engineer at Lightbend Member of the Fast Data Platform team Organizer of Scala Toronto
More informationTwitter Heron: Stream Processing at Scale
Twitter Heron: Stream Processing at Scale Saiyam Kohli December 8th, 2016 CIS 611 Research Paper Presentation -Sun Sunnie Chung TWITTER IS A REAL TIME ABSTRACT We process billions of events on Twitter
More informationAdaptive replica consistency policy for Kafka
Adaptive replica consistency policy for Kafka Zonghuai Guo 1,2,*, Shiwang Ding 1,2 1 Chongqing University of Posts and Telecommunications, 400065, Nan'an District, Chongqing, P.R.China 2 Chongqing Mobile
More informationCS435 Introduction to Big Data FALL 2018 Colorado State University. 11/7/2018 Week 12-B Sangmi Lee Pallickara. FAQs
11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.0.0 CS435 Introduction to Big Data 11/7/2018 CS435 Introduction to Big Data - FALL 2018 W12.B.1 FAQs Deadline of the Programming Assignment 3
More informationFast Data apps with Alpakka Kafka connector and Akka Streams. Sean Glover,
Fast Data apps with Alpakka Kafka connector and Akka Streams Sean Glover, Lightbend @seg1o Who am I? I m Sean Glover Principal Engineer at Lightbend Member of the Fast Data Platform team Organizer of Scala
More informationTrends in Data Protection and Restoration Technologies. Mike Fishman, EMC 2 Corporation
Trends in Data Protection and Restoration Technologies Mike Fishman, EMC 2 Corporation SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member
More informationEvolution of an Apache Spark Architecture for Processing Game Data
Evolution of an Apache Spark Architecture for Processing Game Data Nick Afshartous WB Analytics Platform May 17 th 2017 May 17 th, 2017 About Me nafshartous@wbgames.com WB Analytics Core Platform Lead
More informationDeveloping Microsoft Azure Solutions (70-532) Syllabus
Developing Microsoft Azure Solutions (70-532) Syllabus Cloud Computing Introduction What is Cloud Computing Cloud Characteristics Cloud Computing Service Models Deployment Models in Cloud Computing Advantages
More information