Migrating from Oracle to Espresso

David Max, Senior Software Engineer, LinkedIn

About LinkedIn New York Engineering
- Located in the Empire State Building
- Approximately 100 engineers and 1000 employees total
- Multiple teams: front end, back end, and data science

About Me
- David Max, Senior Software Engineer, LinkedIn
- Software Engineer at LinkedIn NYC since 2015
- Content Ingestion team
- Office Hours: Thursday 11:30-12:00
- www.linkedin.com/in/davidpmax/

What is Content Ingestion?
Babylonia (Content Ingestion):
- Extracts metadata from web pages
- Source of Truth for 3rd party content
- Also contains metadata for some public 1st party content
- Used by LinkedIn services for sharing, decorating, and embedding content
- Data also feeds into content understanding and relevance models

Example of extracted metadata:
url: https://www.youtube.com/watch?v=ms3c9hz0brg
title: "SATURN 2017 Keynote: Software is Details"
image: https://i.ytimg.com/vi/ms3c9hz0brg/hqdefault.jpg?sq poaymweyckgbef5ivfkriqkdcwgbfqaaieiyaxab\\u00 26rs=AOn4CLClwjQlBmMeoRCePtHaThN-qXRHqg

Babylonia Datasets (Downstream and Upstream)
- Offline: ETL to HDFS
- Near line: data change events

Babylonia use of Oracle (before migration)
- RDBMS: Relational Database Management System
- Databus: Platform for streaming data change events to nearline consumers
- Offline: ETL to HDFS for offline consumers
- Schema: Metadata extracted from each URL is stored in individual rows
- Client: Babylonia is the main (but not the only) client that directly executes queries on the Oracle DB
- Rest.li: Most online interaction with the dataset in Oracle goes through Babylonia's Rest.li API

What is Espresso?
Espresso is LinkedIn's strategic distributed, fault-tolerant NoSQL database that powers many of LinkedIn's services.
- ~100 clusters in use*
- ~420 TB of SoT data*
- ~2 million qps at peak load*
* As of August 1, 2017

What is Espresso?
- NoSQL: Non-relational
- Distributed: A single database can be distributed over a cluster of machines
- Scalable: Clusters scale horizontally by adding more nodes
- Document: A table is a container for documents of the same schema (defined in Avro)
- Keys: Documents are indexed by key fields, which are defined in the table schema

Why Migrate?
- Maintenance: Babylonia's Oracle tables required periodic jobs to be run that involved downtime for each server
- Integration: Support for Espresso is integrated with other tools and systems at LinkedIn
- Rest.li: Espresso's API is based on Rest.li, which makes it easier to treat Espresso endpoints like other LinkedIn Rest.li endpoints
- Cost: Oracle is more expensive to run
- Strategy: Espresso is the preferred platform at LinkedIn for data of this type
- Support: The Espresso team is part of LinkedIn
- Schema Evolution: Supported with zero downtime and no coordination with DBA teams

Data Formats (Oracle)
- Rest.li endpoints: Pegasus objects
- Babylonia (Content Ingestion): Pegasus data
- Oracle: Oracle rows
- HDFS ETL (offline): Oracle rows
- Oracle Databus events (near line): Oracle rows
- Complex transformation between Oracle format and Pegasus format

Pegasus and Avro
- Pegasus and Avro schema definitions are very similar
- Both can be used to generate Java objects with very similar interfaces
- The Pegasus schema can be used to auto-generate the Avro schema

Data Formats (Espresso)
- Rest.li endpoints: Pegasus objects
- Babylonia (Content Ingestion): Pegasus data
- Espresso: Espresso Avro
- HDFS ETL (offline): Espresso Avro
- Espresso Brooklin events (near line): Espresso Avro
- Simple transformation between Avro format and Pegasus format

Why Migrate? Schema Evolution

Oracle: ALTER TABLE
- Not tied to code deployment; need to coordinate with DBAs
- Schema change involves server downtime
- In practice, developers go to great lengths to avoid the hassle
- Schema accumulates tech debt

Espresso: Document schema auto-registration
- Schema changes are registered automatically as part of the Babylonia deployment process
- Backwards compatibility is enforced; existing data does not need to be transformed
- Avro schema is a more natural fit with the Rest.li Pegasus schema
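
To illustrate what "backwards compatibility is enforced" can mean for an Avro-style document schema, here is a minimal sketch of one rule only: a field added in a new schema version needs a default, so documents written under the old version can still be read. It is not Espresso's actual check. The schema name `IngestedContent`, its fields, and both helper functions are hypothetical, and type changes and removed fields are not examined.

```python
from typing import Any

Schema = dict[str, Any]


def fields_missing_defaults(old: Schema, new: Schema) -> list[str]:
    """Names of fields added in `new` that carry no default value."""
    old_names = {field["name"] for field in old["fields"]}
    return [
        field["name"]
        for field in new["fields"]
        if field["name"] not in old_names and "default" not in field
    ]


def can_read_old_documents(old: Schema, new: Schema) -> bool:
    """Simplified rule: every added field must have a default."""
    return not fields_missing_defaults(old, new)


# Hypothetical schema versions, written as Avro-style record definitions.
V1: Schema = {
    "type": "record",
    "name": "IngestedContent",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "title", "type": ["null", "string"], "default": None},
    ],
}

# Adds an optional field with a default: old documents remain readable.
V2_OK: Schema = {
    **V1,
    "fields": V1["fields"]
    + [{"name": "imageUrl", "type": ["null", "string"], "default": None}],
}

# Adds a required field without a default: old documents cannot supply it.
V2_BAD: Schema = {
    **V1,
    "fields": V1["fields"] + [{"name": "language", "type": "string"}],
}
```

Under this rule `V2_OK` would be accepted and `V2_BAD` rejected at registration time, which is why existing data never needs to be rewritten.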

Goals for Migration Process
- Zero downtime
- Transparent to Rest.li clients
- Give offline and nearline consumers time to migrate
- Validate each step
- Mirroring in real time

Pre-Migration State of Babylonia
- Oracle is loaded into HDFS via ETL for offline consumers
- Oracle Databus events feed nearline consumers
- Other services make Rest.li calls to Babylonia's Rest.li endpoints

Pre-Migration Cleanup
- Identify code that is tightly coupled to the Oracle database
- Decide which code should be reimplemented for Espresso, and which code should be decoupled or eliminated
- Reduce the number of code paths to migrate
- The easiest lines of code to migrate are the lines of code that don't exist

Bootstrap Espresso
- Oracle data is available offline in HDFS via ETL
- A Convert Job turns the offline data into an Avro data file
- The Espresso Bulk Loader loads the Avro data file into Espresso
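
The Convert Job step can be pictured as a row-to-document mapping. The sketch below is an assumption-laden illustration, not the actual job: the Oracle column names (`URL`, `TITLE`, `IMAGE_URL`), the document field names, and both functions are hypothetical, and plain dictionaries stand in for Avro records.

```python
from typing import Any, Iterable, Optional

Row = dict[str, Optional[str]]
Document = dict[str, Any]

# Hypothetical mapping from Oracle column names to document field names.
COLUMN_TO_FIELD = {"URL": "url", "TITLE": "title", "IMAGE_URL": "imageUrl"}


def convert_row(row: Row) -> Document:
    """Map one row to a document; the URL is treated as the required key."""
    if not row.get("URL"):
        raise ValueError("row has no URL to use as the document key")
    return {field: row.get(column) for column, field in COLUMN_TO_FIELD.items()}


def convert_rows(rows: Iterable[Row]) -> tuple[list[Document], list[Row]]:
    """Convert all rows, setting aside those that cannot be keyed."""
    documents: list[Document] = []
    rejected: list[Row] = []
    for row in rows:
        try:
            documents.append(convert_row(row))
        except ValueError:
            rejected.append(row)
    return documents, rejected
```

Keeping rejected rows instead of dropping them silently supports the "validate each step" goal: the count of rejects can be checked before the bulk load runs.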

Databus Listener, Shadow Read Validation
- Databus Listener: consumes Oracle Databus events and applies them to Espresso
- Shadow Read Validation: reads are still served from Oracle and are validated against Espresso
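
A shadow read can be sketched as follows, assuming Oracle remains the store that answers callers while Espresso is read on the side for comparison. The class name, the reader callables, and the counters are hypothetical; the talk does not describe the real validation code.

```python
import logging
from typing import Any, Callable, Optional

Record = dict[str, Any]
Reader = Callable[[str], Optional[Record]]

log = logging.getLogger("shadow_read")


class ShadowReadValidator:
    """Serve reads from the primary store and compare with the shadow store."""

    def __init__(self, primary: Reader, shadow: Reader) -> None:
        self.primary = primary  # assumed: the Oracle-backed read path
        self.shadow = shadow    # assumed: the Espresso-backed read path
        self.matches = 0
        self.mismatches = 0
        self.errors = 0

    def read(self, key: str) -> Optional[Record]:
        expected = self.primary(key)
        try:
            actual = self.shadow(key)
        except Exception:
            # A shadow failure must never affect the caller.
            self.errors += 1
            log.exception("shadow read failed for key %s", key)
            return expected
        if actual == expected:
            self.matches += 1
        else:
            self.mismatches += 1
            log.warning("shadow mismatch for key %s", key)
        return expected
```

For example, `ShadowReadValidator(oracle.get, espresso.get)` works with two in-memory dictionaries as stand-ins; the caller always receives the primary's answer, and the counters show how far the two stores have drifted apart.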

Direct Writes to Espresso
- Direct Write: Babylonia writes to Espresso directly
- Databus Listener: continues to apply Oracle Databus events to Espresso
- Shadow Read Validation continues

Resolving Write Conflicts
- Dual Write Conflict: Databus Listener and Babylonia updating the same record
- Migration Control: optional field added to the schema indicating which process wrote the record: Bulk Loader, Databus Listener, or Babylonia
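
The slide names the migration control field but not the rule applied to it. The sketch below shows one possible policy, purely as an assumption: the Databus Listener leaves a record alone when the stored copy was last written directly by Babylonia. The field name `migrationControl`, the enum values, and both functions are hypothetical.

```python
from enum import Enum
from typing import Any, Optional

Document = dict[str, Any]


class Writer(Enum):
    """The three processes that can write a record during migration."""
    BULK_LOADER = "BULK_LOADER"
    DATABUS_LISTENER = "DATABUS_LISTENER"
    BABYLONIA = "BABYLONIA"


def tag(document: Document, writer: Writer) -> Document:
    """Return a copy of the document stamped with the process that wrote it."""
    return {**document, "migrationControl": writer.value}


def listener_may_overwrite(existing: Optional[Document]) -> bool:
    """Assumed policy: the listener never clobbers a direct Babylonia write."""
    if existing is None:
        return True  # nothing stored yet, so there is nothing to conflict with
    # The field is optional, so an untagged record is treated as overwritable.
    return existing.get("migrationControl") != Writer.BABYLONIA.value
```

Because the field is optional, records written before it existed stay valid, in line with the backwards-compatibility rule for schema changes.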

Espresso New SoT
- Direct Read/Write: Babylonia reads from and writes to Espresso
- Dual Writes: Oracle (deprecated) still receives writes
- Oracle Databus events remain available alongside Espresso Brooklin events

Oracle Turnoff
- Direct Read/Write against Espresso
- Espresso Brooklin events

Thank you