Migrating from Oracle to Espresso
David Max, Senior Software Engineer, LinkedIn
About LinkedIn New York Engineering
Located in the Empire State Building
Approximately 100 engineers and 1000 employees total
Multiple teams: front end, back end, and data science
About Me
David Max, Senior Software Engineer, LinkedIn
Software Engineer at LinkedIn NYC since 2015
Content Ingestion team
Office Hours: Thursday 11:30-12:00
www.linkedin.com/in/davidpmax/
What is Content Ingestion?
Babylonia: Content Ingestion
Video: "SATURN 2017 Keynote: Software is Details"
https://www.youtube.com/watch?v=ms3c9hz0brg
What is Content Ingestion?
Extracts metadata from web pages
Source of Truth for 3rd-party content
Also contains metadata for some public 1st-party content
Used by LinkedIn services for sharing, decorating, and embedding content
Data also feeds into content understanding and relevance models
Babylonia Datasets
[Diagram: Babylonia (Content Ingestion) produces an HDFS ETL dataset and Data Change Events]
Downstream and Upstream Datasets
[Diagram: Babylonia (Content Ingestion) feeds offline consumers through HDFS ETL and nearline consumers through Data Change Events]
Babylonia use of Oracle (before migration)
RDBMS: Relational Database Management System
Databus: Platform for streaming data change events to nearline consumers
Offline ETL to HDFS for offline consumers
Schema: Metadata extracted from each URL is stored in individual rows
Client: Babylonia is the main (but not the only) client that directly executes queries on the Oracle DB
Rest.li: Most online interaction with the dataset in Oracle goes through Babylonia's Rest.li API
What is Espresso?
Espresso is LinkedIn's strategic distributed, fault-tolerant NoSQL database that powers many of LinkedIn's services
~100 clusters in use*
~420TB of SoT data*
~2 million qps at peak load*
* as of August 1, 2017
What is Espresso?
NoSQL: Non-relational
Distributed: A single database can be distributed over a cluster of machines
Scalable: Clusters scale horizontally by adding more nodes
Document: A table is a container for documents of the same schema (defined in Avro)
Keys: Documents are indexed by key fields, which are defined in the table schema
Why Migrate?
Maintenance: Babylonia's Oracle tables required periodic jobs to be run that involved downtime for each server
Integration: Support for Espresso is integrated with other tools and systems at LinkedIn
Rest.li: Espresso's API is based on Rest.li, which makes it easier to treat Espresso endpoints like other LinkedIn Rest.li endpoints
Cost: Oracle is more expensive to run
Strategy: Espresso is the preferred platform at LinkedIn for data of this type
Support: The Espresso team is part of LinkedIn
Schema Evolution: Supported with zero downtime and no coordination with DBA teams
Data Formats (Oracle)
[Diagram: Rest.li endpoints exchange Pegasus objects with Babylonia (Content Ingestion); Oracle, the HDFS ETL (offline), and Oracle Databus events (nearline) all carry Oracle rows]
Complex transformation between Oracle format and Pegasus format
Pegasus and Avro
Pegasus and Avro schema definitions are very similar
Both can be used to generate Java objects with very similar interfaces
Pegasus schema can be used to auto-generate the Avro schema
[Diagram: Pegasus Schema and Avro Schema, each generating Java Objects]
Data Formats (Espresso)
[Diagram: Rest.li endpoints exchange Pegasus objects with Babylonia (Content Ingestion); Espresso, the HDFS ETL (offline), and Espresso Brooklin events (nearline) all carry Espresso Avro]
Simple transformation between Avro format and Pegasus format
Why Migrate? Schema Evolution
Oracle: ALTER TABLE
  Not tied to code deployment; need to coordinate with DBAs
  Schema change involves server downtime
  In practice, developers go to great lengths to avoid the hassle
  Schema accumulates tech debt
Espresso: Document schema auto-registration
  Schema changes are registered automatically as part of the Babylonia deployment process
  Backwards compatibility is enforced; existing data does not need to be transformed
  Avro schema is a more natural fit with the Rest.li Pegasus schema
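The backward-compatibility point can be illustrated with a minimal sketch. It uses plain Python dicts in place of real Avro schemas, and the field names (`url`, `title`, `language`) are hypothetical. The idea it shows is that a field added with a default lets documents written under the old schema be read without rewriting them; it is not Espresso's actual compatibility check.

```python
# Hypothetical schemas: v2 adds one optional field with a default value.
SCHEMA_V1 = {"fields": [{"name": "url"}, {"name": "title"}]}
SCHEMA_V2 = {
    "fields": [
        {"name": "url"},
        {"name": "title"},
        {"name": "language", "default": None},
    ]
}


def is_backward_compatible(old, new):
    """The new schema can read old data if every field it adds has a default."""
    old_names = {field["name"] for field in old["fields"]}
    added = [f for f in new["fields"] if f["name"] not in old_names]
    return all("default" in field for field in added)


def read_with_schema(document, reader_schema):
    """Project a stored document onto the reader schema, filling in defaults."""
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in document:
            result[name] = document[name]
        elif "default" in field:
            result[name] = field["default"]
        else:
            raise ValueError(f"missing required field: {name}")
    return result


# A document written under v1 is readable under v2 without being transformed.
old_document = {"url": "https://example.com/a", "title": "Example"}
upgraded_view = read_with_schema(old_document, SCHEMA_V2)
```

A schema that adds a field without a default would fail `is_backward_compatible`, which is the kind of change an enforced compatibility rule rejects at registration time.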
Goals for Migration Process
Zero downtime
Transparent to Rest.li clients
Give offline and nearline consumers time to migrate
Validate each step
Mirroring in real time
Pre-Migration State of Babylonia
[Diagram: Babylonia (Content Ingestion) backed by Oracle; HDFS ETL for offline consumers; Oracle Databus events for nearline consumers]
Pre-Migration State of Babylonia
[Diagram: other services make Rest.li calls to Babylonia's Rest.li endpoints, which are backed by Oracle; Oracle emits Databus events]
Pre-Migration Cleanup
Identify code that is tightly coupled to the database
Decide which code should be reimplemented for Espresso, and which code should be decoupled or eliminated
Reduce the number of code paths to migrate
The easiest lines of code to migrate are the lines of code that don't exist
[Diagram: other services make Rest.li calls to the Rest.li endpoints, backed by Oracle; Oracle Databus events]
Bootstrap Espresso
[Diagram: Oracle -> HDFS ETL (offline) -> Convert Job -> Avro Data File -> Espresso Bulk Loader -> Espresso]
Bootstrap Espresso
[Diagram: Oracle -> HDFS ETL -> Espresso]
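The convert step in the bootstrap can be sketched as a function that turns flat rows from the ETL snapshot into keyed documents ready for bulk loading. This is only an illustration: the column names, the field names, and the `migrationControl` tag value are hypothetical, and plain dicts stand in for Avro records.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

# Hypothetical mapping from Oracle column names to document field names.
COLUMN_TO_FIELD = {"URL": "url", "TITLE": "title", "DESCRIPTION": "description"}


def row_to_document(row: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Convert one flat row into a (key, document) pair, dropping NULL columns."""
    document = {
        field: row[column]
        for column, field in COLUMN_TO_FIELD.items()
        if row.get(column) is not None
    }
    # Assumed marker recording that the bulk loader wrote this record.
    document["migrationControl"] = "BULK_LOADER"
    return document["url"], document


def convert(rows: Iterable[Dict[str, Any]]) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Convert a snapshot of rows, skipping any row that has no key."""
    for row in rows:
        if row.get("URL"):
            yield row_to_document(row)


snapshot = [
    {"URL": "https://example.com/a", "TITLE": "A", "DESCRIPTION": None},
    {"URL": None, "TITLE": "orphan", "DESCRIPTION": "no key"},
]
documents = dict(convert(snapshot))
```

Rows without a key are skipped here rather than failing the job; a real conversion would more likely report them for investigation.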
Databus Listener, Shadow Read Validation
[Diagram: Oracle Databus events feed a Databus Listener that writes to Espresso; shadow reads validate Espresso against Oracle]
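A minimal sketch of the two pieces on this slide, with in-memory dicts standing in for Oracle and Espresso. The class name, function name, and event shape are hypothetical. The listener replays change events onto the new store, and the shadow reader serves every read from the old store while recording any disagreement with the new one.

```python
class ShadowReader:
    """Serve reads from the primary store and validate the shadow store on the side."""

    def __init__(self, primary, shadow):
        self.primary = primary  # stand-in for Oracle (still the source of truth)
        self.shadow = shadow    # stand-in for Espresso
        self.mismatches = []

    def get(self, key):
        expected = self.primary.get(key)
        try:
            actual = self.shadow.get(key)
            if actual != expected:
                self.mismatches.append((key, expected, actual))
        except Exception as error:
            # A shadow failure must never affect what the caller sees.
            self.mismatches.append((key, expected, repr(error)))
        return expected


def apply_change_event(store, event):
    """Listener stand-in: replay one change event onto the new store."""
    if event["op"] == "DELETE":
        store.pop(event["key"], None)
    else:
        store[event["key"]] = event["value"]


oracle = {"a": {"title": "A"}}
espresso = {}
reader = ShadowReader(oracle, espresso)
```

Callers always receive the primary store's answer, so validation can run against live traffic without changing behavior for Rest.li clients.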
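Direct Writes to Espresso
[Diagram: Babylonia writes directly to Espresso in addition to Oracle; Oracle Databus events still feed the Databus Listener, which also writes to Espresso; shadow read validation continues]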
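One way to picture the direct-write stage is a writer that updates the source of truth first and then makes a best-effort write to the new store. The sketch below assumes that a failed secondary write is recorded rather than raised, on the theory that the Databus Listener will still deliver the change; the slides do not state the failure policy, and `DualWriter` is a hypothetical name.

```python
class DualWriter:
    """Write to the source of truth, then best-effort to the secondary store."""

    def __init__(self, oracle, espresso):
        self.oracle = oracle      # stand-in for Oracle (source of truth)
        self.espresso = espresso  # stand-in for Espresso
        self.failed = []

    def put(self, key, document):
        # The source-of-truth write must succeed; any error propagates to the caller.
        self.oracle[key] = document
        try:
            self.espresso[key] = dict(document)
        except Exception as error:
            # Assumed policy: record the failure and carry on.
            self.failed.append((key, repr(error)))


writer = DualWriter({}, {})
writer.put("a", {"title": "A"})
```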
Resolving Write Conflicts
Dual Write Conflict: Databus Listener and Babylonia updating the same record
Migration Control: optional field added to the schema indicating which process wrote the record: Bulk Loader, Databus Listener, or Babylonia
[Diagram: Oracle Databus events feed the Databus Listener, which writes to Espresso; Babylonia also writes directly to Espresso]
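The slide names the migration control field but not the rule applied to it. The sketch below assumes one plausible rule, purely for illustration: a write is applied only if its writer ranks at least as high as the writer recorded on the existing record, with Babylonia above the Databus Listener above the Bulk Loader. The ranking, the `migrationControl` field name, and the function names are all hypothetical.

```python
from enum import IntEnum


class Writer(IntEnum):
    # Assumed precedence: a higher value wins a conflict.
    BULK_LOADER = 1
    DATABUS_LISTENER = 2
    BABYLONIA = 3


def should_overwrite(existing, incoming_writer):
    """Allow the write unless a higher-ranked process already wrote the record."""
    if existing is None:
        return True
    current = existing.get("migrationControl")
    if current is None:
        return True
    return incoming_writer >= Writer[current]


def guarded_put(store, key, document, writer):
    """Write the document tagged with its writer; return False if it was skipped."""
    if not should_overwrite(store.get(key), writer):
        return False
    store[key] = dict(document, migrationControl=writer.name)
    return True


store = {}
guarded_put(store, "k", {"title": "from bulk load"}, Writer.BULK_LOADER)
guarded_put(store, "k", {"title": "from Babylonia"}, Writer.BABYLONIA)
late_event_applied = guarded_put(store, "k", {"title": "stale event"}, Writer.DATABUS_LISTENER)
```

Under this assumed rule, a delayed Databus event cannot clobber a record that Babylonia has already written directly.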
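Espresso New SoT
Dual writes continue; Oracle is deprecated
Direct read/write to Espresso
[Diagram: Babylonia reads and writes Espresso directly and still dual-writes to Oracle; Espresso emits Brooklin events while Oracle Databus events remain available]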
Oracle Turnoff
Direct read/write to Espresso
Espresso Brooklin events
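The migration stages in the preceding slides can be summarized as a routing table that says, for each stage, which store serves reads and which stores receive writes from Babylonia. This is a sketch: the stage names and the idea of driving the cutover from a single stage setting are assumptions, not something the slides specify.

```python
from enum import Enum


class Stage(Enum):
    SHADOW_VALIDATION = "shadow_validation"  # listener copies changes; reads are validated
    DIRECT_WRITES = "direct_writes"          # Babylonia also writes to Espresso
    ESPRESSO_SOT = "espresso_sot"            # Espresso is SoT; Oracle still dual-written
    ORACLE_OFF = "oracle_off"                # Oracle turned off


# For each stage: the store that serves reads, and the stores written by Babylonia
# (first entry is the source of truth).
ROUTING = {
    Stage.SHADOW_VALIDATION: {"read": "oracle", "write": ("oracle",)},
    Stage.DIRECT_WRITES: {"read": "oracle", "write": ("oracle", "espresso")},
    Stage.ESPRESSO_SOT: {"read": "espresso", "write": ("espresso", "oracle")},
    Stage.ORACLE_OFF: {"read": "espresso", "write": ("espresso",)},
}


def read_source(stage):
    return ROUTING[stage]["read"]


def write_targets(stage):
    return ROUTING[stage]["write"]
```

Keeping the stages in one table makes each step small and reversible: moving back a stage restores the previous read and write paths.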
Thank you