How we built a highly scalable Machine Learning platform using Apache Mesos

Similar documents
Building a Data-Friendly Platform for a Data- Driven Future

Using DC/OS for Continuous Delivery

Advanced Continuous Delivery Strategies for Containerized Applications Using DC/OS

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.

Deploying Applications on DC/OS

利用 Mesos 打造高延展性 Container 環境. Frank, Microsoft MTC

Designing MQ deployments for the cloud generation

Container 2.0. Container: check! But what about persistent data, big data or fast data?!

CONTINUOUS DELIVERY WITH DC/OS AND JENKINS

Lessons Learned: Deploying Microservices Software Product in Customer Environments Mark Galpin, Solution Architect, JFrog, Inc.

Serverless Architecture Hochskalierbare Anwendungen ohne Server. Sascha Möllering, Solutions Architect

Towards a Real- time Processing Pipeline: Running Apache Flink on AWS

Migrating massive monitoring to Bigtable without downtime. Martin Parm, Infrastructure Engineer for Monitoring

Document Sub Title. Yotpo. Technical Overview 07/18/ Yotpo

Distributed CI: Scaling Jenkins on Mesos and Marathon. Roger Ignazio Puppet Labs, Inc. MesosCon 2015 Seattle, WA

Microservices with Red Hat. JBoss Fuse

How to Keep UP Through Digital Transformation with Next-Generation App Development

@unterstein #bedcon. Operating microservices with Apache Mesos and DC/OS

Scale your Docker containers with Mesos

@joerg_schad Nightmares of a Container Orchestration System

Exploring Cloud Security, Operational Visibility & Elastic Datacenters. Kiran Mohandas Consulting Engineer

Monitoring system for geographically distributed datacenters based on Openstack. Gioacchino Vino

Building/Running Distributed Systems with Apache Mesos

Seagull: A distributed, fault tolerant, concurrent task runner. Sagar Patwardhan

STATE OF MODERN APPLICATIONS IN THE CLOUD

Using the SDACK Architecture to Build a Big Data Product. Yu-hsin Yeh (Evans Ye) Apache Big Data NA 2016 Vancouver

Apache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved.

The Art of Container Monitoring. Derek Chen

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

Designing Fault-Tolerant Applications

Monitor your containers with the Elastic Stack. Monica Sarbu

How to re-invent your IT Architecture. André Christ, Co-CEO LeanIX

Performance Monitoring and Management of Microservices on Docker Ecosystem

Using AWS to Build a Large Scale Dockerized Microservices Architecture. Dr. Oliver Wahlen moovel Group GmbH Frankfurt, 30.

Using Percona Monitoring and Management to Troubleshoot MySQL Performance Issues

Zero to Microservices in 5 minutes using Docker Containers. Mathew Lodge Weaveworks

Develop and test your Mobile App faster on AWS

Cloud & container monitoring , Lars Michelsen Check_MK Conference #4

CIT 668: System Architecture. Amazon Web Services

Containers, Serverless and Functions in a nutshell. Eugene Fedorenko

Data pipelines with PostgreSQL & Kafka

Scalable Streaming Analytics

Monitor your infrastructure with the Elastic Beats. Monica Sarbu

DATA SCIENCE USING SPARK: AN INTRODUCTION

Ruby in the Sky with Diamonds. August, 2014 Sao Paulo, Brazil

Performance Testing in a Containerized World. Paola Rossaro

Processing of big data with Apache Spark

HOW TO PLAN & EXECUTE A SUCCESSFUL CLOUD MIGRATION

Open-Falcon A Distributed and High-Performance Monitoring System. Yao-Wei Ou & Lai Wei 2017/05/22

Pulsar. Realtime Analytics At Scale. Wang Xinglang

DISTRIBUTED SYSTEMS [COMP9243] Lecture 8a: Cloud Computing WHAT IS CLOUD COMPUTING? 2. Slide 3. Slide 1. Why is it called Cloud?

Managing Data at Scale: Microservices and Events. Randy linkedin.com/in/randyshoup

Kubernetes: Twelve KeyFeatures

Ingest. David Pilato, Developer Evangelist Paris, 31 Janvier 2017

Ingest. Aaron Mildenstein, Consulting Architect Tokyo Dec 14, 2017

Qualys Cloud Platform

CONTINUOUS DELIVERY WITH MESOS, DC/OS AND JENKINS

Personal Statement. Skillset I MongoDB / Cassandra / Redis / CouchDB. My name is Dale-Kurt Murray. I'm a Solutiof

WHITEPAPER. Embracing Containers & Microservices for future-proof application modernization

Innovatus Technologies

Fluentd + MongoDB + Spark = Awesome Sauce

Zombie Apocalypse Workshop

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Microservices without the Servers: AWS Lambda in Action

Top five Docker performance tips

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

Rails on HBase. Zachary Pinter and Tony Hillerson RailsConf 2011

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Microsoft Big Data and Hadoop

Cloud Computing & Visualization

Cloud Monitoring as a Service. Built On Machine Learning

Prometheus. A Next Generation Monitoring System. Brian Brazil Founder

Evolving Prometheus for the Cloud Native World. Brian Brazil Founder

The Emergence of the Datacenter Developer. Tobi Knaup, Co-Founder & CTO at

Running Databases in Containers.

Getting Started With Serverless: Key Use Cases & Design Patterns

PARTLY CLOUDY DESIGN & DEVELOPMENT OF A HYBRID CLOUD SYSTEM

Microservices Lessons Learned From a Startup Perspective

Mesosphere and the Enterprise: Run Your Applications on Apache Mesos. Steve Wong Open Source Engineer {code} by Dell

DEVOPS COURSE CONTENT

Getting Started With Amazon EC2 Container Service

Big Data Architect.

Intro Cassandra. Adelaide Big Data Meetup.

Webinar Series TMIP VISION

HDInsight > Hadoop. October 12, 2017

PostgreSQL migration from AWS RDS to EC2

Whooo s calling Whooo?

At Course Completion Prepares you as per certification requirements for AWS Developer Associate.

Microservices at Netflix Scale. First Principles, Tradeoffs, Lessons Learned Ruslan

When (and how) to move applications from VMware to Cisco Metacloud

BEYOND CLOUD HOSTING. Andrew Melck, Regional Manager DACH,

Managing IoT and Time Series Data with Amazon ElastiCache for Redis

AWS Solution Architecture Patterns

Streaming analytics better than batch - when and why? _Adam Kawa - Dawid Wysakowicz_

[Docker] Containerization

Elastic Efficient Execution of Varied Containers. Sharma Podila Nov 7th 2016, QCon San Francisco

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Amazon EC2 Container Service: Manage Docker-Enabled Apps in EC2

Scalable Web Programming. CS193S - Jan Jannink - 2/25/10

Transcription:

How we built a highly scalable Machine Learning platform using Apache Mesos Daniel Sârbe Development Manager, BigData and Cloud Machine Translation @ SDL Co-founder of BigData/DataScience Meetup Cluj, Romania October 2017 MesosCon Europe @danielsarbe v1.07

Agenda 1 2 3 4 5 6 7 Some Machine Translation context Why we needed a new platform? Architecture overview of our current SaaS solution New platform using micro-services Lesson Learned Demo Scaling micro-services to handle traffic increase Q&A

@danielsarbe Software development background (13+ years) Passionate about people and technology Interest in anything that is related to Scalability, BigData, Machine Learning Currently leading the BigData and Cloud Machine Translation group at SDL Cluj, Romania Co-founder of BigData/DataScience Meetup Cluj 1200+ members 25+ events organized meetups with more 100+ participants workshops with 30 people

2000 2005 2015 Machine Translation quality improvements Technological Progress

Generic Industry Vertical Custom Trained 2000 2005 2015 Machine Translation quality improvements Technological Progress Customization

Generic 2000 Industry Vertical 2005 Custom Trained 2015 Machine Translation quality improvements Technological Progress Customization On average, customized engines handled industry terminology 24% better than standard genetic MT

Adaptive Machine Translation idea Machine translation Machine translation + AdaptiveMT Machine learns during the statistical training process Machine learns during the statistical training process Machine does not learn or improve during the translation process Machine learns from user feedback during the translation process

Learning from Post-Edits Update operation Source MT Output Post-Edit

Update Example - EngFra An update of one of the statistical MT models, the translation model Source No further requirements are needed MT Pas d'autres exigences sont requises PE Aucune exigence supplémentaire n est nécessaire

Update Example: Translation Model Adaptation No further requirements are needed Pas d'autres exigences sont requises Aucune exigence supplémentaire n est nécessaire Good New Translations needed -> requises nécessaire further -> autres supplémentaire no further requirements are -> aucune exigence supplémentaire n est Bad New Translations are -> n est requirements -> exigence no -> aucune Statistical features help choose good rules, and decide when to use them

AMT MT AdaptiveMT progressive impact 0-50 Segments 0 100 200 300 400 500 600 700 800-100 7% -150-200 -250-300 -350 To PE 300 segments with AdaptiveMT, we need 250 fewer edits than without AdaptiveMT -400 Edits

Second use-case Neural MT Rule-based define and build the model by hand Traditional SMT define the model by hand then statistically learn it from data Neural MT design architecture to automatically discover, define, and learn the models from data

Neural MT Uses a deep learning architecture capable of learning the meaning of the text fluent and naturally sounding translation output Neural MT shows significant translation quality improvement over SMT captures both local and global dependencies and can handle longrange word reordering e.g. we observe an impressive 30% improvement on English- German

Neural MT in the Cloud To accommodate NMT in the Cloud we need: new hardware: GPUs flexible infrastructure (new&old engines) break the old implementation(independent services) new modern API (for new clients onboarding)

How can we do this?

What we had before? Our Legacy SaaS solution Mature, Iterativelydeveloped platform >15 billion words translated in average month >200 million translation request/month No P1/P2 Bugs in last 24 months Availability: 99.9x The only large-scale, commercial-grade MT solution other than Google and Microsoft

What we had before? Redundant flows and services based on outdated requirements Translation engines are not modular and is difficult to add new functionality Scaling up-down requires manual intervention and allocation of new VMs Overall, monolithic design that is hard to adopt for new use-cases

A new platform

What do we want to achieve? Key Concepts Scalability Latency Independent (micro-)services Elasticity (auto-scaling) Fault-tolerance & robustness Infrastructure automation Reliable monitoring and alerts

Architecture evolution ebay 5 th generation today Monolithic Perl Monolithic C++ Java microservices Twitter 3 rd generation today Monolithic Rails JS / Rails / Scala microservices Amazon N th generation today Monolithic C++ Java / Scala microservices SDL MT 3 rd generation today Monolithic Rails -> Monolithic Java -> microservices

Our Technology Stack New microservices platform Mesos Cluster manager Scaling Marathon HBase Hadoop NoSQL, column-oriented db Storage ELK stack Grafana Log aggregation, search server(indexing & querying logs) Beautiful metric & analytic dashboards Monitoring and alerts Kafka Messaging system Latency, Faulttolerance Zookeeper Centralized coordination service OpenTSDB Zabbix Real-time metrics Monitoring solution Protocol Buffers Serializing structured data developed by Google Ansible IT automation Infrastructure automation, AWS Cloud Computing Services Elasticity Docker Containerization platform Microservices DropWizard SpringBoot Java8 REST application bootstrap framework Programming language

Lessons Learned No regrets in life. Just lesson learned.

1. Cost efficient Dev/QA/Clone clusters ~40% cost issues found only in aws-qa prod clone has the same IPs/confs as prod AWS periodical cleanup email alerts on no of instances running r3.4xlarge -> r4.4xlarge ($1.33/h -> $1.06/h -no ephemeral 320SSD) ElasticBlockStore(EBS) vs ElasticFileSystem(EFS) reserved instances

2. Security ssh via a single aws bastion machine gpg encryption of confs no clear passwords in git restrict access to specific envs secure Marathon/Kibana/HAProxyUI AWS termination protection

3. Platform high availability Infrastructure allocation +1 node/cluster one instance decommissioned/stopped (AWS EC2/human error) Microservices 2 instances/micoservice unique constraints should be set Test with 5x-10x more traffic Early monitoring on all fronts infrastructure - Zabbix app metrics - OpenTSDB usage stats - ELK external Pingdom + PagerDuty

4. Resource allocation Memory limitations for containers Marathon memory settings were not enforced on container level reported container memory = host memory enforce Xmx, Xms (OOM) crash dumps (mount partitions to have a crash dump) CPU weights(marathon) reduced 1 to 0.1 overprovision

5. Releases are not as easy as expected No downtime releases - simulate 2-4 times the releases in prod-clone - scripts to monitor downtime during deployment - connection draining (killed by default) - messages compatibility (using protobuff) Ansible-ize the manual steps - prod-clone commands run on prod (gpg to fix) - 0 to cluster (in x min)

6. Investigations become more complex Logs file based logs vs centralized logs aggregated logs into ELK(requestId) using stdout -> no disk space on mesos-slaves (disable) under high load the gelf appender caused slowdowns move to log4j-kafka Metrics in OpenTSDB application-specific-metrics Correlation between various sources

7. Independent microservices Keep microservices as independent and as small as possible 30+ microservices a challenge, especially for legacy code(unit tests) continuous refactoring for all microservices

8. Periodically reevaluate assumptions Follow user behavior over time users behavior is different from what we expected API flow changes (v2/v3 for some APIs) speed is more important on some flows(sync) Scale the microservices based on usage

Future improvements

Future improvements Maintenance and evolution of the platform Periodical upgrades of the stack Improve monitoring Auto-scaling based on usage patterns use of aws-spot instances Move all components in Mesos(DC/OS) HBase ElasticSearch Kafka HDFS

Demo Time! Scaling micro-services to handle traffic increase

Questions? @danielsarbe