Building a high-performance, scalable ML & NLP platform with Python. Sheer El Showk CTO, Lore Ai

Transcription:

Building a high-performance, scalable ML & NLP platform with Python Sheer El Showk CTO, Lore Ai www.lore.ai

Lore is a small startup focused on developing and applying machine-learning techniques to solve a wide array of business problems. We are product-focused but we also provide consulting services (because building and selling products is hard!)

We are self-funded, so we need to be cost-effective but also flexible and scalable:
- Need to be ready to pivot if a product is not working
- Need to support both consulting work and product development

Over time we developed a complex but modular stack:
- ML & NLP tools/services
- Standard web stack: DB, caching, API servers, web servers
- Fully scalable (all components can be parallelized)

This has allowed us to develop & maintain several products:
- A chatbot
- A content-based recommender engine
- Our flagship product, Salient, a powerful document analysis engine

The Task

[Diagram: document-processing pipeline. Labels: Client; Upload Docs; Convert & Parse; NLP Processing; NLP Pipeline (grammar parsing, parts-of-speech, entities such as people); Annotated Documents; Entity Linking; Client Data; ETL; Client Knowledge Graph; Knowledge Graph; ML/NLP Services (question answering, document similarity, information extraction); Web/API Servers.]

Applications
- Contract management: automate labelling of a database of contracts (PDFs, Word, OCR)
  - Contract type, expiration date, other parties, important clauses, etc.
- Analyze news
  - Build a database of product releases or funding rounds from press releases
  - Find companies fitting a profile: competitors, acquisitions, investors, etc.
- Patent and policy document analysis
  - Build an ontology and network linking concepts and content

The Challenges

Lore's stack has all the normal challenges of developing a large web/business application in Python. Adding ML/NLP brings further challenges whose solutions are less well known.

Scalability
- Some use cases require processing millions of documents
  - E.g. we have a news DB with 4 million articles in it
- Python is considered slow and not scalable (hard to parallelize)
  - Java is the preferred language of the enterprise
- ML brings new problems:
  - Models need to be very performant
  - Parallelizing is harder: need to share models/data between servers

Maintainability
- Mixing many different technologies
  - ML/compute stack: theano, gensim, spacy, etc.
  - Web stack: django, celery, mysql, etc.
- Multiple servers/services can mean expensive/difficult devops.
- A large code base with different kinds of code (JS/web vs ML/NLP) makes it hard to coordinate between developers.

Flexibility
- Business requirements change very quickly
- Need to provide a wide range of services from the same platform
- Need to be able to easily upgrade/improve parts of the system
- Agility often requires integrating existing off-the-shelf solutions to solve non-core tasks.
- Must be easy to deploy, or to scale a deployment.

Where we Started

Where we started...
- Monolithic Python-based server + PHP UI
- MySQL DB backend
- Hard-coded dependence on target document set
- Basic off-the-shelf NLP tools (NLTK, etc.)
- Serial pipeline
- No redundancy or scalability
- Hard to install/maintain (all manual)

A few pivots and consulting gigs later...

The new stack...
- Micro-services architecture: very modular, re-usable
- Kubernetes & docker for devops
- Rapid deployment (even cluster)
- Celery for parallelization
- Distributed data stores: Mongo, Minio, Redis (soon ElasticSearch)

[Diagram: the new stack. Labels: Load Balancer; Django Web; Django API; Celery Workers; Broker; Cron-jobs; Web scrapers; Logging Service; NLP Services; In-house NLP engine; Input Processor; Pipeline; Entity Generator; KG Service; MySQL; Mongo Cluster; Minio (distributed blob storage); Redis (distributed caching).]

What we learned along the way...

Lessons
- Devops: kubernetes, docker, docker-compose
- Parallelization: celery, dask, joblib, ...
- Modularity: (stateless) microservices architecture
- Persistence: scalable distributed caching/persistence (redis, mongo, ES, ...)
- Future-proofing: wrapper patterns
- Performance: cython or pythran for bottleneck code

Docker & Kubernetes
Dockerize early, dockerize often
1. Uniform environment between devs
   a. Use ipython to develop in stack
2. Easy to add services
   a. Solve dev problems using devops!
3. Fast, consistent deploy to production
   a. Deploying our stack is rate-limited by download speed :-)
4. Cut costs by running on bare metal
   a. Run your own AWS with k8s

Type                     AWS (reserved)          Hetzner (bare metal)
16 threads, 64 GB RAM    $360/mo (no storage)    $70/mo (1 TB SSD)
GPU server               $450/mo (K80)           $120/mo (1080)

Celery for Parallelization
10-minute parallelization
- Many interesting options for parallelization:
  - Dask.distributed, joblib, celery, ipython
- Celery is complicated: it requires Redis/RabbitMQ, etc.
  - Very easy with docker/kubernetes
  - NOTE: Redis has much lower latency
- Transparent parallelization
  - Wrap entry-point functions in a task and Bob's your uncle.
  - Lots of fancy features, but you don't need to use them until you need them.
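The "wrap an entry-point function in a task" idea can be sketched without a broker. This is a minimal stand-in that uses the standard library's concurrent.futures rather than Celery itself; `classify_document`, `run_parallel` and the sample texts are hypothetical names made up for illustration. With Celery, the same function would instead be registered as a task and dispatched to workers through Redis or RabbitMQ.

```python
from concurrent.futures import ThreadPoolExecutor


def classify_document(text: str) -> dict:
    """Hypothetical entry-point function: the unit of work to parallelize."""
    words = text.split()
    return {"n_words": len(words), "is_long": len(words) > 5}


def run_parallel(texts, max_workers=4):
    """Fan the entry point out over a worker pool; results keep input order."""
    # Threads keep this sketch runnable anywhere. CPU-bound work would need
    # processes or a distributed task queue (the role Celery plays in the talk).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify_document, texts))


docs = ["short doc", "a somewhat longer document with many more words"]
results = run_parallel(docs)
```

The calling code only sees `run_parallel`, so the pool could later be swapped for a real task queue without touching the entry-point function.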

Stateless Microservices
Unlimited Scalability and Flexibility
- High-level analog of an Abstract Base Class
- Split codebase into independent microservices
  - Each microservice handles one kind of thing: NLP, logging, DB persistence, etc.
  - Wrap underlying service abstractly: redis -> kv_cache, minio -> kv_store, mysql -> sql_db
  - Applications can include multiple microservices talking to each other
- Microservices should be stateless
  - All state (caching, persistence, etc.) should be handled by external services (DB, redis, etc.)
- Microservices encapsulate the underlying implementation of a service
- Microservices are not micro-servers: services are just APIs within an API.
  - Should not be tied to an interface (REST, JSON, etc.)
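A minimal sketch of the "abstract wrapper plus stateless service" idea, assuming a key-value cache interface. The class names (`KVCache`, `InMemoryKVCache`, `UppercaseService`) are hypothetical; only the kv_cache naming comes from the slide. The in-memory backend is a stand-in: a Redis-backed class would implement the same two methods.

```python
from abc import ABC, abstractmethod
from typing import Optional


class KVCache(ABC):
    """Abstract interface; callers never see the backing store."""

    @abstractmethod
    def get(self, key: str) -> Optional[str]: ...

    @abstractmethod
    def set(self, key: str, value: str) -> None: ...


class InMemoryKVCache(KVCache):
    """Stand-in backend for this sketch (not a distributed cache)."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class UppercaseService:
    """Hypothetical stateless service: all state lives in the injected cache."""

    def __init__(self, cache: KVCache) -> None:
        self._cache = cache

    def process(self, text: str) -> str:
        cached = self._cache.get(text)
        if cached is not None:
            return cached
        result = text.upper()
        self._cache.set(text, result)
        return result


shared_cache = InMemoryKVCache()
worker_a = UppercaseService(shared_cache)
worker_b = UppercaseService(shared_cache)
first = worker_a.process("hello")
second = worker_b.process("hello")  # served from the shared cache
```

Because neither worker keeps results of its own, any number of identical workers can be started or stopped without losing data.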

Distributed Persistence
Let others do the hard work
- All persistence is handled by layers of persistence services.
- Persistence services are accessed via wrappers (see next slide) for easy replacement.
- State of services is managed by a distributed cache.
- Layering helps performance.
  - Neural network ML models can be shared by workers using e.g. Redis (caching) and Minio (persistence).
- Different types of persistence for different problems: Mongo stores JSON, Minio stores blobs, MySQL stores tables.

[Diagram labels: NLP Service (several instances); create/train; save; cache; Minio (distributed blob storage); Redis (distributed caching).]
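The layering described above might look like the following sketch: a worker checks the fast cache first, then the blob store, and trains only if both miss. Everything here is an illustrative assumption: `DictStore` is an in-memory stand-in for the Redis and Minio wrappers, and `train_model`, `load_model` and the `calls` list (used only to show how often training runs) are hypothetical.

```python
import pickle


class DictStore:
    """In-memory stand-in for a key-value service (cache or blob store)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value: bytes):
        self._data[key] = value


def train_model():
    """Hypothetical expensive step; returns a toy 'model'."""
    return {"weights": [0.1, 0.2, 0.3]}


def load_model(name, cache, blob_store, calls):
    """Try the fast cache, then the blob store, and only then train."""
    data = cache.get(name)
    if data is None:
        data = blob_store.get(name)
        if data is None:
            calls.append("train")
            data = pickle.dumps(train_model())
            blob_store.put(name, data)  # persist for other workers
        cache.put(name, data)           # warm the cache for next time
    return pickle.loads(data)


cache, blob_store, calls = DictStore(), DictStore(), []
model_1 = load_model("classifier", cache, blob_store, calls)
model_2 = load_model("classifier", cache, blob_store, calls)  # cache hit
```

Only the first call trains; the second is answered from the cache, which is the behaviour a second worker would see once the first has saved the model.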

Design Patterns: Wrapper Pattern
Examples
- Wrap access to all external services (db, logging, etc.)
  - Local/centralized logging & config with just a few lines of code.
- Easy to swap in new versions
  - Migrating from NLTK to better (in-house) NLP
  - Restricting user access to DB by transparently replacing tables with views.
  - MySQL vs Postgres, etc.
- Easily add logging, performance measurement, etc.
- Insert custom logic to modify the behavior, e.g.
  - Manually shard/replicate DBs
  - Add new logging destinations

This pattern has saved us countless hours of refactoring!
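As one illustration of the wrapper pattern, the sketch below adds logging and timing around a database call without changing the calling code. `FakeDB` and `LoggedDB` are hypothetical names; `FakeDB` stands in for a real MySQL or Postgres client, whose API is not shown in the slides.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("db_wrapper")


class FakeDB:
    """Hypothetical backend; stands in for a real database client."""

    def query(self, sql):
        return [("row", sql)]


class LoggedDB:
    """Wrapper exposing the same `query` method, plus logging and timing."""

    def __init__(self, backend):
        self._backend = backend
        self.calls = 0

    def query(self, sql):
        start = time.perf_counter()
        try:
            return self._backend.query(sql)
        finally:
            # Runs whether the backend call succeeds or raises.
            self.calls += 1
            elapsed_ms = (time.perf_counter() - start) * 1000
            log.info("query took %.3f ms: %s", elapsed_ms, sql)


db = LoggedDB(FakeDB())  # swapping the backend touches only this line
rows = db.query("SELECT 1")
```

Sharding, access restrictions or a new logging destination could be added in the same place, since application code only ever talks to the wrapper.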

Performance
- Use native types correctly: set vs list, iteritems vs items (in Python 2; in Python 3, items() already returns a lazy view)
  - Pythonic code is almost always faster.
- Use %timeit to test code snippets everywhere
- Beware of hidden memory allocation
- For critical bottlenecks use:
  - Pythran (very easy but limited coverage)
  - Cython (harder, but more flexible)
- For ML use generators to stream data from disk/db (see gensim).
- Know your times
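Two of these tips are easy to try in plain Python. The sketch below uses the standard `timeit` module (the script equivalent of IPython's `%timeit`) to compare membership tests on a list and a set, then streams a toy corpus through a generator. The data and the `stream_tokens` helper are made up for illustration; the timings depend on the machine, so none are quoted here.

```python
import timeit

items = list(range(10_000))
as_list = items
as_set = set(items)

# Membership test: O(n) scan for a list, O(1) hash lookup on average for a set.
list_time = timeit.timeit(lambda: 9_999 in as_list, number=1_000)
set_time = timeit.timeit(lambda: 9_999 in as_set, number=1_000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")


def stream_tokens(lines):
    """Yield one tokenized document at a time instead of building a full list."""
    for line in lines:
        yield line.lower().split()


# `corpus` could equally be an open file or a DB cursor; only one document
# is held in memory at a time while counting.
corpus = ["The quick brown fox", "jumps over the lazy dog"]
total_tokens = sum(len(tokens) for tokens in stream_tokens(corpus))
```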

Performance (II): Example
Classifying (short) documents
- Initial rate: 2k docs/s
- Initial profiling:
  - DB read rate: 250k docs/s
  - Feature generation: 2k docs/s
  - Classification: 10k docs/s
- Feature generation involves for-loops & complex logic.

Fixes
- Refactored feature generation:
  - Extract features in list comprehensions
  - Convert to vectors in Pythran code
  - New times: Python part 10k docs/s, Pythran 40k docs/s
- Fix memory allocation in classifier: 50k docs/s
- New times: ~10k docs/s
- Going forward: cache features in Mongo?
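The profiling numbers can be sanity-checked with a little arithmetic. The sketch below assumes the stages run one after another on a single worker, so per-document times add; that serial model is an assumption, not something the slides state. Under it, the estimates come out at roughly 1.7k docs/s before the fixes and 6.7k docs/s after, the same order of magnitude as the reported 2k and ~10k.

```python
def serial_rate(stage_rates):
    """Overall docs/s when every document passes through each stage in turn."""
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


# Stage rates in docs/s, taken from the slide.
# Before: DB read, feature generation, classification.
before = serial_rate([250_000, 2_000, 10_000])
# After: DB read, Python feature part, Pythran part, fixed classifier.
after = serial_rate([250_000, 10_000, 40_000, 50_000])

print(f"before: {before:.0f} docs/s, after: {after:.0f} docs/s")
```

In both cases the slowest stage dominates, which is why the effort went into feature generation first.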

Further Reading...
- Radim Řehůřek (gensim): Does Python Stand a Chance in Today's World of Data Science (https://youtu.be/jfbgt3kjwfq)
- High Performance Python (O'Reilly): Lessons from the Field (chapter 12)
- Fluent Python (O'Reilly)
- Latency Numbers Every Programmer Should Know: https://gist.github.com/jboner/2841832

Bye!

Open Problems
- Cores vs RAM
  - Many models require lots of RAM (models can be GBs in size).
  - Models are read-only, but because of the Python GIL it is hard to share memory between processes/threads.
  - Use the sharedmem module?
- Generating models from terabytes of data?
  - Embedding models can capture interesting information from huge amounts of data.
  - They train very quickly (millions of words/sec per server).
  - How can we distribute parameters/data between a large number of workers?
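One possible direction for the read-only model question is a named shared-memory block. The sketch below uses the standard library's `multiprocessing.shared_memory` (Python 3.8+), which is a different mechanism from the sharedmem module mentioned above and is shown only as an assumption about what could work. The toy `weights` list stands in for a large model, and both "sides" run in one process here to keep the example self-contained.

```python
import struct
from multiprocessing import shared_memory

weights = [0.25, 0.5, 0.75, 1.0]   # toy stand-in for a large model
fmt = f"{len(weights)}d"           # packed as C doubles

# "Owner" side: allocate a named block and copy the weights in once.
owner = shared_memory.SharedMemory(create=True, size=struct.calcsize(fmt))
struct.pack_into(fmt, owner.buf, 0, *weights)

# "Worker" side: attach by name (normally done from another process)
# and read back the same bytes the owner wrote.
worker = shared_memory.SharedMemory(name=owner.name)
loaded = list(struct.unpack_from(fmt, worker.buf, 0))

worker.close()
owner.close()
owner.unlink()  # free the block once no process needs it
```

Real models would more likely be exposed as array views over the buffer than unpacked into Python floats, but the attach-by-name step is the same.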