Luigi Build Data Pipelines of batch jobs. - Pramod Toraskar
|
|
- Albert Welch
- 5 years ago
- Views:
Transcription
1 Luigi Build Data Pipelines of batch jobs - Pramod Toraskar
2 I am a Principal Solution Engineer & Pythonista with more than 8 years of work experience, Works for a Red Hat India an open source solutions company and community-powered approach to provide reliable and High-Performing Cloud, Virtualization, Storage, Linux, and Middleware Technologies. I am responsible for the design, architect, build, and implementation of scalable marketing process automation (Python, MongoDB) deployed to internal PaaS systems, including Openshift 3 (Docker/Kubernetes-based Container Application Platform). As Team/technical for Systems Scalability, a data engineering team which supports operational data pipelines. Also actively participate in open source community programs like Python and Data Science. Has a Diploma in Computer engineering from AISSMS Polytechnic & B.E. in Computer engineering from the MIT Pune. RH_PREStemp_light_v2_0816
3 DATA ENGINEERING FOR MARKETING AUTOMATION Modern marketing automation platforms include some built-in functionality for data flow, data manipulation, and business rule implementation. They are not necessarily built with programmers in mind, and pass up some functionality in favor of robustness and user-friendliness. As processes grew more complex, so did our implementation requirements. We needed to become more rigorous in our testing and version control, and expand our available toolkit to match business needs.
4 LEVERAGING OPENSHIFT AND LUIGI FOR DATA ENGINEERING WORK The data engineer team moving most of operational data pipelines into container-based software deployment and management product called OpenShift. In this process relied heavily on Luigi for workflow orchestration. We will walk through our reasons for moving marketing data pipelines into scripted workflows, our technology choices, a high-level overview of the solution architecture, and a brief overview of what pipelines we have implemented (and are planning to implement). How much data do we deal with? - 36 Millions contact data per year 2.6 to 3 Million contact data on average per month Monthly/daily/hourly reporting Business metric dashboards
5 TECHNOLOGY CHOICE Luigi - workflow engine
6 WHY OPENSHIFT? Moving to Openshift Container Application Platform v3. OS3 is a modern platform for building and deploying Docker containers using Kubernetes. This is very exciting for us as it removes some of the limitations we previously experienced, especially around memory management, job scheduling, and available frameworks. Also a bonus is the built-in Source-to-Image tool, which allowed a much easier migration for us. With S2I, our team just worries about writing scripts, not about building containers.
7 WHAT IS LUIGI Python module to help build complex pipelines. Created by Spotify Dependency Resolution Workflow management Data flow visualization Hadoop support built in Data Storage Clean, filter, join and aggregate data Cassandra Time Series data Luigi is used internally at Spotify to build complex data pipelines Luigi doesn t replace Hadoop, Scalding, Pig, Hive, Redshift. It orchestrates them.
8 LUIGI CONCEPTS Tasks Units of work that produce Outputs Can depend on one or more other tasks Is only run if all dependents are complete Are idempotent Entirely code-based Most other tools are gui-based or declarative and don t offer any abstraction with code you can build anything you want
9 DEFINING A DATA FLOW A bunch of data processing task with inter-dependencies
10 Dependencies Task Targets External API: Contacts Get Contacts S3://data/ contacts.json POST to DWM S3://data/ processed/ contacts.json
11 TASK class class GetContacts(luigi.Task): GetContacts(luigi.Task): def def output(self): output(self): pass pass def def requires(self): requires(self): pass pass def def run(self) run(self) pass pass luigi.run(main_task_cls=mytask) luigi.run(main_task_cls=mytask)
12 LUIGI TASK BREAKDOWN Parametrization To use data flows as command line tools
13 TARGETS & PARAMETERS Dependencies Task Get Contacts date:dateparameter api_method:parameter input() Targets External API: Contacts output() POST to API date:dateparameter chunk_ct:intparameter input() S3://data/ contacts.json output() S3://data/ processed/ contacts.json
14 TARGET Target is simply something that exists or doesn t exist Lots of ready-made targets in Luigi: For example local file a file in a local file system HDFS file S3 key/value target a file in a remote file system SFTP remote target a file in an Amazon S3 bucket SQL table row target a database row in a SQL database Amazon Redshift table row target ElasticSearch target
15 MULTIPLE DEPENDENCIES Dependencies Task input() External API: ELQ External API: HubSpot Get Contacts date:dateparameter api_method:parameter POST to API date:dateparameter chunk_ct:intparameter input() input() External API: Contacts output() input() S3://data/ contacts.json Targets output() S3://data/ processed/ contacts.json
16 DYNAMIC DEPENDENCIES class class LoadAllContact(Luigi.WrapperTask): LoadAllContact(Luigi.WrapperTask): date date == luigi.dateparameter() luigi.dateparameter() def def run(self): run(self): For For file_path file_path in in os.listdir( /data/api_contact_data/*.json ): os.listdir( /data/api_contact_data/*.json ): TransformContactAPIData(file_path) TransformContactAPIData(file_path)
17 WRAPPER TASK class class LoadAllContact(Luigi.WrapperTask): LoadAllContact(Luigi.WrapperTask): date date == luigi.dateparameter() luigi.dateparameter() def def run(self): run(self): yield yield GetContact(self.date) GetContact(self.date) yield yield SyncContact(self.date) SyncContact(self.date) yield yield LoadContactRules(self.date) LoadContactRules(self.date)
18 RUN WITH MULTIPLE WORKERS $ PYTHONPATH dataflow --workers 3 AggergateArtists --date 2018-W11 Streams (date= ) Streams (date= ) Streams (date= ) Streams (date= ) Streams (date= ) Streams (date= ) Streams (date= ) AggergateArtists (date= )
19 LARGE DATA FLOWS (Screenshot from web interface)
20 PROCESS SYNCHRONIZATION Prevents two identical tasks from running simultaneously luigid Simple task synchoronization Data flow 1 Common dependency Task Data flow 2
21 TIPS & TRICKS Save often Save the results of each step They may be useful later on Its super useful for debugging But be ok with regenerating when needed accidentally deleted massive output directory, but was easy ( though time consuming) to recreate only what was needed. Aim small miss small (code small retry small) Shoot for relatively small units of work The pipeline will be easier to understand If there is a task that takes a long time and might fail, easier to deal with
22 Idempotency think it, live it, love it Again, keep things small Write to somewhere else and don t update the source data Tasks should only be changing one thing(if possible) Use atomic writes (where possible) Parallelization can be your friend Luigi can parallelize your workflows But you need to tell it that you want that Default number of workers is 1 Use workers to specify more
23 THINGS WE MISSED OUT There are lots of task types which can be used which we haven t mentioned spark elasticsearch hive pig redis redshift salesforce S3 ecs mongodb mysql pyspark etc. Check out the luigi.contrib package ** Using a persistent task history database, you could train a simple k-nn classifier to predict how long a task will run, Then use this with the dependency graph to predict when any task will finish
24 LUIGI LIMITATIONS It wouldn t be fair not to mention some limitations with the current design: Less useful for near real-time pipelines or continuously running processes. Schedule a few thousand jobs. Doesn t support distributed Execution. Doesn t provide a way to trigger flows.
25 ONWARDS The docs: The mailing list: The source: m/forum/#!forum/luigiuser/ /luigi
26 THANK YOU! We re hiring Python data engineers!!
BIG DATA COURSE CONTENT
BIG DATA COURSE CONTENT [I] Get Started with Big Data Microsoft Professional Orientation: Big Data Duration: 12 hrs Course Content: Introduction Course Introduction Data Fundamentals Introduction to Data
More informationHadoop. Introduction / Overview
Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures
More informationTHE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES
1 THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES Vincent Garonne, Mario Lassnig, Martin Barisits, Thomas Beermann, Ralph Vigne, Cedric Serfon Vincent.Garonne@cern.ch ph-adp-ddm-lab@cern.ch XLDB
More informationSTATE OF MODERN APPLICATIONS IN THE CLOUD
STATE OF MODERN APPLICATIONS IN THE CLOUD 2017 Introduction The Rise of Modern Applications What is the Modern Application? Today s leading enterprises are striving to deliver high performance, highly
More informationThe Evolution of a Data Project
The Evolution of a Data Project The Evolution of a Data Project Python script The Evolution of a Data Project Python script SQL on live DB The Evolution of a Data Project Python script SQL on live DB SQL
More informationHadoop Development Introduction
Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand
More informationAWS Serverless Architecture Think Big
MAKING BIG DATA COME ALIVE AWS Serverless Architecture Think Big Garrett Holbrook, Data Engineer Feb 1 st, 2017 Agenda What is Think Big? Example Project Walkthrough AWS Serverless 2 Think Big, a Teradata
More informationUse Case: Scalable applications
Use Case: Scalable applications 1. Introduction A lot of companies are running (web) applications on a single machine, self hosted, in a datacenter close by or on premise. The hardware is often bought
More informationGo Faster: Containers, Platforms and the Path to Better Software Development (Including Live Demo)
RED HAT DAYS VANCOUVER Go Faster: Containers, Platforms and the Path to Better Software Development (Including Live Demo) Paul Armstrong Principal Solutions Architect Gerald Nunn Senior Middleware Solutions
More informationBig Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours
Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals
More informationMODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS
MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale
More informationScalable Tools - Part I Introduction to Scalable Tools
Scalable Tools - Part I Introduction to Scalable Tools Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, adisak@iastate.edu http://web.cs.iastate.edu/~adisak/mbds2018/ Scalable Tools session
More informationGabriel Villa. Architecting an Analytics Solution on AWS
Gabriel Villa Architecting an Analytics Solution on AWS Cloud and Data Architect Skilled leader, solution architect, and technical expert focusing primarily on Microsoft technologies and AWS. Passionate
More informationBlended Learning Outline: Cloudera Data Analyst Training (171219a)
Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills
More informationInnovatus Technologies
HADOOP 2.X BIGDATA ANALYTICS 1. Java Overview of Java Classes and Objects Garbage Collection and Modifiers Inheritance, Aggregation, Polymorphism Command line argument Abstract class and Interfaces String
More informationBlended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)
Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance
More informationArchitectural challenges for building a low latency, scalable multi-tenant data warehouse
Architectural challenges for building a low latency, scalable multi-tenant data warehouse Mataprasad Agrawal Solutions Architect, Services CTO 2017 Persistent Systems Ltd. All rights reserved. Our analytics
More informationHow to go serverless with AWS Lambda
How to go serverless with AWS Lambda Roman Plessl, nine (AWS Partner) Zürich, AWSomeDay 12. September 2018 About myself and nine Roman Plessl Working for nine as a Solution Architect, Consultant and Leader.
More informationActivator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.
Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success. ACTIVATORS Designed to give your team assistance when you need it most without
More informationWHITE PAPER. RedHat OpenShift Container Platform. Benefits: Abstract. 1.1 Introduction
WHITE PAPER RedHat OpenShift Container Platform Abstract Benefits: Applications are designed around smaller independent components called microservices. Elastic resources: Scale up or down quickly and
More informationCase Study: Aurea Software Goes Beyond the Limits of Amazon EBS to Run 200 Kubernetes Stateful Pods Per Host
Case Study: Aurea Software Goes Beyond the Limits of Amazon EBS to Run 200 Kubernetes Stateful Pods Per Host CHALLENGES Create a single, multi-tenant Kubernetes platform capable of handling databases workloads
More informationJenkins: AMPLab s Friendly Butler. He will build your projects so you don t have to!
Jenkins: AMPLab s Friendly Butler He will build your projects so you don t have to! What is Jenkins? Open source CI/CD/Build platform Used to build many, many open source software projects (including Spark
More informationProcessing of big data with Apache Spark
Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT
More informationRed Hat Roadmap for Containers and DevOps
Red Hat Roadmap for Containers and DevOps Brian Gracely, Director of Strategy Diogenes Rettori, Principal Product Manager Red Hat September, 2016 Digital Transformation Requires an evolution in... 2 APPLICATIONS
More informationMigrating massive monitoring to Bigtable without downtime. Martin Parm, Infrastructure Engineer for Monitoring
Migrating massive monitoring to Bigtable without downtime Martin Parm, Infrastructure Engineer for Monitoring This is a big deal. -- Nicholas Harteau/VP, Engineering & Infrastructure https://news.spotify.com/dk/2016/02/23/announcing-spotify-infrastructures-googley-future/
More informationOverview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::
Module Title Duration : Cloudera Data Analyst Training : 4 days Overview Take your knowledge to the next level Cloudera University s four-day data analyst training course will teach you to apply traditional
More informationWHITEPAPER. MemSQL Enterprise Feature List
WHITEPAPER MemSQL Enterprise Feature List 2017 MemSQL Enterprise Feature List DEPLOYMENT Provision and deploy MemSQL anywhere according to your desired cluster configuration. On-Premises: Maximize infrastructure
More informationIntro Cassandra. Adelaide Big Data Meetup.
Intro Cassandra Adelaide Big Data Meetup instaclustr.com @Instaclustr Who am I and what do I do? Alex Lourie Worked at Red Hat, Datastax and now Instaclustr We currently manage x10s nodes for various customers,
More informationPersonal Statement. Skillset I MongoDB / Cassandra / Redis / CouchDB. My name is Dale-Kurt Murray. I'm a Solutiof
My name is Dale-Kurt Murray. 'm a Solutiof +1 876 345 7375 Architect who loves new challenging probl :i "rite hello@dalekurtmurray.com which allows me to think outside of the box. visit www.dalekurtmurray.com
More informationCloud Computing & Visualization
Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International
More informationBig Data Infrastructure at Spotify
Big Data Infrastructure at Spotify Wouter de Bie Team Lead Data Infrastructure September 26, 2013 2 Who am I? According to ZDNet: "The work they have done to improve the Apache Hive data warehouse system
More informationTIBCO Complex Event Processing Evaluation Guide
TIBCO Complex Event Processing Evaluation Guide This document provides a guide to evaluating CEP technologies. http://www.tibco.com Global Headquarters 3303 Hillview Avenue Palo Alto, CA 94304 Tel: +1
More informationACCELERATE APPLICATION DELIVERY WITH OPENSHIFT. Siamak Sadeghianfar Sr Technical Marketing Manager, April 2016
ACCELERATE APPLICATION DELIVERY WITH Siamak Sadeghianfar Sr Technical Marketing Manager, OpenShift @siamaks April 2016 IT Must Evolve to Stay Ahead of Demands WA CPU R RAM isc tar SI Jar vm dk MSI nic
More informationTEN LAYERS OF CONTAINER SECURITY
TEN LAYERS OF CONTAINER SECURITY Tim Hunt Kirsten Newcomer May 2017 ABOUT YOU Are you using containers? What s your role? Security professionals Developers / Architects Infrastructure / Ops Who considers
More informationContainer 2.0. Container: check! But what about persistent data, big data or fast data?!
@unterstein @joerg_schad @dcos @jaxdevops Container 2.0 Container: check! But what about persistent data, big data or fast data?! 1 Jörg Schad Distributed Systems Engineer @joerg_schad Johannes Unterstein
More informationAWS Lambda: Event-driven Code in the Cloud
AWS Lambda: Event-driven Code in the Cloud Dean Bryen, Solutions Architect AWS Andrew Wheat, Senior Software Engineer - BBC April 15, 2015 London, UK 2015, Amazon Web Services, Inc. or its affiliates.
More informationCloud-Native Applications. Copyright 2017 Pivotal Software, Inc. All rights Reserved. Version 1.0
Cloud-Native Applications Copyright 2017 Pivotal Software, Inc. All rights Reserved. Version 1.0 Cloud-Native Characteristics Lean Form a hypothesis, build just enough to validate or disprove it. Learn
More informationAzure Certification BootCamp for Exam (Developer)
Azure Certification BootCamp for Exam 70-532 (Developer) Course Duration: 5 Days Course Authored by CloudThat Description Microsoft Azure is a cloud computing platform and infrastructure created for building,
More informationWindows Azure Overview
Windows Azure Overview Christine Collet, Genoveva Vargas-Solar Grenoble INP, France MS Azure Educator Grant Packaged Software Infrastructure (as a Service) Platform (as a Service) Software (as a Service)
More informationIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large
More informationTransitioning From SSIS to Azure Data Factory. Meagan Longoria, Solution Architect, BlueGranite
Transitioning From SSIS to Azure Data Factory Meagan Longoria, Solution Architect, BlueGranite Microsoft Data Platform MVP I enjoy contributing to and learning from the Microsoft data community. Blogger
More informationDeep Dive Amazon Kinesis. Ian Meyers, Principal Solution Architect - Amazon Web Services
Deep Dive Amazon Kinesis Ian Meyers, Principal Solution Architect - Amazon Web Services Analytics Deployment & Administration App Services Analytics Compute Storage Database Networking AWS Global Infrastructure
More informationLambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015
Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL May 2015 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document
More informationBig Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara
Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case
More informationImplementing the Twelve-Factor App Methodology for Developing Cloud- Native Applications
Implementing the Twelve-Factor App Methodology for Developing Cloud- Native Applications By, Janakiram MSV Executive Summary Application development has gone through a fundamental shift in the recent past.
More informationReal-time Streaming Applications on AWS Patterns and Use Cases
Real-time Streaming Applications on AWS Patterns and Use Cases Paul Armstrong - Solutions Architect (AWS) Tom Seddon - Data Engineering Tech Lead (Deliveroo) 28 th June 2017 2016, Amazon Web Services,
More informationCSE 444: Database Internals. Lecture 23 Spark
CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei
More informationAzure Data Factory. Data Integration in the Cloud
Azure Data Factory Data Integration in the Cloud 2018 Microsoft Corporation. All rights reserved. This document is provided "as-is." Information and views expressed in this document, including URL and
More informationTowards a Real- time Processing Pipeline: Running Apache Flink on AWS
Towards a Real- time Processing Pipeline: Running Apache Flink on AWS Dr. Steffen Hausmann, Solutions Architect Michael Hanisch, Manager Solutions Architecture November 18 th, 2016 Stream Processing Challenges
More informationLambda Architecture for Batch and Stream Processing. October 2018
Lambda Architecture for Batch and Stream Processing October 2018 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Notices This document is provided for informational purposes only.
More informationFROM LEGACY, TO BATCH, TO NEAR REAL-TIME. Marc Sturlese, Dani Solà
FROM LEGACY, TO BATCH, TO NEAR REAL-TIME Marc Sturlese, Dani Solà WHO ARE WE? Marc Sturlese - @sturlese Backend engineer, focused on R&D Interests: search, scalability Dani Solà - @dani_sola Backend engineer
More informationDeploying Applications on DC/OS
Mesosphere Datacenter Operating System Deploying Applications on DC/OS Keith McClellan - Technical Lead, Federal Programs keith.mcclellan@mesosphere.com V6 THE FUTURE IS ALREADY HERE IT S JUST NOT EVENLY
More informationRELIABILITY & AVAILABILITY IN THE CLOUD
RELIABILITY & AVAILABILITY IN THE CLOUD A TWILIO PERSPECTIVE twilio.com To the leaders and engineers at Twilio, the cloud represents the promise of reliable, scalable infrastructure at a price that directly
More informationDesigning MQ deployments for the cloud generation
Designing MQ deployments for the cloud generation WebSphere User Group, London Arthur Barr, Senior Software Engineer, IBM MQ 30 th March 2017 Top business drivers for cloud 2 Source: OpenStack user survey,
More informationCIB Session 12th NoSQL Databases Structures
CIB Session 12th NoSQL Databases Structures By: Shahab Safaee & Morteza Zahedi Software Engineering PhD Email: safaee.shx@gmail.com, morteza.zahedi.a@gmail.com cibtrc.ir cibtrc cibtrc 2 Agenda What is
More informationBig Data. Big Data Analyst. Big Data Engineer. Big Data Architect
Big Data Big Data Analyst INTRODUCTION TO BIG DATA ANALYTICS ANALYTICS PROCESSING TECHNIQUES DATA TRANSFORMATION & BATCH PROCESSING REAL TIME (STREAM) DATA PROCESSING Big Data Engineer BIG DATA FOUNDATION
More informationUsing DC/OS for Continuous Delivery
Using DC/OS for Continuous Delivery DevPulseCon 2017 Elizabeth K. Joseph, @pleia2 Mesosphere 1 Elizabeth K. Joseph, Developer Advocate, Mesosphere 15+ years working in open source communities 10+ years
More informationPrincipal Software Engineer Red Hat Emerging Technology June 24, 2015
USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton Principal Software Engineer Red Hat Emerging Technology June 24, 2015 ABOUT ME Distributed systems and data science in Red Hat's Emerging
More informationContainer in Production : Openshift 구축사례로 이해하는 PaaS. Jongjin Lim Specialist Solution Architect, AppDev
Container in Production : Openshift 구축사례로 이해하는 PaaS Jongjin Lim Specialist Solution Architect, AppDev jonlim@redhat.com Agenda Why Containers? Solution : Red Hat Openshift Container Platform Enterprise
More informationS Implementing DevOps and Hybrid Cloud
S- Implementing DevOps and Hybrid Cloud Srihari Angaluri Lenovo Data Center Group Red Hat Summit // Outline DevOps and Containers Architectural Considerations Lenovo Cloud Technology Center Implementing
More informationThe Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou
The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component
More informationBuilding a Data-Friendly Platform for a Data- Driven Future
Building a Data-Friendly Platform for a Data- Driven Future Benjamin Hindman - @benh 2016 Mesosphere, Inc. All Rights Reserved. INTRO $ whoami BENJAMIN HINDMAN Co-founder and Chief Architect of Mesosphere,
More informationBuilding High Performance Apps using NoSQL. Swami Sivasubramanian General Manager, AWS NoSQL
Building High Performance Apps using NoSQL Swami Sivasubramanian General Manager, AWS NoSQL Building high performance apps There is a lot to building high performance apps Scalability Performance at high
More informationData Ingestion at Scale. Jeffrey Sica
Data Ingestion at Scale Jeffrey Sica ARC-TS @jeefy Overview What is Data Ingestion? Concepts Use Cases GPS collection with mobile devices Collecting WiFi data from WAPs Sensor data from manufacturing machines
More informationBuilding/Running Distributed Systems with Apache Mesos
Building/Running Distributed Systems with Apache Mesos Philly ETE April 8, 2015 Benjamin Hindman @benh $ whoami 2007-2012 2009-2010 - 2014 my other computer is a datacenter my other computer is a datacenter
More informationTaming your heterogeneous cloud with Red Hat OpenShift Container Platform.
Taming your heterogeneous cloud with Red Hat OpenShift Container Platform martin@redhat.com Business Problem: Building a Hybrid Cloud solution PartyCo Some Bare Metal machines Mostly Virtualised CosPlayUK
More informationStreamSets Control Hub Installation Guide
StreamSets Control Hub Installation Guide Version 3.2.1 2018, StreamSets, Inc. All rights reserved. Table of Contents 2 Table of Contents Chapter 1: What's New...1 What's New in 3.2.1... 2 What's New in
More informationCloud & container monitoring , Lars Michelsen Check_MK Conference #4
Cloud & container monitoring 04.05.2018, Lars Michelsen Some cloud definitions Applications Data Runtime Middleware O/S Virtualization Servers Storage Networking Software-as-a-Service (SaaS) Applications
More informationDisclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme
CNA2080BU Deep Dive: How to Deploy and Operationalize Kubernetes Cornelia Davis, Pivotal Nathan Ness Technical Product Manager, CNABU @nvpnathan #VMworld #CNA2080BU Disclaimer This presentation may contain
More informationHadoop, Yarn and Beyond
Hadoop, Yarn and Beyond 1 B. R A M A M U R T H Y Overview We learned about Hadoop1.x or the core. Just like Java evolved, Java core, Java 1.X, Java 2.. So on, software and systems evolve, naturally.. Lets
More informationHDInsight > Hadoop. October 12, 2017
HDInsight > Hadoop October 12, 2017 2 Introduction Mark Hudson >20 years mixing technology with data >10 years with CapTech Microsoft Certified IT Professional Business Intelligence Member of the Richmond
More informationImportant DevOps Technologies (3+2+3days) for Deployment
Important DevOps Technologies (3+2+3days) for Deployment DevOps is the blending of tasks performed by a company's application development and systems operations teams. The term DevOps is being used in
More informationCIS 612 Advanced Topics in Database Big Data Project Lawrence Ni, Priya Patil, James Tench
CIS 612 Advanced Topics in Database Big Data Project Lawrence Ni, Priya Patil, James Tench Abstract Implementing a Hadoop-based system for processing big data and doing analytics is a topic which has been
More informationWhat s New at AWS? A selection of some new stuff. Constantin Gonzalez, Principal Solutions Architect, Amazon Web Services
What s New at AWS? A selection of some new stuff Constantin Gonzalez, Principal Solutions Architect, Amazon Web Services Speed of Innovation AWS Pace of Innovation AWS has been continually expanding its
More informationDATA SCIENCE USING SPARK: AN INTRODUCTION
DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data
More informationTECHNICAL BRIEF. Scheduling and Orchestration of Heterogeneous Docker-Based IT Landscapes. January 2017 Version 2.0 For Public Use
TECHNICAL BRIEF Scheduling and Orchestration of Heterogeneous Docker-Based IT Landscapes January 2017 Version 2.0 For Public Use Table of Contents 1 Summary... 2 2 Introduction... 2 3 Stonebranch DevOps
More informationApplication monitoring with BELK. Nishant Sahay, Sr. Architect Bhavani Ananth, Architect
Application monitoring with BELK Nishant Sahay, Sr. Architect Bhavani Ananth, Architect Why logs Business PoV Input Data Analytics User Interactions /Behavior End user Experience/ Improvements 2017 Wipro
More informationBuilding a government cloud Concepts and Solutions
Building a government cloud Concepts and Solutions Dr. Gabor Szentivanyi, ULX Open Source Consulting & Distribution Background Over 18 years of experience in enterprise grade open source Based in Budapest,
More informationIngest. Aaron Mildenstein, Consulting Architect Tokyo Dec 14, 2017
Ingest Aaron Mildenstein, Consulting Architect Tokyo Dec 14, 2017 Data Ingestion The process of collecting and importing data for immediate use 2 ? Simple things should be simple. Shay Banon Elastic{ON}
More informationCloudCenter for Developers
DEVNET-1198 CloudCenter for Developers Conor Murphy, Systems Engineer Data Centre Cisco Spark How Questions? Use Cisco Spark to communicate with the speaker after the session 1. Find this session in the
More informationThe SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.
Dublin Apache Kafka Meetup, 30 August 2017 The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Joseph @pleia2 * ASF projects 1 Elizabeth K. Joseph, Developer Advocate Developer Advocate
More information70-532: Developing Microsoft Azure Solutions
70-532: Developing Microsoft Azure Solutions Exam Design Target Audience Candidates of this exam are experienced in designing, programming, implementing, automating, and monitoring Microsoft Azure solutions.
More informationIntro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect
Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect Igor Roiter Big Data Cloud Solution Architect Working as a Data Specialist for the last 11 years 9 of them as a Consultant specializing
More informationWHITEPAPER. Embracing Containers & Microservices for future-proof application modernization
WHITEPAPER Embracing Containers & Microservices for future-proof application modernization The need for application modernization: Legacy applications are typically based on a monolithic design, which
More informationVerteego VDS Documentation
Verteego VDS Documentation Release 1.0 Verteego May 31, 2017 Installation 1 Getting started 3 2 Ansible 5 2.1 1. Install Ansible............................................. 5 2.2 2. Clone installation
More informationCloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationReport on The Infrastructure for Implementing the Mobile Technologies for Data Collection in Egypt
Report on The Infrastructure for Implementing the Mobile Technologies for Data Collection in Egypt Date: 10 Sep, 2017 Draft v 4.0 Table of Contents 1. Introduction... 3 2. Infrastructure Reference Architecture...
More informationCloudSwyft Learning-as-a-Service Course Catalog 2018 (Individual LaaS Course Catalog List)
CloudSwyft Learning-as-a-Service Course Catalog 2018 (Individual LaaS Course Catalog List) Microsoft Solution Latest Sl Area Refresh No. Course ID Run ID Course Name Mapping Date 1 AZURE202x 2 Microsoft
More informationBig Data Hadoop Course Content
Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationBest Practices and Performance Tuning on Amazon Elastic MapReduce
Best Practices and Performance Tuning on Amazon Elastic MapReduce Michael Hanisch Solutions Architect Amo Abeyaratne Big Data and Analytics Consultant ANZ 12.04.2016 2016, Amazon Web Services, Inc. or
More informationAzure Certification BootCamp for Exam (Architect)
Certification BootCamp for Exam 70-534 (Architect) Course Duration: 5 Days Course Authored by CloudThat Description Microsoft is a cloud computing platform and infrastructure, created for building, deploying
More informationModern ETL Tools for Cloud and Big Data. Ken Beutler, Principal Product Manager, Progress Michael Rainey, Technical Advisor, Gluent Inc.
Modern ETL Tools for Cloud and Big Data Ken Beutler, Principal Product Manager, Progress Michael Rainey, Technical Advisor, Gluent Inc. Agenda Landscape Cloud ETL Tools Big Data ETL Tools Best Practices
More informationApache Ignite - Using a Memory Grid for Heterogeneous Computation Frameworks A Use Case Guided Explanation. Chris Herrera Hashmap
Apache Ignite - Using a Memory Grid for Heterogeneous Computation Frameworks A Use Case Guided Explanation Chris Herrera Hashmap Topics Who - Key Hashmap Team Members The Use Case - Our Need for a Memory
More informationPrototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.org Pete Skomoroch Research Scientist at LinkedIn Consultant at Data Wrangling @peteskomoroch 09/29/09 1 Talk Outline TrendingTopics Overview Wikipedia Page
More informationContainers or Serverless? Mike Gillespie Solutions Architect, AWS Solutions Architecture
Containers or Serverless? Mike Gillespie Solutions Architect, AWS Solutions Architecture A Typical Application with Microservices Client Webapp Webapp Webapp Greeting Greeting Greeting Name Name Name Microservice
More informationIn-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet
In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Big data analytics / machine learning 6+ years
More informationCISC 7610 Lecture 2b The beginnings of NoSQL
CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone
More informationSpecialist ICT Learning
Specialist ICT Learning APPLIED DATA SCIENCE AND BIG DATA ANALYTICS GTBD7 Course Description This intensive training course provides theoretical and technical aspects of Data Science and Business Analytics.
More informationSpark and Flink running scalable in Kubernetes Frank Conrad
Spark and Flink running scalable in Kubernetes Frank Conrad Architect @ apomaya.com scalable efficient low latency processing 1 motivation, use case run (external, unknown trust) customer spark / flink
More information