The Evolution of a Data Project
- Margery Flowers
- 5 years ago
Transcription
1
2 The Evolution of a Data Project
3 The Evolution of a Data Project Python script
4 The Evolution of a Data Project Python script SQL on live DB
5 The Evolution of a Data Project Python script SQL on live DB SQL on reporting DB
6 The Evolution of a Data Project Python script SQL on live DB SQL on reporting DB Terrible confusion
7 The Evolution of a Data Project Python script SQL on live DB SQL on reporting DB Terrible confusion Hadoop / Spark cluster
8 What needs fixing image: Pexels
9 What needs fixing One cluster: data lock-in. image: Pexels
10 What needs fixing One cluster: data lock-in. Want cluster time? You have to wait. image: Pexels
11 What needs fixing One cluster: data lock-in. Want cluster time? You have to wait. Clusters are underutilized and EXPENSIVE image: Pexels
12 Elastic Big Data Datadog Doug Daniels Director, Engineering
13 What's our big data platform do? WHOM: Data Engineers, Data Scientists
14 What's our big data platform do? WHOM: Data Engineers, Data Scientists | do WHAT: App features, Statistical Analysis/ML, Ad-hoc investigation
15 What's our big data platform do? WHOM: Data Engineers, Data Scientists | do WHAT: App features, Statistical Analysis/ML, Ad-hoc investigation | WITH: Spark, Hadoop (Pig), Python (Luigi)
16 Exploring the platform COPIOUS TOOLING ELASTIC COMPUTE CLOUD STORAGE
17
18 CLOUD STORAGE
19 What do we store?
20 150 Integrations and more
21 What's time series data? timestamp | metric: system.cpu.idle | value | tags: host:i-xyz, role:cassandra,
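A minimal sketch of the data point described on this slide, assuming one record per (timestamp, metric, value, tags) tuple. The class name `MetricPoint`, the helper `tag_dict`, and the timestamp and value used below are illustrative assumptions, not taken from the talk.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class MetricPoint:
    """One time series data point: timestamp, metric name, value, tags."""
    timestamp: int          # seconds since the Unix epoch (assumed unit)
    metric: str             # e.g. "system.cpu.idle"
    value: float
    tags: Tuple[str, ...]   # "key:value" strings, e.g. "host:i-xyz"

    def tag_dict(self) -> Dict[str, str]:
        """Split each "key:value" tag into a dictionary entry."""
        return dict(tag.split(":", 1) for tag in self.tags)


# Made-up timestamp and value; metric name and tags are the slide's examples.
point = MetricPoint(
    timestamp=1_500_000_000,
    metric="system.cpu.idle",
    value=97.5,
    tags=("host:i-xyz", "role:cassandra"),
)
```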
22 We collect over a trillion of these per day and growing!
23 Where to put the petabytes? Amazon S3. Amazon S3
24 How data gets to S3 Kafka -> Go (Buffer; Sort + Dedupe; Upload) -> Amazon S3 (Internal Format) -> Luigi/Spark/Pig (Partition + Sort; Write Parquet; Update Metastore) -> Amazon S3 (Parquet) + Hive Metastore (Metadata)
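The "Sort + Dedupe" and "Partition + Sort" steps named on this slide can be sketched in plain Python. This is only an illustration of the idea: the real pipeline runs in Go and Luigi/Spark/Pig, and the tuple layout, exact-match deduplication, and hourly partition key below are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone


def sort_and_dedupe(points):
    """Drop exact duplicate (timestamp, metric, value) tuples and sort them."""
    return sorted(set(points))


def partition_by_hour(points):
    """Group points by UTC hour; the hourly partition key is an assumption."""
    partitions = defaultdict(list)
    for ts, metric, value in points:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d-%H")
        partitions[hour].append((ts, metric, value))
    return dict(partitions)


# A small buffer with one duplicate, standing in for points read from Kafka.
buffered = [
    (3600, "system.cpu.idle", 1.0),
    (10, "system.cpu.idle", 2.0),
    (10, "system.cpu.idle", 2.0),
]
clean = sort_and_dedupe(buffered)
partitions = partition_by_hour(clean)
```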
25 Isn't this a job for HDFS?
26 What we don't love about HDFS
27 What we don't love about HDFS Causes the one cluster problem
28 What we don't love about HDFS Causes the one cluster problem Come for the storage, get stuck with the servers
29 What we don't love about HDFS Causes the one cluster problem Come for the storage, get stuck with the servers No Java? No data!
30 S3 is flexible! Read data from as many clusters as you want
31 S3 is flexible! Read data from as many clusters as you want Store unlimited stuff(*) with no management * Accepting laws of physics and your credit card limit
32 S3 is flexible! Read data from as many clusters as you want Store unlimited stuff(*) with no management Rock solid: durability ( ), availability (99.99) * Accepting laws of physics and your credit card limit
33 S3 is flexible! Read data from as many clusters as you want Store unlimited stuff(*) with no management Rock solid: durability ( ), availability (99.99) Access from any programming language * Accepting laws of physics and your credit card limit
34 Decouple data and compute (BREAK THE RULES!)
35 Breaking the rules is fine. In benchmarks: S3 is ~2X slower than HDFS
36 Breaking the rules is fine. In benchmarks: S3 is ~2X slower than HDFS
37 It's not all roses
38 Listing is slooooow (A CAUTIONARY TALE)
39 How to fix slow listing Parallelize it Bigger files
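The "Parallelize it" fix can be sketched as one listing call per key prefix, run concurrently. To stay self-contained, the sketch below lists an in-memory set instead of calling S3; `list_prefix`, `parallel_list`, `FAKE_STORE`, and the key names are hypothetical stand-ins, not the speaker's code.

```python
from concurrent.futures import ThreadPoolExecutor


def list_prefix(store, prefix):
    """Stand-in for one slow, sequential listing call against object storage."""
    return sorted(key for key in store if key.startswith(prefix))


def parallel_list(store, prefixes, max_workers=8):
    """List several prefixes concurrently and concatenate the results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_prefix = list(pool.map(lambda p: list_prefix(store, p), prefixes))
    return [key for keys in per_prefix for key in keys]


# Hypothetical keys; fewer, bigger files per prefix also means fewer keys to list.
FAKE_STORE = {
    "metrics/hour=00/part-0.parquet",
    "metrics/hour=00/part-1.parquet",
    "metrics/hour=01/part-0.parquet",
}
keys = parallel_list(FAKE_STORE, ["metrics/hour=00/", "metrics/hour=01/"])
```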
40 No way to quickly move data HDFS: Task -> write -> Intermediate -> atomic move -> Final
41 No way to quickly move data HDFS: Task -> write -> Intermediate -> atomic move -> Final | S3: Task -> write
42 No way to quickly move data Say goodbye to speculative execution
43 No way to quickly move data Say goodbye to speculative execution Say hello to better task timeouts
44 But really: We ♥ S3. This is a great system. Data accessible from many clusters. Storage is easy to manage. It's a multi-language paradise up in here.
45 CLOUD STORAGE ELASTIC COMPUTE
46 TRADITIONALLY One cluster to compute it all
47 Instead, we run many, many clusters New cluster for every automated job clusters at a time Median lifetime: 2hrs
48 Why so many clusters?
49 Total isolation We know what's happening and why
50 No more waiting on loaded clusters Tailor each cluster to the work you want to do Scale up when you need results faster Data scientists and data engineers don't have to wait
51 Pick the best hardware for each job == ~30% savings over general purpose hardware c3 for CPU-bound jobs r3 for memory-bound jobs m1.xlarge if you don't care (cheap!)
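The hardware rule on this slide amounts to a small lookup. A minimal sketch, assuming each job declares its bottleneck as a string; the function name and the "cpu"/"memory" labels are illustrative.

```python
# Instance choices exactly as listed on the slide.
INSTANCE_BY_BOTTLENECK = {
    "cpu": "c3",      # CPU-bound jobs
    "memory": "r3",   # memory-bound jobs
}
DEFAULT_INSTANCE = "m1.xlarge"  # cheap default when the job has no clear bottleneck


def pick_instance(bottleneck=None):
    """Return the instance family (or type) suggested for a job's bottleneck."""
    return INSTANCE_BY_BOTTLENECK.get(bottleneck, DEFAULT_INSTANCE)
```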
52 100% spot-instance clusters, all the time.* * (ok, most of the time)
53 Ridiculous savings! 100% spot-instance clusters, all the time.* Disappearing clusters! * (ok, most of the time)
54 How we do spot clusters In the big data platform Bid the on-demand price, pay the spot price
55 How we do spot clusters In the big data platform Bid the on-demand price, pay the spot price Fallback to on-demand instances if you can't get spot
56 How we do spot clusters In the big data platform Bid the on-demand price, pay the spot price Fallback to on-demand instances if you can't get spot Monitor everything: jobs, clusters, spot market
57 How we do spot clusters In the big data platform Bid the on-demand price, pay the spot price Fallback to on-demand instances if you can't get spot Monitor everything: jobs, clusters, spot market Save up to 80% off the on-demand price
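The bidding rule above (bid the on-demand price, pay the spot price, fall back to on-demand) can be sketched as a pricing function. The prices are made up and the function names are hypothetical; this models the decision only, not any AWS API.

```python
def price_paid(on_demand_price, spot_price):
    """Return (market, hourly price) when the bid equals the on-demand price.

    spot_price is None when no spot capacity is available.
    """
    bid = on_demand_price
    if spot_price is not None and spot_price <= bid:
        return ("spot", spot_price)          # bid wins: pay the spot price
    return ("on-demand", on_demand_price)    # fall back to on-demand instances


def savings_fraction(on_demand_price, paid):
    """Fraction saved relative to always paying the on-demand price."""
    return 1 - paid / on_demand_price


# Made-up prices, in dollars per instance-hour.
market, paid = price_paid(on_demand_price=1.0, spot_price=0.25)
```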
58 Monitor the spot price Switch hardware when the market gets volatile
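One way to act on "switch hardware when the market gets volatile" is to score recent spot prices per instance type and skip the volatile ones. The volatility measure (standard deviation divided by mean), the 0.2 threshold, the instance type names, and the price history are all assumptions for illustration.

```python
from statistics import mean, pstdev


def is_volatile(recent_prices, threshold=0.2):
    """Flag a price series whose std-dev/mean ratio exceeds the threshold."""
    return pstdev(recent_prices) / mean(recent_prices) > threshold


def choose_instance_type(price_history, preferred_order):
    """Return the first preferred type with a calm market, or None if all are volatile."""
    for instance_type in preferred_order:
        if not is_volatile(price_history[instance_type]):
            return instance_type
    return None


# Made-up recent spot prices per instance type.
history = {
    "c3.xlarge": [0.10, 0.50, 0.12, 0.60],   # swings widely
    "r3.xlarge": [0.20, 0.21, 0.20, 0.19],   # stable
}
choice = choose_instance_type(history, ["c3.xlarge", "r3.xlarge"])
```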
59 We like this strategy a lot! No waiting for the cluster you need. No waste from hardware sitting idle. Spot clusters are affordable enough to use everywhere. Cluster is oversubscribed; everyone waiting in line to do their work. Lots of expensive hardware sits idle when everyone's gone.
60 What's challenging, though?
61 Many things that disappear.
62 COPIOUS TOOLING ELASTIC COMPUTE CLOUD STORAGE
63 Platform as a service Jobs, Clusters, Schedules, Users, Code, Monitoring, Logs, and more CLI Web and APIs
64 Big Data Platform Architecture DATA Amazon S3
65 Big Data Platform Architecture CLUSTER EMR DATA Amazon S3
66 Big Data Platform Architecture WORKER Pig Workers Spark Workers Luigi Workers CLUSTER EMR DATA Amazon S3
67 Big Data Platform Architecture STORAGE Metadata DB Queueing Logs WORKER Pig Workers Spark Workers Luigi Workers CLUSTER EMR DATA Amazon S3
68 Big Data Platform Architecture WEB Web API STORAGE Metadata DB Queueing Logs WORKER Pig Workers Spark Workers Luigi Workers CLUSTER EMR DATA Amazon S3
69 Big Data Platform Architecture USER CLI API Clients Job Scheduler WEB Web API STORAGE Metadata DB Queueing Logs WORKER Pig Workers Spark Workers Luigi Workers CLUSTER EMR DATA Amazon S3
70 Big Data Platform Architecture USER CLI API Clients Job Scheduler Datadog Monitoring WEB Web API STORAGE Metadata DB Queueing Logs WORKER Pig Workers Spark Workers Luigi Workers CLUSTER EMR DATA Amazon S3
71 How to find the right cluster when they disappear?
72 Cluster tagging for discovery #anomaly-detection #monitor-report
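Tag-based discovery can be sketched as a lookup by tag instead of by cluster ID, since IDs change every time a short-lived cluster is replaced. The record layout and cluster IDs below are hypothetical; the two tags are the ones shown on the slide.

```python
def find_clusters(clusters, tag):
    """Return the IDs of running clusters that carry the given tag."""
    return [c["id"] for c in clusters if tag in c["tags"]]


# Hypothetical snapshot of currently running clusters.
running = [
    {"id": "cluster-001", "tags": {"anomaly-detection"}},
    {"id": "cluster-002", "tags": {"monitor-report"}},
    {"id": "cluster-003", "tags": {"anomaly-detection"}},
]
anomaly_clusters = find_clusters(running, "anomaly-detection")
```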
73 How to monitor many disappearing clusters?
74 Dynamic Monitoring on Tags Dashboards Monitors cluster_tags: anomaly-detection anomaly-detection
75 How to debug problems when the cluster's gone?
76 Debugging In a Post-Cluster World
77 Debugging In a Post-Cluster World Send all logs to S3 HDFS YARN Pig Spark
78 Debugging In a Post-Cluster World Send all logs to S3 Visualize the pipeline HDFS YARN Pig Lipstick for Pig Spark History Server Luigi task flow Spark
79 Debugging In a Post-Cluster World Send all logs to S3 Visualize the pipeline Preserve historical monitoring data HDFS YARN Pig Spark Lipstick for Pig Spark History Server Luigi task flow Keep history, by tag, after the cluster disappears
80 How to handle certain cluster failure in your jobs?
81 Luigi: design for failure. Automatic cleanup and restart A B
82 Luigi: design for failure. Automatic cleanup and restart B
83 Luigi: design for failure. Automatic cleanup and restart
84 Luigi: design for failure. Automatic cleanup and restart
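The restart behaviour on these slides relies on tasks that are skipped once their output exists, so a rerun only redoes what is missing. The sketch below imitates that idea with the standard library only. It is not the Luigi API: the `Task` class, the `build` function, and the file layout are assumptions for illustration.

```python
import os
import tempfile


class Task:
    """Minimal stand-in for a workflow task whose completion is 'output exists'."""

    def __init__(self, name, out_dir, requires=()):
        self.name = name
        self.path = os.path.join(out_dir, name + ".out")
        self.requires = list(requires)

    def complete(self):
        return os.path.exists(self.path)

    def run(self):
        with open(self.path, "w") as handle:
            handle.write("done\n")


def build(task, executed):
    """Run dependencies first, then the task; skip anything already complete."""
    for dependency in task.requires:
        build(dependency, executed)
    if not task.complete():
        task.run()
        executed.append(task.name)


out_dir = tempfile.mkdtemp()
task_a = Task("A", out_dir)
task_b = Task("B", out_dir, requires=[task_a])

first_run = []
build(task_b, first_run)       # nothing exists yet: A runs, then B

os.remove(task_b.path)         # simulate cleanup after B's cluster failed
second_run = []
build(task_b, second_run)      # A's output survived, so only B reruns
```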
85 COPIOUS TOOLING ELASTIC COMPUTE CLOUD STORAGE
86 Recommendations for Cloud Big Data
87 Recommendations for Cloud Big Data Use S3 for permanent data, not HDFS
88 Recommendations for Cloud Big Data Use S3 for permanent data, not HDFS Start from EMR if building yourself
89 Recommendations for Cloud Big Data Use S3 for permanent data, not HDFS Start from EMR if building yourself Look into a PaaS: Netflix Genie, Qubole, Databricks
90 Recommendations for Cloud Big Data Use S3 for permanent data, not HDFS Start from EMR if building yourself Look into a PaaS: Netflix Genie, Qubole, Databricks Tag your clusters for dynamic monitoring
91 Recommendations for Cloud Big Data Use S3 for permanent data, not HDFS Start from EMR if building yourself Look into a PaaS: Netflix Genie, Qubole, Databricks Tag your clusters for dynamic monitoring Design for failure with a workflow tool (Luigi, Airflow)
92 Thanks! Want to work with us on Spark, Hadoop, Kafka, Parquet, and more? jobs.datadoghq.com DM or
Unlimited Scalability in the Cloud A Case Study of Migration to Amazon DynamoDB Steve Saporta CTO, SpinCar Mar 19, 2016 SpinCar When a web-based business grows... More customers = more transactions More
More informationServers fail, who cares? (Answer: I do, sort of) Gregg Ulrich, #netflixcloud #cassandra12
Servers fail, who cares? (Answer: I do, sort of) Gregg Ulrich, Netflix @eatupmartha #netflixcloud #cassandra12 1 June 29, 2012 2 3 4 [1] 5 From the Netflix tech blog: Cassandra, our distributed cloud persistence
More informationSmart Data Catalog DATASHEET
DATASHEET Smart Data Catalog There is so much data distributed across organizations that data and business professionals don t know what data is available or valuable. When it s time to create a new report
More informationRsyslog: going up from 40K messages per second to 250K. Rainer Gerhards
Rsyslog: going up from 40K messages per second to 250K Rainer Gerhards What's in it for you? Bad news: will not teach you to make your kernel component five times faster Perspective user-space application
More informationData Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros
Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on
More informationBIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG
BIG DATA ANALYTICS USING HADOOP TOOLS APACHE HIVE VS APACHE PIG Prof R.Angelin Preethi #1 and Prof J.Elavarasi *2 # Department of Computer Science, Kamban College of Arts and Science for Women, TamilNadu,
More informationSurvey of the Azure Data Landscape. Ike Ellis
Survey of the Azure Data Landscape Ike Ellis Wintellect Core Services Consulting Custom software application development and architecture Instructor Led Training Microsoft s #1 training vendor for over
More informationMapR Enterprise Hadoop
2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS
More informationProcessing of big data with Apache Spark
Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT
More informationAccelerating BI on Hadoop: Full-Scan, Cubes or Indexes?
White Paper Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes? How to Accelerate BI on Hadoop: Cubes or Indexes? Why not both? 1 +1(844)384-3844 INFO@JETHRO.IO Overview Organizations are storing more
More informationarxiv: v1 [cs.dc] 20 Aug 2015
InstaCluster: Building A Big Data Cluster in Minutes Giovanni Paolo Gibilisco DEEP-SE group - DEIB - Politecnico di Milano via Golgi, 42 Milan, Italy giovannipaolo.gibilisco@polimi.it Sr dan Krstić DEEP-SE
More informationBest practices for building a Hadoop Data Lake Solution CHARLOTTE HADOOP USER GROUP
Best practices for building a Hadoop Data Lake Solution CHARLOTTE HADOOP USER GROUP 07.29.2015 LANDING STAGING DW Let s start with something basic Is Data Lake a new concept? What is the closest we can
More information