Apache Hadoop 3. Balazs Gaspar Sales Engineer CEE & CIS Cloudera, Inc. All rights reserved.
|
|
- Neil Lambert
- 6 years ago
- Views:
Transcription
1 Apache Hadoop 3 Balazs Gaspar Sales Engineer CEE & CIS balazs@cloudera.com 1
2 We believe data can make what is impossible today, possible tomorrow 2
3 We empower people to transform complex data into clear and actionable insights DRIVE CUSTOMER INSIGHTS CONNECT PRODUCTS & SERVICES (IoT) PROTECT BUSINESS 3
4 We deliver the modern platform for machine learning and advanced analytics RUNS ANYWHERE Cloud Multi-cloud On-prem SCALABLE Elastic Cost-effective Lower TCO ENTERPRISE GRADE Secure Performant Compliant 4
5 Cloudera Enterprise The modern platform for machine learning and analytics optimized for the cloud CORE SERVICES DATA SCIENCE ANALYTIC DATABASE OPERATIONAL DATABASE DATA ENGINEERING EXTENSIBLE SERVICES SECURITY GOVERNANCE WORKLOAD MANAGEMENT INGEST & REPLICATION DATA CATALOG STORAGE SERVICES S3 ADLS HDFS KUDU 51
6 6
7 The Enterprise Platform for Data Science and Machine Learning 440x MORE DATA 30B CONNECTED DEVICES The data is now here Modern Platform for Machine Learning and Advanced Analytics Cloudera first to integrate Spark 500 Customers Run Spark on Leading adoption among enterprises 7
8 Data Science at the Enterprise Access For sensitive data, secure clusters are difficult to access. And IT typically doesn t want random packages installed on a secure cluster. Popular open source tools don t easily connect to these environments, or always support Hadoop data formats. Scale Laptops rarely have capacity for medium, let alone big data. This leads to a lot of sampling. Popular frameworks don t easily parallelize on a cluster. Typically code has to get rewritten for production. Developer Experience Notebooks, while awesome, don t easily support virtual environment and dependency management, especially for teams. This makes sharing and reproducibility hard. Notebooks are also challenging to put into production. 8
9 Introducing Cloudera Data Science Workbench Self-service data science for the enterprise Accelerates data science from development to production with: Secure self-service environments for data scientists to work against Cloudera clusters Support for Python, R, and Scala, plus project dependency isolation for multiple library versions Workflow automation, version control, collaboration and sharing 9
10 Hadoop 3 in a Nutshell 10
11 An Abbreviated History of Hadoop Releases Date Release Major Notes First release at the ASF Security, HBase support YARN, NameNode HA, wire compat HDFS encryption, rolling upgrade, node labels Most recent production-quality release line Current release line 11
12 Motivation for Hadoop 3 Upgrade minimum Java version to Java 8 Java 7 end-of-life in April 2015 Many Java libraries now only support Java 8 HDFS erasure coding Major feature that refactored core pieces of HDFS Too big to backport to 2.x Classpath isolation Potentially impacts all clients Other miscellaneous incompatible bugfixes and improvements Hadoop 2.x was branched in years of changes waiting for
13 Hadoop 3 Status and Release Plan A series of alphas and betas leading up to GA Alpha4 feature complete Beta1 compatibility freeze GA by the end of the year Release Date alpha alpha alpha alpha beta Q GA 2017 Q4 13
14 HDFS & Hadoop Features 14
15 3x replication vs. Erasure coding /foo.csv - 3 block file b1 b2 b3 15
16 3x replication vs. Erasure coding /foo.csv - 3 block file b1 b2 b3 b1 b2 b3 b1 b2 b3 16
17 3x replication vs. Erasure coding /foo.csv - 3 block file b1 b2 b3 b1 b2 b3 b1 b2 b3 3 replicas 3 x 3 = 9 total replicas 6 / 3 = 200% overhead! 3 blocks 17
18 Erasure coding (HDFS-7285) Motivation: improve storage efficiency of HDFS ~2x the storage efficiency compared to 3x replication Reduction of overhead from 200% to 40% Uses Reed-Solomon(k,m) erasure codes instead of replication Support for multiple erasure coding policies RS(3,2), RS(6,3), RS(10,4) Can improve data durability RS(6,3) can tolerate 3 failures RS(10,4) can tolerate 4 failures Missing blocks reconstructed from remaining blocks 18
19 3x replication vs. Erasure coding /foo.csv - 3 block file b1 b2 b3 19
20 3x replication vs. Erasure coding /foo.csv - 3 block file b1 b2 b3 p1 p2 20
21 3x replication vs. Erasure coding /foo.csv - 3 block file b1 b2 b3 p1 p = 5 replicas 2 / 3 = 67% overhead! 3 data blocks 2 parity blocks 21
22 3x replication vs. Erasure coding /foo.csv - 3 block file b1 b2 b3 p1 p = 5 replicas 2 / 3 = 67% overhead! 3 data blocks 2 parity blocks /bigfoo.csv - 10 block file... b1 b2 b10 p data blocks 4 parity blocks p = 14 replicas 4 / 10 = 40% overhead! 22
23 EC Reconstruction /foo.csv - 3 block file b1 b2 b3 p1 p2 Reed-Solomon (3,2) 23
24 EC Reconstruction /foo.csv - 3 block file X b1 b2 b3 p1 p2 Reed-Solomon (3,2) 24
25 EC Reconstruction /foo.csv - 3 block file X b1 b2 b3 p1 p2 Reed-Solomon (3,2) Read 3 remaining blocks Run RS to recover b3 b3 New copy of b3 recovered 25
26 EC implications File data is striped across multiple nodes and racks Reads and writes are remote and cross-rack Reconstruction is network-intensive, reads m blocks cross-rack Important to use Intel s optimized ISA-L for performance 1+ GB/s encode/decode speed, much faster than Java implementation CPU is no longer a bottleneck Need to combine data into larger files to avoid an explosion in replica count Bad: 1x1GB file -> RS(10,4) -> 14x100MB EC blocks (4.6x # replicas) Good: 10x1GB file -> RS(10,4) -> 14x1GB EC blocks (0.46x # replicas) Works best for archival / cold data use cases 26
27 EC performance 27
28 EC performance 28
29 Other new HDFS features Classpath isolation shaded HDFS clients Shell script rewrite Support for multiple Standby NameNodes Intra-DataNode balancer Support for Microsoft Azure Data Lake and Aliyun OSS (Alibaba Cloud) Move default ports out of the ephemeral range 29
30 YARN Features 30
31 Application Timeline Service v2 Resource Manager 31
32 Application Timeline Service v2 jobs Resource Manager 32
33 Application Timeline Service v2 jobs Resource Manager Job History Server 33
34 Application Timeline Service v2 jobs Node Manager Resource Manager Job History Server HDFS 34
35 Application Timeline Service v2 jobs Resource Manager Job History Server Spark History Server 35
36 Application Timeline Service v2 jobs? Resource Manager Job History Server Spark History Server 36
37 Application Timeline Service v2 Resource Manager Timeline Collecter Timeline Timeline Reader Timeline Reader Reader HBase Timeline Collecter Node Manager Application Master 37
38 Application Timeline Service v2 jobs Resource Manager Application Timeline Service HBase 38
39 Application Timeline Service v2 Store for application and system events and data Distributed Scalable Structured Data Model Updated in real time Application status Application metrics System metrics Fed by resource manager, node manager, and application masters REST API 39
40 Application Timeline Service v2 Flows 40
41 New YARN UI Rich client application Built on Node.js and Ember Improved visibility into cluster usage Memory, CPU By queues and applications Sunburst graphs for hierarchical queues NodeManager heatmap ATSv2 integration Plot container start/stop events Easy to capture delays in app execution 41
42 New YARN UI: Cluster Overview 42
43 New YARN UI: Queues 43
44 Before Hadoop 3 memory and CPU are the only managed resources Resource Types allows adding new managed resources Countable resources: GPUs, Disks etc. Resource profiles Resource Types Similar conceptually to EC2 instance types Capture complex resource request Current virtual CPU cores and memory resources work as before 44
45 YARN Federation Admin Policy Resource Manager Node Manager Node Manager Node Manager Node Manager Router Node Manager Resource Manager Node Manager Node Manager Node Manager 45
46 YARN Federation YARN scalability Twitter runs a 10k node cluster with fair scheduler Yahoo! runs 4k node cluster with capacity scheduler Federation Restrict users to sub-clusters based on policy Scalability to 100k nodes and beyond Heterogeneous scheduling policies 46
47 Opportunistic Containers Scheduler s job is to keep all resources busy Scheduling gaps Nothing to run Resource contention Resource reservations Opportunistic containers fill those gaps Requested explicitly Dedicated scheduler Queued at the node managers Scheduled locally when resources are available Preempted when guaranteed containers need to run Coming in 2.9 and
48 Oversubscription Resource utilization is typically low in most clusters (20-50%) Provision for peak usage Usage < Allocation Mean Usage = ½ Peak Usage 48
49 Other new YARN features New YARN UI Node.js application, improved visibility into cluster usage New resource types beyond CPU and memory YARN Federation Improved scalability to 100k node clusters and beyond Opportunistic containers / Oversubscription 49
50 Summary: What s new in Hadoop 3.0? Storage Optimization HDFS: Erasure codes Improved Visibility into Cluster Operations YARN: ATSv2 YARN: New UI Scalability & Multi-tenancy YARN: Federation Improved Utilization YARN: Opportunistic Containers YARN: Oversubscription Refactor Base Lots of Trunk content JDK8 and newer dependent libraries 50
51 Thank you Balazs Gaspar 51
52 Compatibility and Testing 52
53 Compatibility Strong feedback from large users on the need for compatibility Preserves wire-compatibility with Hadoop 2 clients Impossible to coordinate upgrading off-cluster Hadoop clients Will support rolling upgrade from Hadoop 2 to Hadoop 3 Can t take downtime to upgrade a business-critical cluster Not fully preserving API compatibility Dependency version bumps Removal of deprecated APIs and tools Shell script rewrite, rework of Hadoop tools scripts Incompatible bug fixes 53
54 Conclusion Expect Hadoop GA by the end of the year Shiny new features HDFS erasure coding Client classpath isolation YARN ATSv2 YARN Federation Opportunistic containers and oversubscription Great time to get involved in testing and validation 54
HADOOP 3.0 is here! Dr. Sandeep Deshmukh Sadepach Labs Pvt. Ltd. - Let us grow together!
HADOOP 3.0 is here! Dr. Sandeep Deshmukh sandeep@sadepach.com Sadepach Labs Pvt. Ltd. - Let us grow together! About me BE from VNIT Nagpur, MTech+PhD from IIT Bombay Worked with Persistent Systems - Life
More informationModern Data Warehouse The New Approach to Azure BI
Modern Data Warehouse The New Approach to Azure BI History On-Premise SQL Server Big Data Solutions Technical Barriers Modern Analytics Platform On-Premise SQL Server Big Data Solutions Modern Analytics
More informationHadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved
Hadoop 2.x Core: YARN, Tez, and Spark YARN Hadoop Machine Types top-of-rack switches core switch client machines have client-side software used to access a cluster to process data master nodes run Hadoop
More informationUnderstanding the latent value in all content
Understanding the latent value in all content John F. Kennedy (JFK) November 22, 1963 INGEST ENRICH EXPLORE Cognitive skills Data in any format, any Azure store Search Annotations Data Cloud Intelligence
More informationAutomation of Rolling Upgrade for Hadoop Cluster without Data Loss and Job Failures. Hiroshi Yamaguchi & Hiroyuki Adachi
Automation of Rolling Upgrade for Hadoop Cluster without Data Loss and Job Failures Hiroshi Yamaguchi & Hiroyuki Adachi About Us 2 Hiroshi Yamaguchi Hiroyuki Adachi Hadoop DevOps Engineer Hadoop Engineer
More informationBig Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours
Big Data Hadoop Developer Course Content Who is the target audience? Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours Complete beginners who want to learn Big Data Hadoop Professionals
More informationMapR Enterprise Hadoop
2014 MapR Technologies 2014 MapR Technologies 1 MapR Enterprise Hadoop Top Ranked Cloud Leaders 500+ Customers 2014 MapR Technologies 2 Key MapR Advantage Partners Business Services APPLICATIONS & OS ANALYTICS
More informationConfiguring and Deploying Hadoop Cluster Deployment Templates
Configuring and Deploying Hadoop Cluster Deployment Templates This chapter contains the following sections: Hadoop Cluster Profile Templates, on page 1 Creating a Hadoop Cluster Profile Template, on page
More informationBuilding a Data-Friendly Platform for a Data- Driven Future
Building a Data-Friendly Platform for a Data- Driven Future Benjamin Hindman - @benh 2016 Mesosphere, Inc. All Rights Reserved. INTRO $ whoami BENJAMIN HINDMAN Co-founder and Chief Architect of Mesosphere,
More informationYARN: A Resource Manager for Analytic Platform Tsuyoshi Ozawa
YARN: A Resource Manager for Analytic Platform Tsuyoshi Ozawa ozawa.tsuyoshi@lab.ntt.co.jp ozawa@apache.org About me Tsuyoshi Ozawa Research Engineer @ NTT Twitter: @oza_x86_64 Over 150 reviews in 2015
More informationThe Evolution of Big Data Platforms and Data Science
IBM Analytics The Evolution of Big Data Platforms and Data Science ECC Conference 2016 Brandon MacKenzie June 13, 2016 2016 IBM Corporation Hello, I m Brandon MacKenzie. I work at IBM. Data Science - Offering
More informationMicrosoft Azure Databricks for data engineering. Building production data pipelines with Apache Spark in the cloud
Microsoft Azure Databricks for data engineering Building production data pipelines with Apache Spark in the cloud Azure Databricks As companies continue to set their sights on making data-driven decisions
More informationCloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationSpark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay Mellanox Technologies
Spark Over RDMA: Accelerate Big Data SC Asia 2018 Ido Shamay 1 Apache Spark - Intro Spark within the Big Data ecosystem Data Sources Data Acquisition / ETL Data Storage Data Analysis / ML Serving 3 Apache
More informationStages of Data Processing
Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,
More informationEsgynDB Enterprise 2.0 Platform Reference Architecture
EsgynDB Enterprise 2.0 Platform Reference Architecture This document outlines a Platform Reference Architecture for EsgynDB Enterprise, built on Apache Trafodion (Incubating) implementation with licensed
More informationCloud Computing & Visualization
Cloud Computing & Visualization Workflows Distributed Computation with Spark Data Warehousing with Redshift Visualization with Tableau #FIUSCIS School of Computing & Information Sciences, Florida International
More informationMapReduce, Hadoop and Spark. Bompotas Agorakis
MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)
More informationBig Data and Object Storage
Big Data and Object Storage or where to store the cold and small data? Sven Bauernfeind Computacenter AG & Co. ohg, Consultancy Germany 28.02.2018 Munich Volume, Variety & Velocity + Analytics Velocity
More informationBig Data for Engineers Spring Resource Management
Ghislain Fourny Big Data for Engineers Spring 2018 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models
More informationIntegrate MATLAB Analytics into Enterprise Applications
Integrate Analytics into Enterprise Applications Dr. Roland Michaely 2015 The MathWorks, Inc. 1 Data Analytics Workflow Access and Explore Data Preprocess Data Develop Predictive Models Integrate Analytics
More informationWhat s New in Apache HTrace by Colin P. McCabe
What s New in Apache HTrace by Colin P. McCabe About Me I work on HDFS and related storage technologies at Cloudera Committer on the Hadoop and HTrace projects. Previously worked on the Ceph distributed
More informationCPSC 426/526. Cloud Computing. Ennan Zhai. Computer Science Department Yale University
CPSC 426/526 Cloud Computing Ennan Zhai Computer Science Department Yale University Recall: Lec-7 In the lec-7, I talked about: - P2P vs Enterprise control - Firewall - NATs - Software defined network
More informationModernizing Business Intelligence and Analytics
Modernizing Business Intelligence and Analytics Justin Erickson Senior Director, Product Management 1 Agenda What benefits can I achieve from modernizing my analytic DB? When and how do I migrate from
More informationMAPR DATA GOVERNANCE WITHOUT COMPROMISE
MAPR TECHNOLOGIES, INC. WHITE PAPER JANUARY 2018 MAPR DATA GOVERNANCE TABLE OF CONTENTS EXECUTIVE SUMMARY 3 BACKGROUND 4 MAPR DATA GOVERNANCE 5 CONCLUSION 7 EXECUTIVE SUMMARY The MapR DataOps Governance
More informationBig Data 7. Resource Management
Ghislain Fourny Big Data 7. Resource Management artjazz / 123RF Stock Photo Data Technology Stack User interfaces Querying Data stores Indexing Processing Validation Data models Syntax Encoding Storage
More informationDATA SCIENCE USING SPARK: AN INTRODUCTION
DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data
More informationArcGIS Enterprise: Architecture & Deployment. Anthony Myers
ArcGIS Enterprise: Architecture & Deployment Anthony Myers 1 2 3 4 5 Web GIS Overview of ArcGIS Enterprise Federation & Hosted Server Deployment Patterns Implementation 1 Web GIS ArcGIS Enabling GIS for
More informationEMC Forum 2014 EMC ViPR and ECS: A Lap Around Software-Defined Services. Magnus Nilsson Blog: purevirtual.
EMC Forum 2014 EMC ViPR and ECS: A Lap Around Software-Defined Services Magnus Nilsson magnus.nilsson@emc.com Twitter: @swevm Blog: purevirtual.eu 1 Session Agenda Market Dynamics EMC ViPR Overview What
More informationCertified Big Data and Hadoop Course Curriculum
Certified Big Data and Hadoop Course Curriculum The Certified Big Data and Hadoop course by DataFlair is a perfect blend of in-depth theoretical knowledge and strong practical skills via implementation
More informationMySQL High Availability
MySQL High Availability InnoDB Cluster and NDB Cluster Ted Wennmark ted.wennmark@oracle.com Copyright 2016, Oracle and/or its its affiliates. All All rights reserved. Safe Harbor Statement The following
More informationWe are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info
We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME : LAB TIMINGS : PH NO: 9963799240, 040-40025423
More informationThe Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou
The Hadoop Ecosystem EECS 4415 Big Data Systems Tilemachos Pechlivanoglou tipech@eecs.yorku.ca A lot of tools designed to work with Hadoop 2 HDFS, MapReduce Hadoop Distributed File System Core Hadoop component
More informationNovember 7, DAN WILSON Global Operations Architecture, Concur. OpenStack Summit Hong Kong JOE ARNOLD
November 7, 2013 DAN WILSON Global Operations Architecture, Concur dan.wilson@concur.com @tweetdanwilson OpenStack Summit Hong Kong JOE ARNOLD CEO, SwiftStack joe@swiftstack.com @joearnold Introduction
More informationCertified Big Data Hadoop and Spark Scala Course Curriculum
Certified Big Data Hadoop and Spark Scala Course Curriculum The Certified Big Data Hadoop and Spark Scala course by DataFlair is a perfect blend of indepth theoretical knowledge and strong practical skills
More informationSAP VORA 1.4 on AWS - MARKETPLACE EDITION FREQUENTLY ASKED QUESTIONS
SAP VORA 1.4 on AWS - MARKETPLACE EDITION FREQUENTLY ASKED QUESTIONS 1. What is SAP Vora? SAP Vora is an in-memory, distributed computing solution that helps organizations uncover actionable business insights
More informationSoftware-defined Storage: Fast, Safe and Efficient
Software-defined Storage: Fast, Safe and Efficient TRY NOW Thanks to Blockchain and Intel Intelligent Storage Acceleration Library Every piece of data is required to be stored somewhere. We all know about
More informationEC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures
EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures Haiyang Shi, Xiaoyi Lu, and Dhabaleswar K. (DK) Panda {shi.876, lu.932, panda.2}@osu.edu The Ohio State University
More informationKonstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia,
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA {Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com Presenter: Alex Hu } Introduction } Architecture } File
More informationCloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018
Cloud Computing 3 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning
More informationProcessing of big data with Apache Spark
Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT
More information2013 AWS Worldwide Public Sector Summit Washington, D.C.
2013 AWS Worldwide Public Sector Summit Washington, D.C. EMR for Fun and for Profit Ben Butler Sr. Manager, Big Data butlerb@amazon.com @bensbutler Overview 1. What is big data? 2. What is AWS Elastic
More informationAn Introduction to Apache Spark
An Introduction to Apache Spark 1 History Developed in 2009 at UC Berkeley AMPLab. Open sourced in 2010. Spark becomes one of the largest big-data projects with more 400 contributors in 50+ organizations
More informationActivator Library. Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success.
Focus on maximizing the value of your data, gain business insights, increase your team s productivity, and achieve success. ACTIVATORS Designed to give your team assistance when you need it most without
More informationBIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE
BIG DATA AND HADOOP ON THE ZFS STORAGE APPLIANCE BRETT WENINGER, MANAGING DIRECTOR 10/21/2014 ADURANT APPROACH TO BIG DATA Align to Un/Semi-structured Data Instead of Big Scale out will become Big Greatest
More informationHadoop. Introduction / Overview
Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures
More informationBig Data Hadoop Course Content
Big Data Hadoop Course Content Topics covered in the training Introduction to Linux and Big Data Virtual Machine ( VM) Introduction/ Installation of VirtualBox and the Big Data VM Introduction to Linux
More informationThe SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.
Dublin Apache Kafka Meetup, 30 August 2017 The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Joseph @pleia2 * ASF projects 1 Elizabeth K. Joseph, Developer Advocate Developer Advocate
More informationHyper-Convergence De-mystified. Francis O Haire Group Technology Director
Hyper-Convergence De-mystified Francis O Haire Group Technology Director The Cloud Era Is Well Underway Rapid Time to Market I deployed my application in five minutes. Fractional IT Consumption I use and
More informationFrom Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019
From Single Purpose to Multi Purpose Data Lakes Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019 Agenda Data Lakes Multiple Purpose Data Lakes Customer Example Demo Takeaways
More informationMODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS
MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale
More informationTop 25 Big Data Interview Questions And Answers
Top 25 Big Data Interview Questions And Answers By: Neeru Jain - Big Data The era of big data has just begun. With more companies inclined towards big data to run their operations, the demand for talent
More informationData Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros
Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on
More informationEMC Forum EMC ViPR and ECS: A Lap Around Software-Defined Services
EMC Forum 2014 Copyright 2014 EMC Corporation. All rights reserved. 1 EMC ViPR and ECS: A Lap Around Software-Defined Services 2 Session Agenda Market Dynamics EMC ViPR Overview What s New in ViPR Controller
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationLeveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands
Leveraging Software-Defined Storage to Meet Today and Tomorrow s Infrastructure Demands Unleash Your Data Center s Hidden Power September 16, 2014 Molly Rector CMO, EVP Product Management & WW Marketing
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationMOHA: Many-Task Computing Framework on Hadoop
Apache: Big Data North America 2017 @ Miami MOHA: Many-Task Computing Framework on Hadoop Soonwook Hwang Korea Institute of Science and Technology Information May 18, 2017 Table of Contents Introduction
More informationNew Fresh Storage Approach for New IT Challenges Laurent Denel Philippe Nicolas OpenIO
New Fresh Storage Approach for New IT Challenges Laurent Denel Philippe Nicolas OpenIO Agenda Company profile and background Business and Users needs OpenIO approach Competition Conclusion Company profile
More informationDisclaimer This presentation may contain product features that are currently under development. This overview of new technology represents no commitme
VIRT1351BE New Architectures for Virtualizing Spark and Big Data Workloads on vsphere Justin Murray Mohan Potheri VMworld 2017 Content: Not for publication #VMworld #VIRT1351BE Disclaimer This presentation
More informationToward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store. Wei Xie TTU CS Department Seminar, 3/7/2017
Toward Energy-efficient and Fault-tolerant Consistent Hashing based Data Store Wei Xie TTU CS Department Seminar, 3/7/2017 1 Outline General introduction Study 1: Elastic Consistent Hashing based Store
More informationChapter 5. The MapReduce Programming Model and Implementation
Chapter 5. The MapReduce Programming Model and Implementation - Traditional computing: data-to-computing (send data to computing) * Data stored in separate repository * Data brought into system for computing
More informationHadoop & Big Data Analytics Complete Practical & Real-time Training
An ISO Certified Training Institute A Unit of Sequelgate Innovative Technologies Pvt. Ltd. www.sqlschool.com Hadoop & Big Data Analytics Complete Practical & Real-time Training Mode : Instructor Led LIVE
More informationTHE ZADARA CLOUD. An overview of the Zadara Storage Cloud and VPSA Storage Array technology WHITE PAPER
WHITE PAPER THE ZADARA CLOUD An overview of the Zadara Storage Cloud and VPSA Storage Array technology Zadara 6 Venture, Suite 140, Irvine, CA 92618, USA www.zadarastorage.com EXECUTIVE SUMMARY The IT
More information利用 Mesos 打造高延展性 Container 環境. Frank, Microsoft MTC
利用 Mesos 打造高延展性 Container 環境 Frank, Microsoft MTC About Me Developer @ Yahoo! DevOps @ HTC Technical Architect @ MSFT Agenda About Docker Manage containers Apache Mesos Mesosphere DC/OS application = application
More informationHadoop File System S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y 11/15/2017
Hadoop File System 1 S L I D E S M O D I F I E D F R O M P R E S E N T A T I O N B Y B. R A M A M U R T H Y Moving Computation is Cheaper than Moving Data Motivation: Big Data! What is BigData? - Google
More informationData 101 Which DB, When. Joe Yong Azure SQL Data Warehouse, Program Management Microsoft Corp.
Data 101 Which DB, When Joe Yong (joeyong@microsoft.com) Azure SQL Data Warehouse, Program Management Microsoft Corp. The world is changing AI increased by 300% in 2017 Data will grow to 44 ZB in 2020
More informationIntroduction To YARN. Adam Kawa, Spotify The 9 Meeting of Warsaw Hadoop User Group 2/23/13
Introduction To YARN Adam Kawa, Spotify th The 9 Meeting of Warsaw Hadoop User Group About Me Data Engineer at Spotify, Sweden Hadoop Instructor at Compendium (Cloudera Training Partner) +2.5 year of experience
More informationHadoop Development Introduction
Hadoop Development Introduction What is Bigdata? Evolution of Bigdata Types of Data and their Significance Need for Bigdata Analytics Why Bigdata with Hadoop? History of Hadoop Why Hadoop is in demand
More informationAurora, RDS, or On-Prem, Which is right for you
Aurora, RDS, or On-Prem, Which is right for you Kathy Gibbs Database Specialist TAM Katgibbs@amazon.com Santa Clara, California April 23th 25th, 2018 Agenda RDS Aurora EC2 On-Premise Wrap-up/Recommendation
More informationBig data systems 12/8/17
Big data systems 12/8/17 Today Basic architecture Two levels of scheduling Spark overview Basic architecture Cluster Manager Cluster Cluster Manager 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores
More informationOracle NoSQL Database Enterprise Edition, Version 18.1
Oracle NoSQL Database Enterprise Edition, Version 18.1 Oracle NoSQL Database is a scalable, distributed NoSQL database, designed to provide highly reliable, flexible and available data management across
More informationBlended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)
Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance
More informationMothra: A Large-Scale Data Processing Platform for Network Security Analysis
Mothra: A Large-Scale Data Processing Platform for Network Security Analysis Tony Cebzanov Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 REV-03.18.2016.0 1 Agenda Introduction
More informationIntegrate MATLAB Analytics into Enterprise Applications
Integrate Analytics into Enterprise Applications Aurélie Urbain MathWorks Consulting Services 2015 The MathWorks, Inc. 1 Data Analytics Workflow Data Acquisition Data Analytics Analytics Integration Business
More informationScaling MATLAB. for Your Organisation and Beyond. Rory Adams The MathWorks, Inc. 1
Scaling MATLAB for Your Organisation and Beyond Rory Adams 2015 The MathWorks, Inc. 1 MATLAB at Scale Front-end scaling Scale with increasing access requests Back-end scaling Scale with increasing computational
More informationL5-6:Runtime Platforms Hadoop and HDFS
Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत Department of Computational and Data Sciences SE256:Jan16 (2:1) L5-6:Runtime Platforms Hadoop and HDFS Yogesh Simmhan 03/
More informationCSE 444: Database Internals. Lecture 23 Spark
CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei
More informationBuild an open hybrid cloud and paint it red and blue
Build an open hybrid cloud and paint it red and blue Khaled Elbedri Technical sales lead, Microsoft Ismail Dhaoui EMEA Senior Specialist Solutions Architect, Red Hat Tuesday, May 8, 2018 Agenda RH & MS
More informationSFO15-TR6: Hadoop on ARM
SFO15-TR6: Hadoop on ARM Presented by Nachiket Bhoyar Steve Capper Date Wednesday 23 September 2015 Nachiket Bhoyar Steve Capper Event SFO15 Agenda 1. Quick intro to Hadoop stack. 2. Summary of our work.
More informationHDFS Federation. Sanjay Radia Founder and Hortonworks. Page 1
HDFS Federation Sanjay Radia Founder and Architect @ Hortonworks Page 1 About Me Apache Hadoop Committer and Member of Hadoop PMC Architect of core-hadoop @ Yahoo - Focusing on HDFS, MapReduce scheduler,
More informationVision of the Software Defined Data Center (SDDC)
Vision of the Software Defined Data Center (SDDC) Raj Yavatkar, VMware Fellow Vijay Ramachandran, Sr. Director, Storage Product Management Business transformation and disruption A software business that
More informationData 101 Which DB, When Joe Yong Sr. Program Manager Microsoft Corp.
17-18 March, 2018 Beijing Data 101 Which DB, When Joe Yong Sr. Program Manager Microsoft Corp. The world is changing AI increased by 300% in 2017 Data will grow to 44 ZB in 2020 Today, 80% of organizations
More informationIsilon: Raising The Bar On Performance & Archive Use Cases. John Har Solutions Product Manager Unstructured Data Storage Team
Isilon: Raising The Bar On Performance & Archive Use Cases John Har Solutions Product Manager Unstructured Data Storage Team What we ll cover in this session Isilon Overview Streaming workflows High ops/s
More informationPlanning for the HDP Cluster
3 Planning for the HDP Cluster Date of Publish: 2018-08-30 http://docs.hortonworks.com Contents... 3 Typical Hadoop Cluster...3 Typical Workload Patterns For Hadoop... 3 Early Deployments...4 Server Node
More informationSPLUNK ENTERPRISE AND ECS TECHNICAL SOLUTION GUIDE
SPLUNK ENTERPRISE AND ECS TECHNICAL SOLUTION GUIDE Splunk Frozen and Archive Buckets on ECS ABSTRACT This technical solution guide describes a solution for archiving Splunk frozen buckets to ECS. It also
More informationThe Evolving Apache Hadoop Ecosystem What it means for Storage Industry
The Evolving Apache Hadoop Ecosystem What it means for Storage Industry Sanjay Radia Architect/Founder, Hortonworks Inc. All Rights Reserved Page 1 Outline Hadoop (HDFS) and Storage Data platform drivers
More informationNext-Generation Cloud Platform
Next-Generation Cloud Platform Jangwoo Kim Jun 24, 2013 E-mail: jangwoo@postech.ac.kr High Performance Computing Lab Department of Computer Science & Engineering Pohang University of Science and Technology
More informationBlurring the Line Between Developer and Data Scientist
Blurring the Line Between Developer and Data Scientist Notebooks with PixieDust va barbosa va@us.ibm.com Developer Advocacy IBM Watson Data Platform WHY ARE YOU HERE? More companies making bet-the-business
More informationScalable Tools - Part I Introduction to Scalable Tools
Scalable Tools - Part I Introduction to Scalable Tools Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, adisak@iastate.edu http://web.cs.iastate.edu/~adisak/mbds2018/ Scalable Tools session
More informationUsing DC/OS for Continuous Delivery
Using DC/OS for Continuous Delivery DevPulseCon 2017 Elizabeth K. Joseph, @pleia2 Mesosphere 1 Elizabeth K. Joseph, Developer Advocate, Mesosphere 15+ years working in open source communities 10+ years
More informationDell Technologies IoT Solution Surveillance with Genetec Security Center
Dell Technologies IoT Solution Surveillance with Genetec Security Center Surveillance December 2018 H17436 Sizing Guide Abstract The purpose of this guide is to help you understand the benefits of using
More informationHow we built a highly scalable Machine Learning platform using Apache Mesos
How we built a highly scalable Machine Learning platform using Apache Mesos Daniel Sârbe Development Manager, BigData and Cloud Machine Translation @ SDL Co-founder of BigData/DataScience Meetup Cluj,
More informationElastic Cloud Storage (ECS)
Elastic Cloud Storage (ECS) Version 3.1 Administration Guide 302-003-863 02 Copyright 2013-2017 Dell Inc. or its subsidiaries. All rights reserved. Published September 2017 Dell believes the information
More informationAutomatic-Hot HA for HDFS NameNode Konstantin V Shvachko Ari Flink Timothy Coulter EBay Cisco Aisle Five. November 11, 2011
Automatic-Hot HA for HDFS NameNode Konstantin V Shvachko Ari Flink Timothy Coulter EBay Cisco Aisle Five November 11, 2011 About Authors Konstantin Shvachko Hadoop Architect, ebay; Hadoop Committer Ari
More informationComposable Infrastructure for Public Cloud Service Providers
Composable Infrastructure for Public Cloud Service Providers Composable Infrastructure Delivers a Cost Effective, High Performance Platform for Big Data in the Cloud How can a public cloud provider offer
More informationVMware vsphere Big Data Extensions Command-Line Interface Guide
VMware vsphere Big Data Extensions Command-Line Interface Guide vsphere Big Data Extensions 1.1 This document supports the version of each product listed and supports all subsequent versions until the
More informationWebinar Series TMIP VISION
Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing
More informationHadoop An Overview. - Socrates CCDH
Hadoop An Overview - Socrates CCDH What is Big Data? Volume Not Gigabyte. Terabyte, Petabyte, Exabyte, Zettabyte - Due to handheld gadgets,and HD format images and videos - In total data, 90% of them collected
More informationHadoop, Yarn and Beyond
Hadoop, Yarn and Beyond 1 B. R A M A M U R T H Y Overview We learned about Hadoop1.x or the core. Just like Java evolved, Java core, Java 1.X, Java 2.. So on, software and systems evolve, naturally.. Lets
More information