PROCESSING THE GAIA DATA IN CNES: THE GREAT ADVENTURE INTO HADOOP WORLD
1 PROCESSING THE GAIA DATA IN CNES: THE GREAT ADVENTURE INTO HADOOP WORLD
CHAOUL Laurence, VALETTE Véronique, CNES, Toulouse
BIDS 16, March 15-17th 2016
2 AGENDA
- THE GAIA MISSION AND DPAC ARCHITECTURE
- THE DPCC ARCHITECTURE
- FIRST DPCC RESULTS
- FIRST LESSONS LEARNED ON HADOOP
4 The Gaia Mission
- An ESA mission to build a 3D map of 1 billion stars from our Galaxy
- Gaia was launched on December 19th 2013 by a Soyuz-Fregat from Kourou
- The 2-ton satellite orbits around the Lagrange L2 point, 1.5 million kilometres from Earth
- The mission is foreseen to last at least 5 years
- Scientific data processing is delegated to the DPAC Consortium
5 The Gaia data flow
[Figure labels: ~30 GB/day; 8 hrs/day; Cebreros / New Norcia; ESOC MOC; ESAC SOC. Credit: DPACE]
6 DPC various architectures overview
[Figure label: Postgres-XC. Credit: DPACE]
8 DPCC Architecture
DPCC challenges to fulfill
- 1 billion stars to process, each star seen 80 times on average
- Some tables hold up to 80 billion rows
- High level of parallelization, with several chains running at the same time
- Various kinds of complex algorithms (object-by-object or global processing)
- Volume increasing throughout the mission, up to 3 PB
Hadoop solution selected in 2010
- Horizontal linear scalability allowing storage and computing power growth
- Parallelization with process localization
- MapReduce and Resource Manager (YARN): highly distributed processing
- HDFS: distributed data storage
- PHOEBUS to orchestrate all the processing in the cluster
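The table-size figure on this slide follows directly from the star count and the average number of observations per star. A minimal Python check of that arithmetic, with illustrative names that do not come from the slides:

```python
# Back-of-the-envelope check of the sizing figures quoted on the slide.
STARS = 1_000_000_000        # 1 billion stars to process
OBSERVATIONS_PER_STAR = 80   # each star is seen about 80 times on average


def observation_rows(stars: int, observations_per_star: int) -> int:
    """Rows in a table that stores one row per observation."""
    return stars * observations_per_star


rows = observation_rows(STARS, OBSERVATIONS_PER_STAR)
print(f"{rows:,} rows")  # 80,000,000,000 rows
```

The result matches the "up to 80 billion rows" order of magnitude given for the largest tables.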
9 The DPCC platforms
Based on standard servers
- Dell PowerEdge *Intel E CPU with 6 cores @ 2.5 GHz
  » 64 GB RAM (5.33 GB/core)
  » 3*4 TB 7.2k RPM disks
Operational platform
- 2*Intel E CPU with 10 cores @ 2.5 GHz
- 128 GB RAM (6.4 GB/core)
- 3*4 TB 7.2k RPM disks
- Servers bought progressively, according to computing and storage needs
- 72 compute nodes (1152 cores / 240 TB HDFS / 20 TB GFS) today
- 400 Datanodes (~6000 cores) at the end of the mission
Validation platform
- Same architecture as the OPS one
- 28 Datanodes
- 512 cores
- 100 TB effective HDFS
- 8 TB GFS
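The memory-per-core ratios quoted above can be re-derived from the RAM size and the core count. The sketch below does this in Python. The function name is illustrative, and the 12-core total for the 64 GB servers is inferred from the quoted ratio, since the CPU count for that configuration is not legible in the transcript.

```python
def gb_per_core(ram_gb: float, cpus: int, cores_per_cpu: int) -> float:
    """RAM available per physical core, in GB."""
    return ram_gb / (cpus * cores_per_cpu)


# 128 GB servers: 2 CPUs with 10 cores each, as stated on the slide.
ops_ratio = gb_per_core(128, cpus=2, cores_per_cpu=10)
print(ops_ratio)  # 6.4

# 64 GB servers: the slide quotes 5.33 GB/core, which implies about 12 cores.
implied_cores = round(64 / 5.33)
print(implied_cores)  # 12
```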
10 GaiaWeb: Data access for the scientific community
- Web portal enabling the scientific community to access data produced and stored on DPCC clusters
- Hard-wired statistical plots
- Follow-up of processing chain behavior and status
- Several million indexes in an ElasticSearch engine: real-time/reactive access
- On-demand queries
  » To extract data for validation purposes or problem investigation
  » SQL-like queries translated into jobs submitted on the cluster, executed as Map/Reduce tasks
12 First DPCC results using Hadoop (1/2)
Daily chains executed on the 1152 cores of the OPS platform:
- Spectroscopic Daily chain, in routine mode
  » Executed every day as soon as data are received from DPCE
  » Around 10 million observations processed in about 6 hours
- Solar System Objects Daily chain
  » Not yet run in routine mode (still needs some scientific improvements)
  » But the chain manages to process about 64 million observations (with solar objects) in 2 hours
- Both chains can run in parallel, with a good distribution of the jobs across all the available cores
=> The relevance of the Hadoop solution is proven
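The figures above can be turned into approximate throughputs. The Python sketch below uses only the rounded numbers from the slide, so the results are orders of magnitude. The two chains run different algorithms, so the rates are not a like-for-like comparison. The function name is illustrative.

```python
def throughput_per_second(observations: int, hours: float) -> float:
    """Average number of observations processed per second."""
    return observations / (hours * 3600)


# Spectroscopic Daily chain: about 10 million observations in about 6 hours.
spectro = throughput_per_second(10_000_000, 6)
# Solar System Objects Daily chain: about 64 million observations in 2 hours.
sso = throughput_per_second(64_000_000, 2)

print(round(spectro))  # 463
print(round(sso))      # 8889
```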
13 First DPCC results using Hadoop (2/2)
Cyclic chains executed on the VAL platform: CU4 NSS and CU8 Apsis
Table columns, for each chain: Total Duration, # Objects, Duration (ms/object)
Insertion: 03:08: ,002 01:06: ,003
Ingestion: 14:00: ,042 06:40: ,018
Processing: 02:00: ,037 00:07: ,091 10:10: ,304
Performance strongly depends on the kind of processing
- Insertion of input data in HDFS: highly distributed
- Ingestion: transformation of input data into objects specific to the chain
  » A lot of joins between different tables, so every data item has to be read at least once
  » Can include some scientific computations, so performance results vary
- Processing
  » Linked to the scientific algorithms themselves
15 Performances (1/3)
Performance monitoring
- Performance follow-up is very complex
  » The Hadoop API gives a lot of statistics for each job; the chain-level performance view has to be consolidated separately
  » It is difficult to extrapolate chain performance, as it depends on the other chains running in parallel
- Some tools are being developed at DPCC to aggregate all these statistics and closely monitor the performance of each chain
  » To anticipate possible overflow
  » To size the next hardware purchase
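The consolidation step described above, going from per-job statistics to a per-chain view, can be sketched in a few lines of Python. This is only an illustration: the record fields, the chain names and the sample values are all made up, and the slides do not describe how the DPCC tools are actually built.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JobStats:
    chain: str        # chain the job belongs to (hypothetical tag)
    elapsed_s: float  # wall-clock duration of the job, in seconds
    cpu_s: float      # CPU time summed over all tasks of the job, in seconds


def consolidate(jobs):
    """Sum per-job statistics into one entry per chain."""
    totals = defaultdict(lambda: {"jobs": 0, "elapsed_s": 0.0, "cpu_s": 0.0})
    for job in jobs:
        entry = totals[job.chain]
        entry["jobs"] += 1
        # Summed job durations; not the chain's wall-clock time if jobs overlap.
        entry["elapsed_s"] += job.elapsed_s
        entry["cpu_s"] += job.cpu_s
    return dict(totals)


# Made-up sample records, for illustration only.
sample = [
    JobStats("chain_a", 120.0, 900.0),
    JobStats("chain_a", 60.0, 300.0),
    JobStats("chain_b", 30.0, 45.0),
]
summary = consolidate(sample)
for chain, totals in sorted(summary.items()):
    print(chain, totals)
```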
16 Performances (2/3)
Performance closely linked to the design of the chain
- Steps with different data access typology (steps that need all the data, or steps with filters applied)
- Consequence: some steps are fully scalable, others are not
17 Performances (3/3)
Performance closely linked to the design of the chain
- Each step of the chain is designed to be executed star by star => highly scalable design
- The design of the chain must be scalable to fully benefit from Hadoop mechanisms
- This needs to be anticipated at the beginning of development
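A star-by-star design maps naturally onto the map / shuffle / reduce pattern: observations are keyed by star, grouped, and each star is then processed independently of all the others. The sketch below is a pure-Python imitation of that pattern, not Hadoop API code. The `(star_id, value)` record layout is an assumption, and the mean is a stand-in for the real scientific algorithm.

```python
from collections import defaultdict
from statistics import mean


def map_observation(observation):
    """Map step: key each observation by its star identifier."""
    star_id, value = observation
    return star_id, value


def reduce_star(star_id, values):
    """Reduce step: process one star using only its own observations."""
    return star_id, mean(values)


def run(observations):
    # Shuffle: group the mapped values by key, as the framework would do.
    groups = defaultdict(list)
    for observation in observations:
        key, value = map_observation(observation)
        groups[key].append(value)
    # Each group is independent, so the reduce calls could run on any node.
    return dict(reduce_star(key, values) for key, values in groups.items())


result = run([(1, 10.0), (2, 4.0), (1, 14.0)])
print(result)  # {1: 12.0, 2: 4.0}
```

Because no reduce call needs data from another star, adding nodes adds capacity without changing the code, which is the scalability property the slide refers to.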
18 Hadoop fine tuning (1/3)
Quite complex fine tuning to obtain optimised performance
Configuration of Hadoop queues
- Allocate a Hadoop queue to each chain to guarantee it a given fraction of the cluster capacity
- A good configuration must be found to avoid reserving too many resources for one chain
19 Hadoop fine tuning (2/3)
Quite complex fine tuning to obtain optimised performance
Configuration of Hadoop queues
The queue elasticity option
- A given chain is allowed to use the resources of another queue when they are not in use
- Good usage of all the available resources
- Resources are allocated dynamically, so the performance of a chain is very difficult to analyse
The pre-emption option
- A job can kill another job that has overflowed into its queue
- The priorities defined by the queues are respected
- But a job that has been running for hours can be killed when it is almost done
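The effect of queue elasticity can be illustrated with a toy allocation model in Python. This is a simplified sketch, not the YARN scheduler: the queue names, the 50/50 split and the demands are hypothetical, and idle cores are handed out greedily in dictionary order. The slides do not say which scheduler is used. In the YARN Capacity Scheduler, the guaranteed share and the elastic ceiling correspond to the per-queue `capacity` and `maximum-capacity` properties.

```python
def allocate(total_cores, guaranteed_pct, demand, elastic=True):
    """Toy model: guaranteed shares first, then optional borrowing of idle cores."""
    # Step 1: each queue gets what it asks for, capped at its guaranteed share.
    allocation = {
        queue: min(demand[queue], total_cores * pct // 100)
        for queue, pct in guaranteed_pct.items()
    }
    if not elastic:
        return allocation
    # Step 2 (elasticity): idle cores go to queues that still want more.
    idle = total_cores - sum(allocation.values())
    for queue in guaranteed_pct:
        extra = min(idle, demand[queue] - allocation[queue])
        allocation[queue] += extra
        idle -= extra
    return allocation


shares = {"queue_a": 50, "queue_b": 50}   # hypothetical split of the cluster
demand = {"queue_a": 900, "queue_b": 200}  # cores each chain could use right now

print(allocate(1152, shares, demand, elastic=False))  # {'queue_a': 576, 'queue_b': 200}
print(allocate(1152, shares, demand, elastic=True))   # {'queue_a': 900, 'queue_b': 200}
```

With elasticity the busy queue borrows the cores the other queue leaves idle, which uses the cluster better. The same mechanism makes the capacity actually given to a chain vary from run to run, which is why performance becomes hard to analyse.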
20 Hadoop fine tuning (3/3)
Storage management / replication tuning
- A replication factor of 3 is recommended
- But the replication can be configured for each data item written in HDFS
=> Ability to tune the replication according to the criticality of the produced data and the available storage
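The storage side of this trade-off is simple arithmetic: with a replication factor r, every block is stored r times, so usable space is roughly the raw space divided by r. The Python sketch below shows this with a hypothetical 300 TB of raw disk. It ignores space reserved for the operating system and temporary files. In HDFS the default factor is set with the `dfs.replication` property, and it can be changed for existing paths with `hdfs dfs -setrep`.

```python
def effective_capacity_tb(raw_tb: float, replication: int) -> float:
    """Usable capacity when every block is stored `replication` times."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return raw_tb / replication


RAW_TB = 300.0  # hypothetical raw disk space, for illustration only
for factor in (1, 2, 3):
    print(factor, effective_capacity_tb(RAW_TB, factor))
```

Lowering the factor for easily reproducible intermediate products frees space, at the price of less protection against disk or node loss.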
21 Hardware management
Very good management of heterogeneous machines
- Servers of different generations, with different characteristics (number of nodes, memory, disk space)
- Transparent to DPCC operations
A lot of available tools to ease the deployment
- Complete server lifecycle (from out of the box to production)
- Same management and monitoring for 10 or 500 servers
22 Conclusions / Perspectives
- Promising results, but further and deeper performance monitoring is still needed
  » To allow Hadoop fine-tuning and to consolidate hardware selection for the next purchase
The next steps:
- GDR1 (Aug 2016): positions (α, δ) and G-magnitudes; at least 90% of the sky can be covered (objects with single star behavior)
- GDR2 (mid-2017): five-parameter astrometric solution of objects with single star behavior (90% of the sky), integrated BP/RP photometry, mean radial velocities for objects showing no radial velocity variation
- GDR3 (Jan 2021): final catalogue release
[Timeline figure: Cycle 0, July 14 to Aug 15 (13 months); Cycle 1, Aug 15 to May 16 (8 months, "We are here"); Cycle 2, May 16 to Dec 16 (7 months); Cycle 3, Dec 16 to Dec 17 (12 months); end of mission 2019/2020. Daily chains: CU4/SSO-ST and CU6 Daily. Chains shown in the reprocessing cycles: CU6 Global R2, CU8 Apsis, CU4 SSO-LTa, CU4 SSO-LTb, CU4 NSS, CU4 EO.]
23 Questions?
Analytics Platform for ATLAS Computing Services Ilija Vukotic for the ATLAS collaboration ICHEP 2016, Chicago, USA Getting the most from distributed resources What we want To understand the system To understand
More informationScheduling Applications at Scale
Scheduling Applications at Scale Meeting Tomorrow's Application Needs, Today http://1stchoicesportsrehab.com/wp-content/uploads/2012/05/calendar.jpg SETH VARGO @sethvargo Globally Distributed Optimistically
More informationCloud Computing Capacity Planning
Cloud Computing Capacity Planning Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Introduction One promise of cloud computing is that virtualization
More informationMassive Online Analysis - Storm,Spark
Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R
More informationAn Insider s Guide to Oracle Autonomous Transaction Processing
An Insider s Guide to Oracle Autonomous Transaction Processing Maria Colgan Master Product Manager Troy Anthony Senior Director, Product Management #thinkautonomous Autonomous Database Traditionally each
More informationLenovo Database Configuration for Microsoft SQL Server TB
Database Lenovo Database Configuration for Microsoft SQL Server 2016 22TB Data Warehouse Fast Track Solution Data Warehouse problem and a solution The rapid growth of technology means that the amount of
More informationexam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0
70-775.exam Number: 70-775 Passing Score: 800 Time Limit: 120 min File Version: 1.0 Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight Version 1.0 Exam A QUESTION 1 You use YARN to
More informationMaking the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor
Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack Chief Architect RainStor Agenda Importance of Hadoop + data compression Data compression techniques Compression,
More information