Large Scale Data Management of Astronomical Surveys with AstroSpark


1 Large Scale Data Management of Astronomical Surveys with AstroSpark

Mariem BRAHEM (1,2), Karine ZEITOUNI (1), Laurent YEH (1)
(1) DAVID Lab, University of Versailles
(2) CNES, Centre National d'Études Spatiales, Toulouse

XLDB, Extremely Large Databases Conference, Clermont-Ferrand, October 2017

2 Context & Motivation

Big Data in the field of Cosmology
- Gaia mission: 3D map of our Galaxy, 1 billion stars observed, > 1 PB, Dec
- LSST Project: > 10 billion objects, > 15 PB for the catalog, > 2020

Astronomers need an efficient solution for large-scale astronomical data handling.

AstroSpark - XLDB

3 Contributions

We propose new algorithms and a framework, AstroSpark, which:
- extends Apache Spark, a distributed in-memory computing engine, to process and analyze astronomical data
  - implements operators such as Cone-Search, Cross-Match, kNN
- offers an expressive programming interface by supporting the unified query language ADQL
- combines data partitioning and indexing with HEALPIX, a pixelization of data on the sphere, to speed up query processing
  - e.g., HX-Match uses HEALPIX to speed up Cross-Match
- implements a query optimizer and provides a set of customized strategies for astronomical queries
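To make the Cone-Search and kNN operators concrete, here is a minimal brute-force sketch in plain Python. It is not AstroSpark code: the function names, the `(id, ra, dec)` row layout and the toy catalog are assumptions made for illustration, and the angular separation is computed from unit vectors rather than with any specific library.

```python
import heapq
import math


def unit_vector(ra_deg, dec_deg):
    """Convert sky coordinates in degrees to a 3D unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))


def separation_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees, derived from the chord between unit vectors."""
    chord = math.dist(unit_vector(ra1, dec1), unit_vector(ra2, dec2))
    return math.degrees(2 * math.asin(min(1.0, chord / 2)))


def cone_search(catalog, ra, dec, radius_deg):
    """Return every row whose position lies within radius_deg of (ra, dec)."""
    return [row for row in catalog
            if separation_deg(ra, dec, row[1], row[2]) <= radius_deg]


def knn(catalog, ra, dec, k):
    """Return the k rows closest to (ra, dec), nearest first."""
    return heapq.nsmallest(
        k, catalog, key=lambda row: separation_deg(ra, dec, row[1], row[2]))


# Hypothetical toy catalog: (id, ra in degrees, dec in degrees).
catalog = [("a", 10.0, 0.0), ("b", 10.5, 0.0), ("c", 12.0, 1.0), ("d", 200.0, -45.0)]
in_cone = [row[0] for row in cone_search(catalog, 10.0, 0.0, 1.0)]
nearest = [row[0] for row in knn(catalog, 10.0, 0.0, 3)]
```

Both functions scan the whole catalog, which is exactly the cost that partitioning and indexing are meant to avoid.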

4 AstroSpark Architecture

Components shown in the architecture diagram:
- Input data: astronomical data
- Querying system: Astronomical Data Query Language (ADQL), a standard of the International Virtual Observatory Alliance (IVOA)
- Query Parser
- Query Optimizer (extended Catalyst)
- Data Partitioning, using the Healpix library
- Spark Core
- Storage (HDFS)

Example ADQL query:

    SELECT *
    FROM gaia JOIN tycho2
    ON 1 = CONTAINS(POINT('ICRS', gaia.ra, gaia.dec),
                    CIRCLE('ICRS', tycho2.ra, tycho2.dec, 2/3600))

5 Healpix Based Data Partitioning

- Locality: close data points are likely to be in the same partition, and all points within a Healpix pixel belong to the same partition
- Balance: partitions have roughly the same size and adapt to data density
- Each node can process many partitions

[Figure: partitions defined by ranges and assigned to Node 1, Node 2 and Node 3. Visualization under ALADIN.]
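The locality and balance properties can be illustrated with a small range-partitioning sketch. It assumes each row already carries a pixel index computed elsewhere by a HEALPix library; the function names and the sample pixel ids are hypothetical, and the boundary rule shown is one simple way to balance ranges, not necessarily the one AstroSpark uses.

```python
from bisect import bisect_left
from collections import Counter


def range_boundaries(pixel_ids, num_partitions):
    """Choose inclusive upper-bound pixel ids for all partitions but the last.

    Boundaries are placed on pixel ids, so rows that share a pixel are never
    split, and each range aims at the same number of rows.
    """
    counts = sorted(Counter(pixel_ids).items())
    target = len(pixel_ids) / num_partitions
    boundaries, filled = [], 0
    for pixel, n in counts:
        filled += n
        if (len(boundaries) < num_partitions - 1
                and filled >= target * (len(boundaries) + 1)):
            boundaries.append(pixel)
    return boundaries


def partition_of(pixel, boundaries):
    """Map a pixel id to the index of the range that contains it."""
    return bisect_left(boundaries, pixel)


# Hypothetical pixel ids with uneven density (pixel 5 is crowded).
pixel_ids = [0, 0, 1, 1, 5, 5, 5, 5, 6, 9, 9, 9]
boundaries = range_boundaries(pixel_ids, 3)
sizes = Counter(partition_of(p, boundaries) for p in pixel_ids)
```

Dense regions end up with narrow pixel ranges and sparse regions with wide ones, which is how range boundaries adapt to data density.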

6 Cross-Matching Example

Identify and correlate objects belonging to different observations.

Could be expressed in Spark SQL:

    SELECT * FROM R JOIN S ON
      (2 * ASIN(SQRT(SIN((DEC2 - DEC)/2) * SIN((DEC2 - DEC)/2)
                     + COS(DEC2) * COS(DEC)
                       * SIN((RA2 - RA)/2) * SIN((RA2 - RA)/2))) <= ε)

But intractable: cross-matching only 200,000 records of Gaia and Tycho-2 takes 13.6 hours, and more than 12 days for 5 million objects in Gaia and Tycho-2!

[Figure: objects of R and S matched within radius ε]
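The join condition above is the haversine formula for angular distance, evaluated for every pair of rows. The following sketch restates it in plain Python and counts the comparisons a naive nested loop performs. The row layout `(id, ra, dec)` in radians and the sample values are assumptions for illustration only.

```python
import math


def haversine(ra1, dec1, ra2, dec2):
    """Angular distance in radians between two sky positions given in radians."""
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec2) * math.cos(dec1) * math.sin((ra2 - ra1) / 2) ** 2)
    return 2 * math.asin(math.sqrt(min(1.0, h)))


def brute_force_match(R, S, eps):
    """Compare every row of R with every row of S (a cartesian product)."""
    pairs, comparisons = [], 0
    for r in R:
        for s in S:
            comparisons += 1
            if haversine(r[1], r[2], s[1], s[2]) <= eps:
                pairs.append((r[0], s[0]))
    return pairs, comparisons


ARCSEC = math.radians(1 / 3600)  # one arcsecond in radians

# Hypothetical rows: (id, ra, dec), all angles in radians.
R = [("r1", math.radians(10), 0.0),
     ("r2", math.radians(50), math.radians(20))]
S = [("s1", math.radians(10) + ARCSEC, 0.0),
     ("s2", math.radians(50), math.radians(20) + 5 * ARCSEC),
     ("s3", math.radians(200), math.radians(-45))]

pairs, comparisons = brute_force_match(R, S, eps=2 * ARCSEC)
```

The number of comparisons is |R| × |S|. As an illustration, if both inputs held 200,000 rows, the loop would evaluate 4 × 10^10 trigonometric predicates, which is why the plan does not scale.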

7 Cross-Matching in AstroSpark

- HX-Match leverages the space indexing & partitioning along with the HEALPIX NASA Library to guide the data access and limit the pairwise distance computation.
- Substitutes the costly cartesian product with an Equi-join + Filter.

[Figure: query plan in Spark SQL compared with the query plan in AstroSpark]
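The idea of replacing a cartesian product with an equi-join followed by a filter can be sketched as follows. This is not HX-Match: a simple RA/Dec grid stands in for HEALPix pixels, all names are hypothetical, and the 3 × 3 neighbourhood lookup is only valid when the search radius is small compared with the cell size and the data lie away from the poles.

```python
import math
from collections import defaultdict


def separation_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine formula), inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(min(1.0, h))))


def cell_of(ra, dec, cell_deg, n_ra):
    """Stand-in spatial index: a grid cell id (NOT a HEALPix pixel)."""
    return (int(ra // cell_deg) % n_ra, int(dec // cell_deg))


def grid_match(R, S, eps_deg, cell_deg=1.0):
    """Equi-join on cell id, then filter candidate pairs by exact distance."""
    n_ra = math.ceil(360 / cell_deg)
    buckets = defaultdict(list)          # build side: S keyed by its own cell
    for s in S:
        buckets[cell_of(s[1], s[2], cell_deg, n_ra)].append(s)
    pairs, comparisons = [], 0
    for r in R:                          # probe side: R looks up 3x3 cells
        i, j = cell_of(r[1], r[2], cell_deg, n_ra)
        neighbours = {((i + di) % n_ra, j + dj)
                      for di in (-1, 0, 1) for dj in (-1, 0, 1)}
        for cell in neighbours:
            for s in buckets.get(cell, ()):
                comparisons += 1
                if separation_deg(r[1], r[2], s[1], s[2]) <= eps_deg:
                    pairs.append((r[0], s[0]))
    return sorted(pairs), comparisons


# Hypothetical rows: (id, ra, dec) in degrees; r2/s2 straddle the RA = 0 wrap.
R = [("r1", 10.0, 0.0), ("r2", 359.9999, 5.0), ("r3", 120.0, -30.0)]
S = [("s1", 10.0002, 0.0001), ("s2", 0.0001, 5.0),
     ("s3", 250.0, 40.0), ("s4", 120.5, -30.0)]
pairs, comparisons = grid_match(R, S, eps_deg=2 / 3600)
```

On this toy input only 3 distance computations are needed instead of the 12 a cartesian product would perform, because rows are compared only when their cell ids match.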

8 Results of cross-matching: GAIA DR1 TYCHO2

[Figures: two plots on a logarithmic scale; gain of the partition materialization]

9 Cone Search & kNN Search

Evaluation on GAIA DR1

[Figure: effect of varying data size on Cone Search]

10 First Results & Impacts

Validated for 3 operators & ADQL queries on real datasets.

Experiments have shown that AstroSpark is effective in processing astronomical data, is scalable, and outperforms the state-of-the-art solutions.

Publications:
- M. Brahem, K. Zeitouni & L. Yeh, HX-MATCH: In-Memory Cross-Matching Algorithm for Astronomical Big Data, International Symposium on Spatial and Temporal Databases (SSTD 2017).
- M. Brahem, K. Zeitouni & L. Yeh, Large Scale Data Management of Astronomical Surveys with AstroSpark, Conference on Big Data from Space (BiDS 2017).
- K. Zeitouni, M. Brahem & L. Yeh, Large Scale Data Management of Astronomical Surveys with AstroSpark, European Week of Astronomy and Space Science (EWASS 2017).
- M. Brahem, K. Zeitouni & L. Yeh, AstroSpark: towards a distributed data server for big data in astronomy, Proceedings of the 3rd ACM SIGSPATIAL PhD Symposium, ACM, 2016.

VIRTUAL OBSERVATORY TECHNOLOGIES

VIRTUAL OBSERVATORY TECHNOLOGIES VIRTUAL OBSERVATORY TECHNOLOGIES / The Johns Hopkins University Moore s Law, Big Data! 2 Outline 3 SQL for Big Data Computing where the bytes are Database and GPU integration CUDA from SQL Data intensive

More information

CDS. André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 de Données astronomiques de Strasbourg, 2SSC-XMM-Newton

CDS. André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 de Données astronomiques de Strasbourg, 2SSC-XMM-Newton Docker @ CDS André Schaaff1, François-Xavier Pineau1, Gilles Landais1, Laurent Michel2 1Centre de Données astronomiques de Strasbourg, 2SSC-XMM-Newton Paul Trehiou Université de technologie de Belfort-Montbéliard

More information

Tutorial "Gaia in the CDS services" Gaia data Heidelberg June 19, 2018 Sébastien Derriere (adapted from Thomas Boch)

Tutorial Gaia in the CDS services Gaia data Heidelberg June 19, 2018 Sébastien Derriere (adapted from Thomas Boch) Tutorial "Gaia in the CDS services" Gaia data workshop @ Heidelberg June 19, 2018 Sébastien Derriere (adapted from Thomas Boch) Each section (numbered 1. to 6.) can be done independently. 1. Explore Gaia

More information

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance

More information

Virtual Observatory publication of interferometry simulations

Virtual Observatory publication of interferometry simulations Virtual Observatory publication of interferometry simulations Anita Richards, Paul Harrison JBCA, University of Manchester Francois Levrier LRA, ENS Paris Nicholas Walton, Eduardo Gonzalez-Solarez IoA,

More information

THE EUCLID ARCHIVE SYSTEM: A DATA-CENTRIC APPROACH TO BIG DATA

THE EUCLID ARCHIVE SYSTEM: A DATA-CENTRIC APPROACH TO BIG DATA THE EUCLID ARCHIVE SYSTEM: A DATA-CENTRIC APPROACH TO BIG DATA Sara Nieto on behalf of B.Altieri, G.Buenadicha, J. Salgado, P. de Teodoro European Space Astronomy Center, European Space Agency, Spain O.R.

More information

CC-IN2P3 / NCSA Meeting May 27-28th,2015

CC-IN2P3 / NCSA Meeting May 27-28th,2015 The IN2P3 LSST Computing Effort Dominique Boutigny (CNRS/IN2P3 and SLAC) on behalf of the IN2P3 Computing Team CC-IN2P3 / NCSA Meeting May 27-28th,2015 OSG All Hands SLAC April 7-9, 2014 1 LSST Computing

More information

Visualization of SDSS III (BOSS) Cosmology Data

Visualization of SDSS III (BOSS) Cosmology Data Visualization of SDSS III (BOSS) Cosmology Data Nazmus Saquib, University of Utah, Salt Lake City, UT 84102 December 16, 2011 Abstract Data collected over three years in the SDSS-III cosmology project

More information

Theme 7 Group 2 Data mining technologies Catalogues crossmatching on distributed database and application on MWA absorption source finding

Theme 7 Group 2 Data mining technologies Catalogues crossmatching on distributed database and application on MWA absorption source finding Theme 7 Group 2 Data mining technologies Catalogues crossmatching on distributed database and application on MWA absorption source finding Crossmatching is a method to find corresponding objects in different

More information

THE EUCLID ARCHIVE SYSTEM: A DATA-CENTRIC APPROACH TO BIG DATA

THE EUCLID ARCHIVE SYSTEM: A DATA-CENTRIC APPROACH TO BIG DATA THE EUCLID ARCHIVE SYSTEM: A DATA-CENTRIC APPROACH TO BIG DATA Rees Williams on behalf of A.N.Belikov, D.Boxhoorn, B. Dröge, J.McFarland, A.Tsyganov, E.A. Valentijn University of Groningen, Groningen,

More information

Europlanet IDIS: Adapting existing VO building blocks to Planetary Sciences

Europlanet IDIS: Adapting existing VO building blocks to Planetary Sciences Europlanet IDIS: Adapting existing VO building blocks to Planetary Sciences B. Cecconi, LESIA, Observatoire de Paris, France Cospar-2012, Mysore EPN/IDIS Building a planetary VO prototype VO = Virtual

More information

The Ef'iciency of Spatial Indexing Methods Applied to Large Astronomical Databases

The Ef'iciency of Spatial Indexing Methods Applied to Large Astronomical Databases The Ef'iciency of Spatial Indexing Methods Applied to Large Astronomical Databases G. B. Berriman and J. C. Good Caltech/IPAC, Mail Stop 100-22, Pasadena, CA 91125 B. Shiao and T. Donaldson Space Telescope

More information

Euclid Archive Science Archive System

Euclid Archive Science Archive System Euclid Archive Science Archive System Bruno Altieri Sara Nieto, Pilar de Teodoro (ESDC) 23/09/2016 Euclid Archive System Overview The EAS Data Processing System (DPS) stores the data products metadata

More information

PROCESSING THE GAIA DATA IN CNES: THE GREAT ADVENTURE INTO HADOOP WORLD

PROCESSING THE GAIA DATA IN CNES: THE GREAT ADVENTURE INTO HADOOP WORLD CHAOUL Laurence, VALETTE Véronique CNES, Toulouse PROCESSING THE GAIA DATA IN CNES: THE GREAT ADVENTURE INTO HADOOP WORLD BIDS 16, March 15-17th 2016 THE GAIA MISSION AND DPAC ARCHITECTURE AGENDA THE DPCC

More information

The Canadian CyberSKA Project

The Canadian CyberSKA Project The Canadian CyberSKA Project A. G. Willis (on behalf of the CyberSKA Project Team) National Research Council of Canada Herzberg Institute of Astrophysics Dominion Radio Astrophysical Observatory May 24,

More information

The IPAC Research Archives. Steve Groom IPAC / Caltech

The IPAC Research Archives. Steve Groom IPAC / Caltech The IPAC Research Archives Steve Groom IPAC / Caltech IPAC overview The Infrared Processing and Analysis Center (IPAC) at Caltech is dedicated to science operations, data archives, and community support

More information

Versatile access to HEALPix based sky region objects within PostgreSQL data bases with PgSphere

Versatile access to HEALPix based sky region objects within PostgreSQL data bases with PgSphere Versatile access to HEALPix based sky region objects within PostgreSQL data bases with PgSphere Markus Nullmeier Zentrum für Astronomie der Universität Heidelberg Astronomisches Rechen Institut mnullmei@ari.uni.heidelberg.de

More information

SDSS Dataset and SkyServer Workloads

SDSS Dataset and SkyServer Workloads SDSS Dataset and SkyServer Workloads Overview Understanding the SDSS dataset composition and typical usage patterns is important for identifying strategies to optimize the performance of the AstroPortal

More information

Performance-related aspects in the Big Data Astronomy Era: architects in software optimization

Performance-related aspects in the Big Data Astronomy Era: architects in software optimization Performance-related aspects in the Big Data Astronomy Era: architects in software optimization Daniele Tavagnacco - INAF-Observatory of Trieste on behalf of EUCLID SDC-IT Design and Optimization image

More information

CDS X-match service API

CDS X-match service API CDS X-match service API François-Xavier Pineau 1, Thomas Boch 1 1 CDS, Observatoire Astronomique de Strasbourg IVOA Interop, Heidelberg François-Xavier Pineau (CDS) CDS X-match API 14/05/2013 1 / 9 Intro

More information

OUZO for indexing sets

OUZO for indexing sets OUZO for indexing sets Accelerating queries to sets with GIN, GiST, and custom indexing extensions Markus Nullmeier Zentrum für Astronomie der Universität Heidelberg Astronomisches Rechen-Institut mnullmei@ari.uni.heidelberg.de

More information

Designing the Future Data Management Environment for [Radio] Astronomy. JJ Kavelaars Canadian Astronomy Data Centre

Designing the Future Data Management Environment for [Radio] Astronomy. JJ Kavelaars Canadian Astronomy Data Centre Designing the Future Data Management Environment for [Radio] Astronomy JJ Kavelaars Canadian Astronomy Data Centre 2 Started working in Radio Data Archiving as Graduate student at Queen s in 1993 Canadian

More information

Technological Challenges in the GAIA Archive

Technological Challenges in the GAIA Archive Technological Challenges in the GAIA Archive Juan Gonzalez jgonzale at sciops.esa.int Jesus Salgado jsalgado at sciops.esa.int ESA Science Archives Team IVOA Interop 2013, Heidelberg May 2013 Presentation

More information

Accelerating queries of set data types with GIN, GiST, and custom indexing extensions

Accelerating queries of set data types with GIN, GiST, and custom indexing extensions Accelerating queries of set data types with GIN, GiST, and custom indexing extensions Markus Nullmeier Zentrum für Astronomie der Universität Heidelberg Astronomisches Rechen-Institut mnullmei@ari.uni.heidelberg.de

More information

A scalability comparison study of data management approaches for smart metering systems

A scalability comparison study of data management approaches for smart metering systems A scalability comparison study of data management approaches for smart metering systems Houssem Chihoub, Chris.ne Collet Grenoble INP houssem.chihoub@imag.fr Journées Plateformes Clermont Ferrand 6-7 octobre

More information

Distance-based Outlier Detection: Consolidation and Renewed Bearing

Distance-based Outlier Detection: Consolidation and Renewed Bearing Distance-based Outlier Detection: Consolidation and Renewed Bearing Gustavo. H. Orair, Carlos H. C. Teixeira, Wagner Meira Jr., Ye Wang, Srinivasan Parthasarathy September 15, 2010 Table of contents Introduction

More information

Submitted to: Dr. Sunnie Chung. Presented by: Sonal Deshmukh Jay Upadhyay

Submitted to: Dr. Sunnie Chung. Presented by: Sonal Deshmukh Jay Upadhyay Submitted to: Dr. Sunnie Chung Presented by: Sonal Deshmukh Jay Upadhyay Submitted to: Dr. Sunny Chung Presented by: Sonal Deshmukh Jay Upadhyay What is Apache Survey shows huge popularity spike for Apache

More information

Introduction to Relational Databases

Introduction to Relational Databases Introduction to Relational Databases Third La Serena School for Data Science: Applied Tools for Astronomy August 2015 Mauro San Martín msmartin@userena.cl Universidad de La Serena Contents Introduction

More information

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Blended Learning Outline: Cloudera Data Analyst Training (171219a) Blended Learning Outline: Cloudera Data Analyst Training (171219a) Cloudera Univeristy s data analyst training course will teach you to apply traditional data analytics and business intelligence skills

More information

Focus Session on Multi-dimensional Data

Focus Session on Multi-dimensional Data Focus Session on Multi-dimensional Data Introduction Mark Allen, Joe Lazio IVOA Interoperability Meeting, ESAC, Madrid, May 20, 2014 CoSADIE Project Science Priority Areas Multi-dimensional Data image:

More information

Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets

Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets Page 1 of 5 1 Year 1 Proposal Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets Year 1 Progress Report & Year 2 Proposal In order to setup the context for this progress

More information

Exploiting Virtual Observatory and Information Technology: Techniques for Astronomy

Exploiting Virtual Observatory and Information Technology: Techniques for Astronomy Exploiting Virtual Observatory and Information Technology: Techniques for Astronomy Nicholas Walton AstroGrid Project Scientist Institute of Astronomy, The University of Cambridge Lecture #3 Goal: Applications

More information

HiPS Hierarchical Progressive Survey

HiPS Hierarchical Progressive Survey International Virtual Observatory Alliance HiPS Hierarchical Progressive Survey Version 1.0 IVOA Note 15 th October 2015 Previous version(s): None Authors: Pierre Fernique [CDS] Mark Allen [CDS] Thomas

More information

ACM MM Dong Liu, Shuicheng Yan, Yong Rui and Hong-Jiang Zhang

ACM MM Dong Liu, Shuicheng Yan, Yong Rui and Hong-Jiang Zhang ACM MM 2010 Dong Liu, Shuicheng Yan, Yong Rui and Hong-Jiang Zhang Harbin Institute of Technology National University of Singapore Microsoft Corporation Proliferation of images and videos on the Internet

More information

NoSQL Databases An efficient way to store and query heterogeneous astronomical data in DACE. Nicolas Buchschacher - University of Geneva - ADASS 2018

NoSQL Databases An efficient way to store and query heterogeneous astronomical data in DACE. Nicolas Buchschacher - University of Geneva - ADASS 2018 NoSQL Databases An efficient way to store and query heterogeneous astronomical data in DACE DACE https://dace.unige.ch Data and Analysis Center for Exoplanets. Facility to store, exchange and analyse data

More information

ESA Science Archives Architecture Evolution. Iñaki Ortiz de Landaluce Science Archives Team 13 th Sept 2013

ESA Science Archives Architecture Evolution. Iñaki Ortiz de Landaluce Science Archives Team 13 th Sept 2013 ESA Science Archives Architecture Evolution Iñaki Ortiz de Landaluce Science Archives Team 13 th Sept 2013 Outline Introduction: ESA Science Archives Archives Architecture Evolution User Interfaces and

More information

Technology for the Virtual Observatory. The Virtual Observatory. Toward a new astronomy. Toward a new astronomy

Technology for the Virtual Observatory. The Virtual Observatory. Toward a new astronomy. Toward a new astronomy Technology for the Virtual Observatory BRAVO Lecture Series, INPE, Brazil July 23-26, 2007 1. Virtual Observatory Summary 2. Service Architecture and XML 3. Building and Using Services 4. Advanced Services

More information

Long-term management of 1000s of All-Sky reference data sets using the HiPS network

Long-term management of 1000s of All-Sky reference data sets using the HiPS network Long-term management of 1000s of All-Sky reference data sets using the HiPS network ADASS October 2016 - Trieste Présenté par P.Fernique, T.Boch, A. Oberto, M. Allen, D. Durand, K. Ebisawa, B. Merin, J.

More information

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) 4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) Benchmark Testing for Transwarp Inceptor A big data analysis system based on in-memory computing Mingang Chen1,2,a,

More information

Malaga workshop, May 2009

Malaga workshop, May 2009 DB multi-depth sky pixelization customizing MySQL with HEALPix and HTM - II Luciano Nicastro 1 & Giorgio Calderone 2 INAF-IASF, 1 Bologna, 2 Palermo Malaga workshop, 18-21 May 2009 Summary Introduction

More information

P Structured Query Language for Virtual Observatory

P Structured Query Language for Virtual Observatory P1.1.23 Structured Query Language for Virtual Observatory Yuji Shirasaki National Astronomical Observatory of Japan, and Masahiro Tanaka (NAOJ), Satoshi Honda (NAOJ), Yoshihiko Mizumoto (NAOJ), Masatoshi

More information

New Trends in Database Systems

New Trends in Database Systems New Trends in Database Systems Ahmed Eldawy 9/29/2016 1 Spatial and Spatio-temporal data 9/29/2016 2 What is spatial data Geographical data Medical images 9/29/2016 Astronomical data Trajectories 3 Application

More information

Building on Existing Communities: the Virtual Astronomical Observatory (and NIST)

Building on Existing Communities: the Virtual Astronomical Observatory (and NIST) Building on Existing Communities: the Virtual Astronomical Observatory (and NIST) Robert Hanisch Space Telescope Science Institute Director, Virtual Astronomical Observatory Data in astronomy 2 ~70 major

More information

What is Data Warehouse like

What is Data Warehouse like What is Data Warehouse like in the Big Data Era? Sales (Asia) Data Warehouse Sales (US) ETL ETL Collects and organizes historical data from multiple sources Inventory Advertising ETL ETL So far Ø Star

More information

Error indexing of vertices in spatial database -- A case study of OpenStreetMap data

Error indexing of vertices in spatial database -- A case study of OpenStreetMap data Error indexing of vertices in spatial database -- A case study of OpenStreetMap data Xinlin Qian, Kunwang Tao and Liang Wang Chinese Academy of Surveying and Mapping 28 Lianhuachixi Road, Haidian district,

More information

Reviving and extending Pgsphere

Reviving and extending Pgsphere Reviving and extending Pgsphere Markus Nullmeier Zentrum für Astronomie der Universität Heidelberg Astronomisches Rechen Institut mnullmei@ari.uni.heidelberg.de Reviving and extending Pgsphere Markus Nullmeier

More information

Extending the SDSS Batch Query System to the National Virtual Observatory Grid

Extending the SDSS Batch Query System to the National Virtual Observatory Grid Extending the SDSS Batch Query System to the National Virtual Observatory Grid María A. Nieto-Santisteban, William O'Mullane Nolan Li Tamás Budavári Alexander S. Szalay Aniruddha R. Thakar Johns Hopkins

More information

The NOAO Data Lab Design, Capabilities and Community Development. Michael Fitzpatrick for the Data Lab Team

The NOAO Data Lab Design, Capabilities and Community Development. Michael Fitzpatrick for the Data Lab Team The NOAO Data Lab Design, Capabilities and Community Development Michael Fitzpatrick for the Data Lab Team What is it? Data Lab is Science Exploration Platform that provides:! Repository for large datasets

More information

Design and Implementation of the Japanese Virtual Observatory (JVO) system Yuji SHIRASAKI National Astronomical Observatory of Japan

Design and Implementation of the Japanese Virtual Observatory (JVO) system Yuji SHIRASAKI National Astronomical Observatory of Japan Design and Implementation of the Japanese Virtual Observatory (JVO) system Yuji SHIRASAKI National Astronomical Observatory of Japan 1 Introduction What can you do on Japanese Virtual Observatory (JVO)?

More information

Populating the Galaxy Zoo

Populating the Galaxy Zoo Populating the Galaxy Zoo Real-time Image Classification with SQL Server R Services David M Smith @revodavid R Community Lead Microsoft Algorithms and Data Science THANKS to all Sponsors! EVENT SPONSORS

More information

BigDataBench- S: An Open- source Scien6fic Big Data Benchmark Suite

BigDataBench- S: An Open- source Scien6fic Big Data Benchmark Suite BigDataBench- S: An Open- source Scien6fic Big Data Benchmark Suite Xinhui Tian, Shaopeng Dai, Zhihui Du, Wanling Gao, Rui Ren, Yaodong Cheng, Zhifei Zhang, Zhen Jia, Peijian Wang and Jianfeng Zhan INSTITUTE

More information

Fishing Activity Visualization with Free Software Bigdata Analytics Institute

Fishing Activity Visualization with Free Software Bigdata Analytics Institute Fishing Activity Visualization with Free Software Bigdata Analytics Institute Erico N de Souza, PhD erico.souza@dal.ca Souza, Latouf (Bigdata Inst.) Bigdata Institute 1 / 22 Introduction What would you

More information

The Virtual Observatory and the IVOA

The Virtual Observatory and the IVOA The Virtual Observatory and the IVOA The Virtual Observatory Emergence of the Virtual Observatory concept by 2000 Concerns about the data avalanche, with in mind in particular very large surveys such as

More information

Massive Online Analysis - Storm,Spark

Massive Online Analysis - Storm,Spark Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R

More information

A Tutorial on Apache Spark

A Tutorial on Apache Spark A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:

More information

The Portal Aspect of the LSST Science Platform. Gregory Dubois-Felsmann Caltech/IPAC. LSST2017 August 16, 2017

The Portal Aspect of the LSST Science Platform. Gregory Dubois-Felsmann Caltech/IPAC. LSST2017 August 16, 2017 The Portal Aspect of the LSST Science Platform Gregory Dubois-Felsmann Caltech/IPAC LSST2017 August 16, 2017 1 Purpose of the LSST Science Platform (LSP) Enable access to the LSST data products Enable

More information

Apache Spark: A Literature Review. Presenter: Aaron Sarson

Apache Spark: A Literature Review. Presenter: Aaron Sarson Apache Spark: A Literature Review Presenter: Aaron Sarson Outline Introduction to Spark Problem to be addressed Proposed Approach Ø Research Questions Contributions Results Ø RQ1, RQ2, RQ3 Conclusion &

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

Sempala. Interactive SPARQL Query Processing on Hadoop

Sempala. Interactive SPARQL Query Processing on Hadoop Sempala Interactive SPARQL Query Processing on Hadoop Alexander Schätzle, Martin Przyjaciel-Zablocki, Antony Neu, Georg Lausen University of Freiburg, Germany ISWC 2014 - Riva del Garda, Italy Motivation

More information

BIG DATA COURSE CONTENT

BIG DATA COURSE CONTENT BIG DATA COURSE CONTENT [I] Get Started with Big Data Microsoft Professional Orientation: Big Data Duration: 12 hrs Course Content: Introduction Course Introduction Data Fundamentals Introduction to Data

More information

Clustering Billions of Images with Large Scale Nearest Neighbor Search

Clustering Billions of Images with Large Scale Nearest Neighbor Search Clustering Billions of Images with Large Scale Nearest Neighbor Search Ting Liu, Charles Rosenberg, Henry A. Rowley IEEE Workshop on Applications of Computer Vision February 2007 Presented by Dafna Bitton

More information

(1) Department of Physics University Federico II, Via Cinthia 24, I Napoli, Italy (2) INAF Astronomical Observatory of Capodimonte, Via

(1) Department of Physics University Federico II, Via Cinthia 24, I Napoli, Italy (2) INAF Astronomical Observatory of Capodimonte, Via (1) Department of Physics University Federico II, Via Cinthia 24, I-80126 Napoli, Italy (2) INAF Astronomical Observatory of Capodimonte, Via Moiariello 16, I-80131 Napoli, Italy To measure the distance

More information

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F.

Applied Spark. From Concepts to Bitcoin Analytics. Andrew F. Applied Spark From Concepts to Bitcoin Analytics Andrew F. Hart ahart@apache.org @andrewfhart My Day Job CTO, Pogoseat Upgrade technology for live events 3/28/16 QCON-SP Andrew Hart 2 Additionally Member,

More information

Simba: Efficient In-Memory Spatial Analytics.

Simba: Efficient In-Memory Spatial Analytics. Simba: Efficient In-Memory Spatial Analytics. Dong Xie, Feifei Li, Bin Yao, Gefei Li, Liang Zhou and Minyi Guo SIGMOD 16. Andres Calderon November 10, 2016 Simba November 10, 2016 1 / 52 Introduction Introduction

More information

A Tour of LSST Data Management. Kian- Tat Lim DM Project Engineer and System Architect

A Tour of LSST Data Management. Kian- Tat Lim DM Project Engineer and System Architect A Tour of LSST Data Management Kian- Tat Lim DM Project Engineer and System Architect Welcome Aboard Choo Yut Shing @flickr, CC BY-NC-SA 2.0 2 What We Do Accept and archive images and metadata Generate

More information

3D visualization of astronomy data using immersive displays

3D visualization of astronomy data using immersive displays ithes coffee meeting Riken 2016-12-09 3D visualization of astronomy data using immersive displays Gilles Ferrand Research Scientist Astrophysical Big Bang Laboratory 01 A collaboration Astronomy Computer

More information

VAPE Virtual observatory Aided Publishing for Education

VAPE Virtual observatory Aided Publishing for Education VAPE Virtual observatory Aided Publishing for Education http://ia2-edu.oats.inaf.it:8080/vape VAPE is an application for the publication of educational data in the Virtual Observatory (VO). VAPE has been

More information

LASDA: an archiving system for managing and sharing large scientific data

LASDA: an archiving system for managing and sharing large scientific data LASDA: an archiving system for managing and sharing large scientific data JEONGHOON LEE Korea Institute of Science and Technology Information Scientific Data Strategy Lab. 245 Daehak-ro, Yuseong-gu, Daejeon

More information

ArcGIS Enterprise: An Introduction. Philip Heede

ArcGIS Enterprise: An Introduction. Philip Heede Enterprise: An Introduction Philip Heede Online Enterprise Hosted by Esri (SaaS) - Upgraded automatically (by Esri) - Esri controls SLA Core Web GIS functionality (Apps, visualization, smart mapping, analysis

More information

An InterSystems Guide to the Data Galaxy. Benjamin De Boe Product Manager

An InterSystems Guide to the Data Galaxy. Benjamin De Boe Product Manager An InterSystems Guide to the Data Galaxy Benjamin De Boe Product Manager Analytics 3 InterSystems Corporation. All rights reserved. 4 InterSystems Corporation. All rights reserved. 5 InterSystems Corporation.

More information

Astrophysics with Terabytes. Alex Szalay The Johns Hopkins University Jim Gray Microsoft Research

Astrophysics with Terabytes. Alex Szalay The Johns Hopkins University Jim Gray Microsoft Research Astrophysics with Terabytes Alex Szalay The Johns Hopkins University Jim Gray Microsoft Research Living in an Exponential World Astronomers have a few hundred TB now 1 pixel (byte) / sq arc second ~ 4TB

More information

A web portal to analyze and distribute cosmology data

A web portal to analyze and distribute cosmology data on Hadoop https://cosmohub.pic.es A web portal to analyze and distribute cosmology data J.Carretero, P.Tallada, J.Casals, M.Caubet, C.Neissner, N.Tonello, J.Delgado, F.Torradeflot, M.Delfino, S.Serrano,

More information

CSE 190D Spring 2017 Final Exam Answers

CSE 190D Spring 2017 Final Exam Answers CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join

More information

Achieving Horizontal Scalability. Alain Houf Sales Engineer

Achieving Horizontal Scalability. Alain Houf Sales Engineer Achieving Horizontal Scalability Alain Houf Sales Engineer Scale Matters InterSystems IRIS Database Platform lets you: Scale up and scale out Scale users and scale data Mix and match a variety of approaches

More information

arxiv: v1 [cs.db] 21 Jun 2012

arxiv: v1 [cs.db] 21 Jun 2012 SkyQuery: An Implementation of a Parallel Probabilistic Join Engine for Cross-Identification of Multiple Astronomical Databases arxiv:1206.5021v1 [cs.db] 21 Jun 2012 László Dobos 1,2, Tamás Budavári 2,

More information
