Astrophysics with Terabytes
Alex Szalay, The Johns Hopkins University
Jim Gray, Microsoft Research


Living in an Exponential World
- Astronomers have a few hundred TB now: 1 pixel (byte) / sq arc second ~ 4 TB; multi-spectral and temporal data => 1 PB
- They mine it looking for new (kinds of) objects, or more of interesting ones (quasars); density variations in 400-D space; correlations in 400-D space
- Data doubles every year
[Figure: total data volume vs. year, 1970-2000, log scale; CCD data overtakes photographic glass plates]

The Challenges
- Exponential data growth: distributed collections, soon petabytes
- Data collection, discovery and analysis, publishing
- New analysis paradigm: data federations, move the analysis to the data
- New publishing paradigm: scientists are publishers and curators

Why Is Astronomy Special?
- Especially attractive for the wide public
- Community is not very large
- It has no commercial value: no privacy concerns, freely share results with others; great for experimenting with algorithms
- It is real and well documented: high-dimensional (with confidence intervals), spatial, temporal
- Diverse and distributed: many different instruments from many different places and many different times
- The questions are interesting
- There is a lot of it (soon petabytes)

The Virtual Observatory
- Premise: most data is (or could be) online
- The Internet is the world's best telescope:
  - It has data on every part of the sky
  - In every measured spectral band: optical, x-ray, radio...
  - As deep as the best instruments (2 years ago)
  - It is up when you are up, and the "seeing" is always great
  - It is a smart telescope: links objects and data to the literature on them
- Software became the capital expense: share, standardize, reuse

Spectrum Services
Search, plot, and retrieve SDSS, 2dF, and other spectra. The Spectrum Services web site is dedicated to spectrum-related VO services. On this site you will find tools and tutorials on how to access close to 500,000 spectra from the Sloan Digital Sky Survey (SDSS DR1) and the Two-degree Field Galaxy Redshift Survey (2dFGRS). The services are open to everyone to publish their own spectra in the same framework. Reading the tutorials on XML Web Services, you can learn how to integrate the 45 GB spectrum and passband database with your programs with a few lines of code.

The SkyServer Experience
- Sloan Digital Sky Survey: pixels + objects
- About 500 attributes per object, 400M objects; spectra for 1M objects
- Currently 2.4 TB, fully public
- Prototype eScience lab: moving analysis to the data, fast searches (color, spatial), visual tools, join pixels with objects
- Prototype in data publishing: 160 million web hits in 5 years
- http://skyserver.sdss.org/
[Figure: monthly web hits and SQL queries, July 2001 - July 2004, log scale from 1.E+04 to 1.E+07]

Database Challenges
- Loading the data
- Organizing the data
- Accessing the data
- Delivering the data
- Analyzing the data (spatial...)

DB Loading
- Wrote an automated, table-driven workflow system for loading
- Two-phase parallel load
- Over 16K lines of SQL code, mostly data validation (a sketch of a typical check follows this slide)
- Loading process was extremely painful:
  - Lack of systems engineering for the pipelines
  - Lots of foreign key mismatches
  - Fixing corrupted files (RAID5 disk errors)
  - Most of the time spent on scrubbing data
- Once the data is clean, everything loads in 1 week; reorganization of the data is about 1 week
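To make "data validation" concrete, here is a minimal sketch of the kind of referential-integrity check the loader had to run before enforcing foreign keys. The column names (SpecObj.bestObjID, PhotoObj.objID) follow the public SkyServer schema, but the check itself is illustrative, not the actual loader code.

    -- Find spectra whose photometric counterpart is missing:
    -- any rows returned mean a foreign key mismatch that must be scrubbed.
    SELECT s.specObjID
    FROM   SpecObj AS s
           LEFT JOIN PhotoObj AS p ON p.objID = s.bestObjID
    WHERE  s.bestObjID IS NOT NULL
      AND  p.objID IS NULL;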

Data Delivery
- Small requests (<100 MB): anonymous, data returned directly on the stream
- Medium requests (<1 GB): queues with resource limits
- Large requests (>1 GB): save data in a scratch area and use asynchronous delivery; only practical for large/long queries
- Iterative requests / workbench: save data in temp tables in user space, let the user manipulate it via the web browser
- Paradox: if we use a web browser to submit, users want an immediate response even from large queries

Data Organization
- Jim Gray's "20 queries" (now 35 queries) represent the most typical access patterns (an illustrative example follows this slide)
- Establishes a clean set of requirements that both sides can understand
- Also used as a regression test during development
- Dual star-schema centered on photometric objects
- Rich metadata stored with the schema
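As a flavor of what the "20 queries" look like, here is a hedged, illustrative example written against the public SkyServer PhotoObj table (objID, ra, dec, magnitudes u...z, type); the particular color and magnitude cuts are made up for illustration, not taken from the original query set.

    -- "Find candidate quasars": point sources that are unusually blue in u-g.
    SELECT TOP 100 objID, ra, dec, u, g, r
    FROM   PhotoObj
    WHERE  type = 6          -- point source (star-like)
      AND  u - g < 0.4       -- blue u-g color: quasar candidate
      AND  g < 20.0          -- bright enough for follow-up
    ORDER BY g;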

MyDB: Workbench
- Need to register power users, each with their own DB
- Query output goes to MyDB; it can be joined with the source database
- Results are materialized from MyDB upon request
- Users can do: insert, drop, create, select into, functions, procedures
- Publish their tables to a group area
- Data delivery via CasJobs (a C# web service) => sending the analysis to the data! (example below)
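A minimal sketch of how a CasJobs query lands its output in MyDB instead of streaming it back; the mydb. prefix is the CasJobs convention, while the table name and cuts are arbitrary examples.

    -- Run in CasJobs: the result set is written into the user's own database,
    -- where it can later be joined back against the source catalog or downloaded.
    SELECT objID, ra, dec, u, g, r
    INTO   mydb.BlueStars
    FROM   PhotoObj
    WHERE  type = 6 AND u - g < 0.4;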

Public Data Release: Versions
- June 2001: EDR, the Early Data Release
- July 2003: DR1, containing 30% of the final data
  - 150 million photo objects
  - 3 versions of the data: Target, Best, Runs
  - Total catalog volume 5 TB
- Published releases are served forever: EDR, DR1, DR2, ..., soon DR5
- Personal (1 GB) and Professional (100 GB) subsets
- Next: include email archives, annotations
- O(N^2), only possible because of Moore's Law!

Mirroring
- Master copies kept at 2 sites (Fermilab, JHU): 3 copies of everything running in parallel, necessary due to cheap hardware
- Multiple mirrors for DR2, DR3: Microsoft has multiple copies; also Japan, China, India, Germany, UK, Australia, Hungary, Brazil
- Replication: first via sneakernet (3 boxes with 1 TB of disks); now distributed from StarTap at UIC using the UDT protocol (Grossman et al.)
- Transferred 1 TB to Tokyo in about 2.2 hours

Web Interface
- Provide easy, visual access to exciting new data
- Access based on web services
- Illustrate that advanced content does not mean a cumbersome interface
- Understand new ways of publishing scientific data
- Target audience: advanced high-school students, amateur astronomers, the wide public, professional astronomers
- Multilingual capabilities built in from the start: heavy use of stylesheets, language branches
- First version built by AS + JG, later redesigned by Curtis Wong (Microsoft)

Tutorials and Guides
- Developed by J. Raddick and postdocs: how to use Excel, how to use a database (guide to SQL), expert advice on SQL
- Automated on-line documentation (Ani Thakar, Roy Gal): database information, glossary, algorithms, searchable help
- All stored in the DB and generated on the fly
- DB schema and metadata (documentation) generated from a single source, parsed in different ways
- Templated for others, widely used in astronomy; now also for sensor networks

Visual Tools
- Goal: connect pixel space to objects without typing queries, via a browser interface using a common paradigm (MapQuest)
- Challenge:
  - Images: 200K x 2K x 1.5K resolution x 5 colors = 3 terapixels
  - 300M objects with complex properties
  - 20K geometric boundaries and about 6M masks
  - Need a large dynamic range of scales (2^13)
- Assembled from a few building blocks: Image Cutout SOAP web service; SQL query service + database; images + overlays built on the server side -> simple client

Spatial Information for Users
- What resources are in this part of the sky?
- What is the common area of these surveys?
- Is this point in the survey?
- Give me all objects in this region
- Give me all good objects (not in bad areas)
- Cross-match these two catalogs
- Give me the cumulative counts over areas
- Compute fast spherical transforms of densities
- Interpolate sparsely sampled functions (extinction maps, dust temperature, ...)

Geometries
- SDSS has lots of complex boundaries: 20,000 regions and 6M masks (bright stars, satellites, etc.), represented as spherical polygons
- A GIS-like library built in C++ and SQL, now converted to C# for a direct plugin into SQL Server 2005 (17 times faster than the C++ version)
- Precompute arcs and store them in the database for rendering
- Functions for point-in-polygon, intersecting polygons, polygons covering points, all points in a polygon (example query below)
- Fast searches using spherical quadtrees (HTM)
- Visualization integrated with web services
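The spatial library is exposed to users as table-valued functions inside the database. A sketch of a typical radial search, assuming the SkyServer function dbo.fGetNearbyObjEq(ra, dec, radius) with the radius in arcminutes; the position and radius are arbitrary, and the exact function signature should be checked against the release in use.

    -- All photometric objects within 3 arcminutes of (ra, dec) = (185.0, -0.5).
    SELECT p.objID, p.ra, p.dec, n.distance     -- distance comes back from the spatial function
    FROM   dbo.fGetNearbyObjEq(185.0, -0.5, 3.0) AS n
           JOIN PhotoObj AS p ON p.objID = n.objID
    ORDER BY n.distance;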

Things Can Get Complex

Indexing Using Quadtrees
- Cover the sky with hierarchical pixels; COBE started with a cube
- The Hierarchical Triangular Mesh (HTM) uses trixels (Samet, Fekete)
- Start with an octahedron and split each triangle into 4 children, down to 20 levels deep
- The smallest triangles are about 0.3 arcseconds across
- Each trixel has a unique htmID
[Figure: trixel subdivision, e.g. trixel 2 splits into 2,0 2,1 2,2 2,3 and trixel 2,3 into 2,3,0 ... 2,3,3]

Space-Filling Curve
- So triangles correspond to ranges: all points inside the triangle are inside the range (sketch below)
[Figure: nested htmID ranges, e.g. trixel 1,2 maps to [0.12, 0.13) and its children 1,2,0 ... 1,2,3 to the sub-ranges [0.120, 0.121) ... [0.123, 0.130)]
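Because a trixel maps to one contiguous htmID range, a spatial region reduces to a handful of BETWEEN predicates that a regular B-tree index on htmID can answer. A sketch, assuming a cover function dbo.fHtmCoverCircleEq(ra, dec, radius) that returns (HtmIDStart, HtmIDEnd) range pairs; the function and column names vary between HTM releases, so treat them as assumptions.

    -- Objects inside a 3-arcminute circle, found via htmID range scans.
    SELECT p.objID, p.ra, p.dec
    FROM   dbo.fHtmCoverCircleEq(185.0, -0.5, 3.0) AS c   -- assumed: one row per covering trixel range
           JOIN PhotoObj AS p
             ON p.htmID BETWEEN c.HtmIDStart AND c.HtmIDEnd;
    -- (a final exact distance test trims objects in the partially covered boundary trixels)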

Comparing Massive Point Datasets
- Want to cross-match 10^9 objects with another set of 10^9 objects
- Using the HTM code, that is a LOT of calls (~10^6 seconds ~ a year)
- Want to do it in an hour
- Added 10 databases running in parallel and a bulk algorithm based on zones (sketch below)
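The zone idea (Gray, Nieto-Santisteban and Szalay) buckets each catalog into declination strips so that the cross-match becomes a single equi-join plus a distance filter, which the database engine can parallelize. A minimal sketch, assuming two hypothetical tables CatalogA and CatalogB that carry a precomputed zoneID = floor((dec + 90) / zoneHeight) and Cartesian unit vectors (cx, cy, cz):

    DECLARE @r float = 1.0/3600.0;   -- 1 arcsecond match radius, in degrees (one zone high)

    SELECT a.objID AS objA, b.objID AS objB
    FROM   CatalogA AS a
           JOIN CatalogB AS b
             ON  b.zoneID BETWEEN a.zoneID - 1 AND a.zoneID + 1        -- neighbouring declination zones only
             AND b.dec    BETWEEN a.dec - @r AND a.dec + @r
             AND b.ra     BETWEEN a.ra - @r/COS(RADIANS(a.dec))
                              AND a.ra + @r/COS(RADIANS(a.dec))        -- coarse box; ignores RA wrap-around
    WHERE  POWER(a.cx - b.cx, 2) + POWER(a.cy - b.cy, 2) + POWER(a.cz - b.cz, 2)
           < POWER(2.0 * SIN(RADIANS(@r) / 2.0), 2);                   -- exact chord-length test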

Trends
- CMB surveys: 1990 COBE 1,000; 2000 Boomerang 10,000; 2002 CBI 50,000; 2003 WMAP 1 million; 2008 Planck 10 million
- Angular galaxy surveys: 1970 Lick 1M; 1990 APM 2M; 2005 SDSS 200M; 2008 VISTA 1000M; 2012 LSST 3000M
- Galaxy redshift surveys: 1986 CfA 3,500; 1996 LCRS 23,000; 2003 2dF 250,000; 2005 SDSS 750,000
- Time domain: QUEST, SDSS Extension survey, Dark Energy Camera, PanStarrs, SNAP, LSST
- Petabytes/year by the end of the decade

Analysis of Large Data Sets
- Data reside in multiple databases; applications access data dynamically
- Typical data set sizes in the millions to billions; typical dimensionality 3-10
- Attributes are correlated in a non-linear fashion; one cannot assume a global metric; data have non-negligible error
- Current examples:
  - 2-3 point angular correlations of about 10 million galaxies
  - 3D power spectrum of about 1 million galaxy distances
  - Image shear due to gravitational lensing in 2.5 terapixels
  - Comparing 1 million galaxy spectra to 100M model spectra

Next-Generation Data Analysis
- Looking for needles in haystacks: the Higgs particle; haystacks: dark matter, dark energy
- Needles are easier than haystacks
- Optimal statistics have poor scaling: correlation functions are N^2, likelihood techniques N^3
- For large data sets the main errors are not statistical
- As data and computers grow with Moore's Law, we can only keep up with N log N
- A way out? Discard the notion of "optimal" (data is fuzzy, answers are approximate); don't assume infinite computational resources or memory
- Requires a combination of statistics and computer science

Working with Simulations
- Galaxy formation simulations (collaboration with V. Springel, S. White and G. Lemson):
  - Loading the Millennium galaxies/halos into a database, building a Peano-Hilbert curve spatial index
  - 20 queries focused on the merger history of halos (example below)
  - Semi-analytic star formation: linkage with population synthesis models
- Statistical studies (with Jie Wang):
  - 50 separate realizations of the large-scale structure in Millennium
  - Analyzing non-Gaussian signatures on large scales (PDF, 4th moments)
  - Currently loading these into the DB for analysis of non-linearities on large scales
- Next: build fast ray-tracing algorithms along the light cone inside the DB
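The merger-history queries exploit the depth-first numbering of the halo trees: each halo row carries a haloId assigned in depth-first order and a lastProgenitorId, so a halo's full progenitor tree is one contiguous ID interval. A sketch along those lines, with the table name (MPAHalo) and the example haloId treated as assumptions about the Millennium database schema:

    -- All progenitors (the full merger tree) of one halo,
    -- read off the depth-first [haloId, lastProgenitorId] interval.
    SELECT prog.haloId, prog.snapNum, prog.np        -- np: number of particles in the halo
    FROM   MPAHalo AS d
           JOIN MPAHalo AS prog
             ON prog.haloId BETWEEN d.haloId AND d.lastProgenitorId
    WHERE  d.haloId = 123456789;                     -- hypothetical example halo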

Simulation of Turbulence
- Collaboration with C. Meneveau, S. Chen, G. Eyink, R. Burns and E. Vishniac
- Take a 1024^3 simulation and store 1024 timesteps in a database
- Use a hierarchical Peano-Hilbert curve index within each time step; the atoms are 64^3 cells (sketch below)
- Database is tens of TB => build a cluster of cheap servers
- Currently 100 TB in place; the loader works
- Implementing low-level analysis tools inside the DB: interpolation schemes done, FFTW ported
- Cheap disks => redundant data storage in dual space
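Retrieving the atoms for one region of one timestep is again an index-range problem on the space-filling-curve key. A purely hypothetical sketch of such a lookup; the table and column names (VelocityAtoms, zindex, data) are invented for illustration and are not the actual turbulence database schema:

    -- Fetch the 64^3 atoms of timestep 512 whose Peano-Hilbert keys
    -- fall in a range covering the requested subcube.
    SELECT timestep, zindex, data        -- data: packed velocity field for one atom
    FROM   VelocityAtoms
    WHERE  timestep = 512
      AND  zindex BETWEEN 4096 AND 4159; -- range(s) computed from the space-filling curve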

Wireless Sensor Networks
- Collaboration with K. Szlavecz, A. Terzis, J. Gray, S. Ozer
- Building a 200-node network to measure soil moisture for environmental monitoring; expect 200 million measurements/yr
- Deriving from the SkyServer template, we were able to build an end-to-end system in less than two weeks
- Built an OLAP datacube: conditional sums along multiple dimensional axes (sketch below)
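Those "conditional sums along multiple dimensional axes" map directly onto SQL Server's CUBE aggregation. A minimal sketch over a hypothetical Measurements table with sensorID, day and soilMoisture columns (all names invented for illustration):

    -- Average soil moisture along every combination of the (sensorID, day) axes,
    -- including per-sensor, per-day, and grand-total roll-ups.
    SELECT sensorID, day,
           AVG(soilMoisture) AS avgMoisture,
           COUNT(*)          AS nSamples
    FROM   Measurements
    GROUP BY sensorID, day WITH CUBE;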

Summary
- Data growing exponentially; analyzing so much data requires a new model
- More data coming: petabytes/year by 2010; need scalable solutions
- Move the analysis to the data
- Spatial and temporal features are essential
- 20 queries!
- The same thing is happening in all sciences: high-energy physics, genomics, cancer research, medical imaging, oceanography, remote sensing
- eScience: an emerging new branch of science