Sharing SDSS Data with the World

The Sloan Digital Sky Survey
Sharing SDSS Data with the World
Ani Thakar, Jordan Raddick
Center for Astrophysical Sciences, The Johns Hopkins University

Outline
- Sharing Astronomy Data
- SDSS Overview
- Data Access Tools
- Data to Astronomers
- Data to the Public

A. Thakar & J. Raddick, JHU Data Sharing Workshop February 5, 2014

Astronomy Data
- In the old days astronomers took photos.
- Starting in the 1960s they began to digitize.
- New instruments are digital (100s of GB/night).
- Detectors are following Moore's law.
- Data avalanche: data doubles every 2 years.
[Figure: total area of 3m+ telescopes in the world (glass, in m^2) and total number of CCD pixels (in megapixels) as a function of time, 1970 to 2000, on a logarithmic axis. Growth over 25 years is a factor of 30 in glass and 3000 in pixels.]
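The growth factors quoted for the chart can be turned into doubling times with a short calculation. This is a minimal sketch that assumes smooth exponential growth over the 25 years, which the slide does not state; the function name is illustrative.

```python
import math

def doubling_time(growth_factor, years):
    """Years needed to double, assuming smooth exponential growth."""
    return years * math.log(2) / math.log(growth_factor)

pixels = doubling_time(3000, 25)  # CCD pixels: factor of 3000 in 25 years
glass = doubling_time(30, 25)     # telescope area: factor of 30 in 25 years
print(f"pixels double every {pixels:.1f} years, glass every {glass:.1f} years")
```

Under that assumption the pixel count doubles roughly every two years, in line with the "doubles every 2 years" statement, while the glass area doubles about every five.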

Astronomy Data
- Astronomers have a few petabytes now.
- Data doubles every 2 years.
- Data is public after 2 years.
- So, 50% of the data is public.
- Some have private access to 5% more data.
- But: how do I get at that 50% of the data?
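The 50% figure follows from the two numbers before it. If the total volume grows as 2^(t/T) with doubling time T, then the data older than an embargo period E is a fraction 2^(-E/T) of the total. A minimal sketch of that arithmetic, assuming idealized exponential growth (the function name is illustrative):

```python
def public_fraction(doubling_years, embargo_years):
    """Share of all data ever taken that is older than the embargo period."""
    return 0.5 ** (embargo_years / doubling_years)

# Doubling every 2 years, public after 2 years: half of everything is public.
print(public_fraction(2, 2))
```

With T = E = 2 years the exponent is 1, so the public share is exactly one half.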

New Astronomy - Different!
- Data avalanche
  - The flood of terabytes of data is already happening, whether we like it or not
  - Present techniques do not scale well
  - Drinking from the fire hose!
- Systematic data exploration
  - Will have a central role
  - Statistical analysis of typical objects
  - Automated search for rare events
- Digital archives of the sky

Data Intensive Science
- Data avalanche in astronomy and other sciences
- Need a new paradigm to deal with it
- Old file-based solutions do not cut it
- Old programming models don't work
- Scientists need to learn new tricks

Data Intensive Science Tools
- Databases instead of files (DBMSs)
  - Data integrity ensured
  - Optimized data access (DB indices)
  - Ability to define methods on data
  - Do the science inside the database!
  - Bring the analysis to the data, not vice versa
  - Scalable (parallel) data access
- High-speed transport protocols
  - Move the data rapidly when necessary
- Asynchronous Web services
  - Web browsers cannot handle data volumes

Virtual Observatory
- Premise: most data is (or could be) online
- So, the Internet is the world's best telescope:
  - It has data on every part of the sky
  - In every measured spectral band: optical, x-ray, radio...
  - As deep as the best instruments (2 years ago)
  - It is up when you are up
  - The observing conditions are always great (no working at night, no clouds, no moons, no...)
  - It's a smart telescope: links objects and data to literature on them
- World-Wide Telescope coming soon!

The Sloan Digital Sky Survey
[Images: Apache Point Obs., NM; SDSS sky coverage; SDSS camera with 54 CCD chips]
www.sdss.org
- Digital map in 5 spectral bands covering ¼ of the sky
- Will obtain 40 TB of raw pixel data
- Photometric catalog with more than 200 million objects
- Spectra of ~1 million objects
- Data Release 6 (DR6): 350 M images, 1.1 M spectra
- Next release, Data Release 7 (DR7): 400 M images, 1.2 M spectra

Ground-breaking Instruments
Multi-Fiber SpectroGraph (JHU)

An Ambitious Survey
- Info content > US Library of Congress
- Before SDSS, total number of galaxies with measured parameters ~200k
- With SDSS, we already have detailed parameters for over 200 million galaxies!!
- A thousand-fold increase in the amount of data!

SDSS Data Releases
[Figure: timeline of CAS size by release, 2001 to 2007: EDR 200 GB, DR1 1 TB, DR2 2 TB, DR3 3 TB, DR4 3.8 TB, DR5 4.5 TB, DR6 5 TB]

Rel  Date     CAS Size  Images  Spectra  Sq Deg  CAS Mirrors
DR6  6/29/07  5 TB      300 M   885k     8520    Hungary, India
DR5  6/28/06  4.5 TB    240 M   740k     8000    JHU, Portsmouth, STScI, Hungary, Moscow
DR4  6/29/05  3.8 TB    180 M   608k     6670    JHU, India
DR3  9/27/04  3 TB      141 M   478k     5200    JHU, India, Portsmouth
DR2  4/15/04  2 TB      88 M    330k     3324    JHU, UPitt, SDSC, Germany
DR1  6/15/03  1 TB      53 M    186k     2099    JHU, SDSC, CDAC, UPitt, UK, Germany, Japan, India
EDR  6/06/01  200 GB    14 M    54k      462     JHU, SDSC, UK (ROE), Japan

SDSS Data Overview
- Data Archive Server (DAS)
  - FITS files (raw data)
  - Images, spectra, corrected frames, atlas images, binned images, masks
  - Online form-based access
  - Rsync and wget file retrieval
- Catalog Archive Server (CAS)
  - Science parameters extracted to catalogs
  - Stuffed into relational DBMS (SQL Server)
  - Heavily indexed, optimized
  - Online access via SkyServer
  - Several levels of access, query tools
[Diagram labels: www.sdss.org; SDSS Data Release; das.sdss.org/drx-cgi-bin/das; cas.sdss.org; skyserver.sdss.org]

The CAS Databases
[Diagram label: SDSS Pipelines]
- Processed data is stuffed into a commercial relational DBMS
  - Microsoft SQL Server 2000
  - Allows fast exploration and analysis (data mining)
- Two versions of the sky: Best and Target
  - Target is the version of the sky on which spectroscopic targets were chosen
  - Best is the latest, greatest processing of the data
  - 2 DBs for each release, e.g. BestDR3 and TargDR3
- Heavily indexed to speed up access
  - HTM + DB indices
- Short queries can run interactively
- Long queries (> 1 hr) require a custom Batch Query System

SkyServer Web Access
- Supports several levels of SQL access
  - Novice/casual users: Radial (cone) and Rectangular search
  - Intermediate/astro users: Imaging and Spectro Form Query
  - Expert: free-form SQL, Object Crossid (upload RA/dec list), CasJobs workbench environment (MyDB)
- Visual tools
  - ImageCutout service: Finding Chart, Navigate/browse images, image lists
  - Explore tool: detailed info for each image object, including spectrum
- Downloadable interfaces
  - Emacs, Java tool (sdssqa), sqlcl (command-line)
- http://cas.sdss.org/, http://skyserver.sdss.org/
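To make the radial (cone) search concrete, here is a minimal sketch of what such a search computes: every object whose angular distance from a given position is within a given radius. It is a brute-force scan over a made-up list and is not how SkyServer implements it (the slides describe HTM and database indices for that); the function names and catalog rows are hypothetical.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(objects, ra0, dec0, radius_arcmin):
    """Return the objects within radius_arcmin of the position (ra0, dec0)."""
    radius_deg = radius_arcmin / 60.0
    return [o for o in objects
            if angular_separation_deg(ra0, dec0, o["ra"], o["dec"]) <= radius_deg]

# Made-up catalog rows, for illustration only.
catalog = [
    {"id": 1, "ra": 180.00, "dec": 0.00},
    {"id": 2, "ra": 180.01, "dec": 0.01},
    {"id": 3, "ra": 181.00, "dec": 0.00},
]
hits = cone_search(catalog, 180.0, 0.0, 2.0)
print([o["id"] for o in hits])
```

A linear scan like this costs time proportional to the catalog size on every query, which is why a catalog of hundreds of millions of objects needs a spatial index instead.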

CAS Workload Management
- Query execution time follows power law
  - Vast majority of queries under a minute
- Separate short and long queries
  - Execute long queries in batch mode
  - Short and long queues (short = 1 min, long = 8 hrs)
  - Strictly limit time of query in a queue
- Provide user workspace DB: MyDB
  - Reduce network traffic from repeat and intermediate results
  - Allow sharing of query results between user groups
[Figure: log-log plot with series labeled elapsed, cpu, and rows]

CasJobs and MyDB
- Batch Query Workbench for SDSS CAS
- Queries are queued for batch execution
  - Load balancing queues on multiple servers
  - Limit of 2 simultaneous queries per server
- Short (1 minute) queue for immediate mode
  - Query aborted after 1 minute
- MyDB personal database
  - 1 GB (more on demand) SQL DB for each user
  - Long queries write to MyDB table by default
  - User can extract output (download) when ready
  - Ability to share MyDB tables with other users via group visibility
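The queueing rules on this slide can be illustrated with a small simulation. This is a sketch under stated assumptions, not the CasJobs implementation: the class, function, and server names are hypothetical, the least-loaded placement policy is a guess at what "load balancing" means here, and the 8-hour long-queue limit is taken from the workload management slide.

```python
from dataclasses import dataclass, field

SHORT_LIMIT_S = 60             # short queue: query aborted after 1 minute
LONG_LIMIT_S = 8 * 60 * 60     # long queue: 8 hours, per the workload slide
MAX_RUNNING_PER_SERVER = 2     # limit of 2 simultaneous queries per server

@dataclass
class Server:
    name: str
    running: list = field(default_factory=list)  # ids of queries in progress

def pick_server(servers):
    """Return the least-loaded server with a free slot, or None if all are full."""
    free = [s for s in servers if len(s.running) < MAX_RUNNING_PER_SERVER]
    return min(free, key=lambda s: len(s.running)) if free else None

def submit(servers, waiting, query_id):
    """Start the query on a free server, otherwise leave it waiting in the queue."""
    server = pick_server(servers)
    if server is None:
        waiting.append(query_id)
        return None
    server.running.append(query_id)
    return server.name

def outcome(queue, runtime_s):
    """A query that exceeds its queue's time limit is aborted."""
    limit = SHORT_LIMIT_S if queue == "short" else LONG_LIMIT_S
    return "completed" if runtime_s <= limit else "aborted"

servers = [Server("db1"), Server("db2")]
waiting = []
placements = [submit(servers, waiting, q) for q in range(5)]
print(placements, waiting)
print(outcome("short", 90), outcome("long", 90))
```

With two servers and a limit of two queries each, the fifth submission has to wait. A 90-second query is aborted in the short queue but completes in the long one, which is the case for resubmitting it in batch mode.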

MyDB Features
Tables, Views, Functions, Procedures, Extract, Publish, Create, Drop, Rename

Job Management
- Asynchronous
  - Separate query and output jobs
- BatchAdmin DB keeps track of session, jobs, servers, queues and user privileges
- Job History page allows user to monitor, cancel, and resubmit jobs

Statistical Analysis
- Programmability in schema
  - Analysis stored right in the database
- Workbench environment
  - Can handle arbitrary computational queries
  - Results can be saved server-side in MyDB
  - Iterative analysis possible
- Provide capability for users to download results to their own machines for further analysis
  - Not trivial with TBs of data!!

Usage Logging
- Fastidiously log everything
  - Web hits, SQL queries
  - IP domains, query execution times etc.
- Analyze workload, performance
- Logs harvested to JHU hourly
  - Currently only FNAL and JHU logged
  - Mirror sites to be added in the future
- Invaluable resource for planning, designing improvements

"You have only two things to fear: failure and success." (Jim Gray)
"I don't know what the secret to success is, but the secret to failure is to try to please everyone." (Bill Cosby)

Data to the Public: from Access to Learning
Jordan Raddick, Ani Thakar, Alex Szalay (Johns Hopkins University)
Robert Sparks (National Optical Astronomy Observatory)
Jim Gray (Microsoft Research)
Countless others

Why data to the public?
- Commitment to data sharing
- Free choice: people can look at any star or galaxy
- Inquiry learning is known to be effective
  - Learners answer a question themselves, with their own design

Audiences for data
- Formal education
  - Intro college students and teachers
  - K-12 students and teachers (science and more)
- Amateur astronomers
- Museum visitors
- Media
- General public
- Access must be tailored to each audience

Formal Ed: SkyServer

Browse Data

Explore Data

Search for Data

From Access to Learning
- Not enough just to give public access to data
- How are they using it?
- Are they learning anything from it?
- Do they understand what they're doing and why?*
*Understanding = long-term memory, transfer to new situations (How People Learn)

Formal Education: Projects

SkyServer projects
- Three levels
  - Kids (K-8)
  - Basic (high school or Astro 101)
  - Advanced (skilled and motivated students)
- Research challenges
  - Independent research (open inquiry) with data
- Activities created by users

SkyServer projects

Use of educational projects
- 2 of top 20 IPs to hit entire site are K-12 districts
  - Orlando, FL & Los Angeles

Effectiveness of projects
- Not just who's using it, but how effectively
- How much are students really learning?
- Results are

Effectiveness of projects
- (education research is really, really hard)
- This is a top priority in the future

Lessons Learned
- Users appreciate trust implied by giving them real data
- Public tools also used by astronomers
  - But astronomers want too many options for public
  - Scope creep is deadly
- SkyServer projects are good for high-achieving students with knowledgeable teachers
  - But they confuse others

Lessons Learned
- Instructional design is easy to do, hard to do right
  - Teacher guides must be very detailed
- Huge audience for citizen science
  - Rare chance to show science in action
  - But research problem must be authentic

Decisions to Make (for SkyServer and for all of us)
- Who is our audience in formal education?
  - Top 2% of students?
  - People getting only one exposure to science?
  - Some of both?
- What interactive tools can we make for museums?
  - You have 30 seconds to get attention, teach

Contact Information
Jordan Raddick
410-516-8889
raddick@jhu.edu

Citizen science: Galaxy Zoo
- Background: galaxies come in different shapes (spiral, elliptical)
- Classifying is hard for computers, easy for people
- So get the public to help!

Galaxy Zoo

Citizen science learning
- Even people who love science often don't understand peer review, proposals, etc.
- Want to use Galaxy Zoo to show process of science
- As we write science papers, we share results with volunteers

Galaxy Zoo blog

Use of projects
- What projects are users hitting? When?
- (Warning! Warning!): data not yet cleaned
- Most hits on Tues, Thurs; fewest on weekend
- Looked at project hits in April 2006
  - School in session
- But hit counts to projects are underestimates, because people have to get data too

Use of educational projects
- Count hits with hit logs
- Cleaned data
  - Removed webcrawler hits
- SQL analysis
- Hits & page views doubled each year
- 2 of top 20 hitting groups (by IP) are school districts
  - Orlando, FL & Los Angeles
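The cleaning and counting steps can be sketched in a few lines. The slide says the actual analysis was done in SQL; the Python below is only an illustration of the same idea, and the record layout, the crawler markers, and the sample records are all made up.

```python
from collections import Counter

# Hypothetical crawler markers; a real filter would need a maintained list.
CRAWLER_MARKERS = ("bot", "crawler", "spider")

def is_crawler(user_agent):
    """Flag a hit as a web crawler if its user agent contains a known marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in CRAWLER_MARKERS)

def top_hitters(log_records, n=20):
    """Count non-crawler hits per client IP and return the n largest."""
    counts = Counter(r["ip"] for r in log_records if not is_crawler(r["agent"]))
    return counts.most_common(n)

# Made-up log records, for illustration only.
records = [
    {"ip": "10.0.0.1", "agent": "Mozilla/5.0"},
    {"ip": "10.0.0.1", "agent": "Mozilla/5.0"},
    {"ip": "10.0.0.2", "agent": "ExampleBot/1.0"},
    {"ip": "10.0.0.3", "agent": "Mozilla/5.0"},
]
print(top_hitters(records, n=2))
```

Grouping by individual IP is a simplification; identifying a school district as one of the top hitters would presumably require grouping by IP domain or address range instead.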

Use of projects
Hits/projects by type (total hits / # projects):
- Games: 8%
- Advanced: 27%
- Kids: 28%
- Challenges: 11%
- Basic: 26%

Future Needs
- Back up all design decisions with education research
  - Now working with JHU School of Education
- Study: does use of real data make a difference for student understanding of X?
  - X = astronomy content, process of science, attitude toward science
  - Ed.D. dissertation waiting to happen
- Need a community of people interested in this