From Astrophysics to Sensor Networks: Facing the Data Explosion
1 From Astrophysics to Sensor Networks: Facing the Data Explosion
- Alex Szalay, The Johns Hopkins University
- Jim Gray, Microsoft Research
2 Living in an Exponential World
- Astronomers have a few hundred TB now
  - 1 pixel (byte) / sq arc second ~ 4TB
  - Multi-spectral, temporal, 1PB
- They mine it looking for
  - new (kinds of) objects or more of the interesting ones (quasars)
  - density variations in 400-D space
  - correlations in 400-D space
- Data doubles every year
- (Figure legend: CCDs, Glass)
3 The Challenges
- Exponential data growth: distributed collections, soon Petabytes
- Data Collection, Discovery and Analysis, Publishing
- New analysis paradigm: data federations, move analysis to data
- New publishing paradigm: scientists are publishers and curators
4 Publishing Data
Roles        Authors         Publishers        Curators         Consumers
Traditional  Scientists      Journals          Libraries        Scientists
Emerging     Collaborations  Project www site  Bigger Archives  Scientists
- Exponential growth:
  - Projects last at least 3-5 years
  - Data sent upwards only at the end of the project
  - Data will never be centralized
- More responsibility on projects
  - Becoming Publishers and Curators
- Data will reside with projects
  - Analyses must be close to the data
5 Accessing Data
- If there is too much data to move around, take the analysis to the data!
- Do all data manipulations at the database
  - Build custom procedures and functions in the database
- Automatic parallelism guaranteed
- Easy to build in custom functionality
  - Databases & procedures being unified
  - Example: temporal and spatial indexing
  - Pixel processing
- Easy to reorganize the data
  - Multiple views, each optimal for certain analyses
  - Building hierarchical summaries is trivial
- Scalable to Petabyte datasets: active databases!
6 Why Is Astronomy Special?
- Especially attractive for the wide public
- Community is not very large
- It has no commercial value
  - No privacy concerns, freely share results with others
  - Great for experimenting with algorithms
- It is real and well documented
  - High-dimensional (with confidence intervals)
  - Spatial, temporal
- Diverse and distributed
  - Many different instruments from many different places and many different times
- The questions are interesting
- There is a lot of it (soon petabytes)
7 The Virtual Observatory
- Premise: most data is (or could be) online
- The Internet is the world's best telescope:
  - It has data on every part of the sky
  - In every measured spectral band: optical, x-ray, radio...
  - As deep as the best instruments (2 years ago)
  - It is up when you are up
  - The seeing is always great
  - It's a smart telescope: links objects and data to literature on them
- Software became the capital expense
  - Share, standardize, reuse...
8 Similarities to Particle Physics
- Particle Physics: Van de Graaff, Cyclotrons, National Labs, International (CERN), SSC vs LHC
- Optical Astronomy: 2.5m telescopes, 4m telescopes, 8m class telescopes, Surveys/Time Domain, m telescopes
- Similar trends with a 20 year delay:
  - fewer and ever bigger projects
  - increasing fraction of cost is in software
  - more conservative engineering
- Can the exponential continue, or will it be linear or logistic?
- What can we learn from High Energy Physics?
9 Sloan Digital Sky Survey
- Goal: create the most detailed map of the Northern sky ("The Cosmic Genome Project")
- Two surveys in one
  - Photometric survey in 5 bands
  - Spectroscopic redshift survey
- Automated data reduction
  - 150 man-years of development
- High data volume
  - 40 TB of raw data
  - 5 TB processed catalogs
  - Data is public
- The University of Chicago, Princeton University, The Johns Hopkins University, The University of Washington, New Mexico State University, Fermi National Accelerator Laboratory, US Naval Observatory, The Japanese Participation Group, The Institute for Advanced Study, Max Planck Inst, Heidelberg
- Sloan Foundation, NSF, DOE, NASA
10 The Imaging Survey
- Drift scan of 10,000 square degrees
- 24k x 1M pixel panoramic images in 5 colors: broad-band filters (u,g,r,i,z)
- 2.5 Terapixels of images
11 The Spectroscopic Survey
- Expanding universe: redshift = distance
- SDSS Redshift Survey: 1 million galaxies, 100,000 quasars, 100,000 stars
- Two high throughput spectrographs: spectral range Å, 640 spectra simultaneously, R=2000 resolution, 1.3 Å
- Features: automated reduction of spectra, very high sampling density and completeness
12 Precision Cosmology
- Power Spectrum of Fluctuations: few percent accuracy!
- Main challenge: with so much data the dominant errors are systematic, not statistical!
- Using large simulations to understand significance of detection
13 The SkyServer Portal
- Sloan Digital Sky Survey: Pixels + Objects
- About 500 attributes per object, 400M objects
- Currently 2.4TB fully public
- Prototype eScience lab (800 users)
  - Moving analysis to the data
  - Fast searches: color, spatial
- Visual tools
  - Join pixels with objects
- Prototype in data publishing
  - 200 million web hits in 5 years
  - 930,000 distinct users
14 SkyServer Traffic (figure: web hits per month and SQL queries per month, 2001/7 to 2004/7, axis from 1.E+04 to 1.E+07)
15 Database Challenges
- Loading the Data
- Organizing the Data
- Accessing the Data
- Delivering the Data
- Analyzing the Data (spatial ...)
16 Data Organization
- Jim Gray: 20 queries, now 35 queries
  - Represent most typical access patterns
  - Establishes a clean set of requirements that both sides can understand
  - Also used as a regression test during development
- Dual star-schema centered on photometric objects
- Rich metadata information stored with schema
17 DB Loading
- Wrote automated table-driven workflow system for loading
  - Two-phase parallel load
  - Over 16K lines of SQL code, mostly data validation
- Loading process was extremely painful
  - Lack of systems engineering for the pipelines
  - Lots of foreign key mismatches
  - Fixing corrupted files (RAID5 disk errors)
  - Most of the time spent on scrubbing data
- Once data is clean, everything loads in 1 week
- Reorganization of data is about 1 week
18 Data Delivery
- Small requests (<100MB): anonymous, putting data on the stream
- Medium requests (<1GB): queues with resource limits
- Large requests (>1GB): save data in scratch area and use asynchronous delivery; only practical for large/long queries
- Iterative requests/workbench: save data in temp tables in user space; let user manipulate via web browser
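The size tiers on the Data Delivery slide amount to a simple routing rule. A minimal Python sketch, assuming decimal units and hypothetical tier names (`stream`, `queue`, `async`) that do not come from the talk. The slide does not say where a request of exactly 1GB belongs, so this sketch sends it to the large tier.

```python
MB = 10**6
GB = 10**9

def delivery_mode(result_bytes):
    """Pick a delivery tier from the estimated result size (hypothetical names)."""
    if result_bytes < 100 * MB:
        return "stream"   # small: anonymous, data put directly on the stream
    if result_bytes < 1 * GB:
        return "queue"    # medium: queued, subject to resource limits
    return "async"        # large: written to scratch, delivered asynchronously

for size in (5 * MB, 500 * MB, 20 * GB):
    print(size, delivery_mode(size))
```

The iterative workbench case is not a size tier and is left out: there the result stays in the user's own tables instead of being shipped.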
19 MyDB: Workbench
- Need to register power users, with their own DB
- Query output goes to MyDB
- Can be joined with source database
- Results are materialized from MyDB upon request
- Users can do:
  - Insert, Drop, Create, Select Into, Functions, Procedures
  - Publish their tables to a group area
- Data delivery via the CASJobs (C# WS)
- => Sending analysis to the data!
20 Public Data Release: Versions
- June 2001: EDR (Early Data Release)
- July 2003: DR1
  - Contains 30% of final data
  - 150 million photo objects
- Now at DR5, with 2.4TB
- 3 versions of the data: Target, Best, Runs
  - Total catalog volume 5TB
- Published releases served forever: EDR, DR1, DR2, ..., soon DR5
- Personal (1GB), Professional (100GB) subsets
- Next: include archives, annotations
- O(N^2): only possible because of Moore's Law!
- (Figure: accumulating release copies labeled EDR, DR1, DR2, DR3)
21 Mirroring
- Master copies kept at 2 sites
  - Fermilab, JHU: 3 copies of everything running in parallel
  - Necessary due to cheap hardware
- Multiple mirrors for DR2, DR3
  - Microsoft has multiple copies
  - Japan, China, India, Germany, UK, Australia, Hungary, Brazil
- Replication
  - First via Sneakernet, 3 boxes with 1TB of disks
  - Now: distributed from StarTap at UIC, using the UDT protocol (Grossman et al)
  - Transferred 1TB to Tokyo in about 2.2 hours
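The transfer figure quoted on the Mirroring slide (1TB to Tokyo in about 2.2 hours) can be turned into a sustained rate with a short calculation. This is a back-of-the-envelope sketch that assumes 1TB means 10^12 bytes and ignores protocol overhead; the function name is illustrative.

```python
def throughput(bytes_moved, hours):
    """Return (megabytes per second, gigabits per second) for a transfer."""
    seconds = hours * 3600
    mb_per_s = bytes_moved / seconds / 1e6
    gbit_per_s = bytes_moved * 8 / seconds / 1e9
    return mb_per_s, gbit_per_s

mb_s, gbit_s = throughput(1e12, 2.2)
print(f"{mb_s:.0f} MB/s, {gbit_s:.2f} Gbit/s")  # prints "126 MB/s, 1.01 Gbit/s"
```

Under these assumptions the transfer ran at roughly 1 Gbit/s sustained.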
22 Web Interface
- Provide easy, visual access to exciting new data
  - Access based on web services
- Illustrate that advanced content does not mean a cumbersome interface
- Understand new ways of publishing scientific data
- Target audience
  - Advanced high-school students, amateur astronomers, wide public, professional astronomers
- Multilingual capabilities built in from the start
  - Heavy use of stylesheets, language branches
- First version built by AS+JG, later redesigned by Curtis Wong, Microsoft
- Many other contributors (Ani Thakar+ ...)
- Meticulous logging and log harvesting of mirrors
23 Tutorials and Guides
- Developed by J. Raddick and Postdocs
  - How to use Excel
  - How to use a database (guide to SQL)
  - Expert advice on SQL
- Automated on-line documentation (Ani Thakar, Roy Gal)
  - Database information, Glossary, Algorithms
  - Searchable Help
  - All stored in the DB, and generated on the fly
  - DB schema and metadata (documentation) generated from a single source, parsed different ways
- Templated for others, widely used in astronomy
- Now also for sensor networks
24 Visual Tools
- Goal: connect pixel space to objects without typing queries
  - Browser interface, using common paradigm (MapQuest)
- Challenge:
  - Images: 200K x 2K x 1.5K resolution x 5 colors = 3 Terapix
  - 300M objects with complex properties
  - 20K geometric boundaries and about 6M masks
  - Need large dynamic range of scales (2^13)
- Assembled from a few building blocks:
  - Image Cutout SOAP Web Service
  - SQL query service + database
  - Images+overlays built on server side -> simple client
- Spectrum Service, now also Footprint service, etc.
25 Spectrum Services
Search, plot, and retrieve SDSS, 2dF, and other spectra. The Spectrum Services web site is dedicated to spectrum-related VO services. On this site you will find tools and tutorials on how to access close to 500,000 spectra from the Sloan Digital Sky Survey (SDSS DR1) and the 2 degree Field redshift survey (2dFGRS). The services are open to everyone to publish their own spectra in the same framework. By reading the tutorials on XML Web Services, you can learn how to integrate the 45 GB spectrum and passband database with your programs in a few lines of code.
26 Spatial Information for Users
- What resources are in this part of the sky?
- What is the common area of these surveys?
- Is this point in the survey?
- Give me all objects in this region
- Give me all good objects (not in bad areas)
- Cross-matching these two catalogs
- Give me the cumulative counts over areas
- Compute fast spherical transforms of densities
- Interpolate sparsely sampled functions (extinction maps, dust temperature, ...)
27 Geometries
- SDSS has lots of complex boundaries
  - 20,000 regions
  - 6M masks (bright stars, satellites, etc), represented as spherical polygons
- A GIS-like library built in C++ and SQL
- Now converted to C# for direct plugin into SQL Server 2005 (17 times faster than C++)
- Precompute arcs and store in database for rendering
- Functions for point in polygon, intersecting polygons, polygons covering points, all points in polygon
- Fast searches using spherical quadtrees (HTM)
- Visualization integrated with web services
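The point-in-polygon function named on the Geometries slide can be illustrated for the simplest case. The Python sketch below is not the SDSS library. It assumes a convex spherical polygon whose edges are great-circle arcs and whose vertices are listed counterclockwise as seen from outside the sphere, and all function names are hypothetical. Each edge defines a half-space through the origin, and a point is inside when it lies on the inner side of every edge.

```python
import math

def radec_to_xyz(ra_deg, dec_deg):
    """Convert (RA, Dec) in degrees to a unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def point_in_convex_polygon(point, vertices):
    """True if the unit vector `point` is inside the convex spherical polygon."""
    n = len(vertices)
    for i in range(n):
        # Normal of the great circle through this edge; it points inward
        # when the vertices run counterclockwise seen from outside.
        normal = cross(vertices[i], vertices[(i + 1) % n])
        if dot(normal, point) < 0:
            return False
    return True

# A roughly 10 x 10 degree patch; edges are great circles, not lines of constant Dec.
patch = [radec_to_xyz(ra, dec) for ra, dec in [(10, 10), (20, 10), (20, 20), (10, 20)]]
print(point_in_convex_polygon(radec_to_xyz(15, 15), patch))
print(point_in_convex_polygon(radec_to_xyz(30, 15), patch))
```

Non-convex regions and masks would need a richer representation, for example a union of convex pieces, which this sketch does not attempt.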
28 Things Can Get Complex
- (Figure: two overlapping regions labeled A and B)
- Green area: A (B- ε) should find B if it contains an A and not masked
- Yellow area: A (B±ε) is an edge case may find B if it contains an A.
29 Indexing Using Quadtrees
- Cover the sky with hierarchical pixels
- COBE: start with a cube
- Hierarchical Triangular Mesh (HTM) uses trixels (Samet, Fekete)
- Start with an octahedron, and split each triangle into 4 children, down to 20 levels deep
- Smallest triangles are 0.3
- Each trixel has a unique htmid
- (Figure: trixel 2 split into 2,0 2,1 2,2 2,3, and trixel 2,3 split into 2,3,0 2,3,1 2,3,2 2,3,3)
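The recursive subdivision described on the quadtree slide can be sketched in a few lines of Python. This is an illustration of the idea only. The root ordering and child labels below are assumptions of this sketch and are not guaranteed to match the htmid numbering of the real HTM library, and "20 levels" is read here as 20 subdivisions of the octahedron faces.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def midpoint(a, b):
    """Midpoint of a great-circle arc, pushed back onto the unit sphere."""
    return normalize(tuple(x + y for x, y in zip(a, b)))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def left_of(p, a, b):
    """True if p lies on the left of the arc a -> b, seen from outside."""
    return dot(cross(a, b), p) >= 0

X, Y, Z = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
NX, NY, NZ = (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (0.0, 0.0, -1.0)
# Eight octahedron faces, vertices counterclockwise seen from outside.
ROOTS = [(X, Y, Z), (Y, NX, Z), (NX, NY, Z), (NY, X, Z),
         (Y, X, NZ), (NX, Y, NZ), (NY, NX, NZ), (X, NY, NZ)]

def trixel_path(p, depth):
    """Return [root, child, child, ...] for the trixel containing point p."""
    p = normalize(p)
    root = next(i for i, (a, b, c) in enumerate(ROOTS)
                if left_of(p, a, b) and left_of(p, b, c) and left_of(p, c, a))
    tri, path = ROOTS[root], [root]
    for _ in range(depth):
        a, b, c = tri
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        if left_of(p, ab, ca):        # corner child at a
            path.append(0); tri = (a, ab, ca)
        elif left_of(p, bc, ab):      # corner child at b
            path.append(1); tri = (b, bc, ab)
        elif left_of(p, ca, bc):      # corner child at c
            path.append(2); tri = (c, ca, bc)
        else:                         # central child
            path.append(3); tri = (ab, bc, ca)
    return path

def trixel_count(level):
    return 8 * 4 ** level

def mean_trixel_side_arcsec(level):
    """Side of an equilateral triangle with the mean trixel area."""
    sky_arcsec2 = 4 * math.pi * (math.degrees(1) * 3600) ** 2
    area = sky_arcsec2 / trixel_count(level)
    return math.sqrt(4 * area / math.sqrt(3))

print(trixel_path((0.1, 0.2, 0.97), 5))
print(trixel_count(20), mean_trixel_side_arcsec(20))
```

At 20 subdivisions the mean side comes out at a few tenths of an arcsecond, which agrees in order of magnitude with the slide's figure of 0.3 if that figure is in arcseconds.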
30 Comparing Massive Point Datasets
- Want to cross match 10^9 objects with another set of 10^9 objects
- Using the HTM code, that is a LOT of calls (~10^6 seconds ~ a year)
- Want to do it in an hour
- Added 10 databases running in parallel and a bulk algorithm based on zones
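The zone-based bulk cross-match mentioned on this slide can be sketched as follows. This is a minimal in-memory Python illustration of the general zone idea (bucket one catalog into declination stripes, then compare each object only against nearby stripes and a narrow RA window). It is not the SQL implementation used for SDSS. The function names, the choice of zone height equal to the match radius, and the slightly widened RA window are assumptions of this sketch, and RA wraparound at 0/360 degrees is ignored.

```python
import math
from bisect import bisect_left, bisect_right
from collections import defaultdict

def ang_sep_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def zone_crossmatch(cat_a, cat_b, radius_deg):
    """Return (i, j) index pairs with cat_a[i] within radius_deg of cat_b[j]."""
    h = radius_deg  # zone height; assumed equal to the match radius
    zones = defaultdict(list)
    for j, (ra, dec) in enumerate(cat_b):
        zones[math.floor((dec + 90.0) / h)].append((ra, j))
    for rows in zones.values():
        rows.sort()  # sorted by RA so a window can be found by bisection

    matches = []
    for i, (ra, dec) in enumerate(cat_a):
        z_lo = math.floor((dec - radius_deg + 90.0) / h)
        z_hi = math.floor((dec + radius_deg + 90.0) / h)
        # RA window grows toward the poles; clamp to avoid division by zero.
        cos_dec = max(math.cos(math.radians(abs(dec) + radius_deg)), 1e-9)
        d_ra = radius_deg / cos_dec
        for z in range(z_lo, z_hi + 1):
            rows = zones.get(z, [])
            lo = bisect_left(rows, (ra - d_ra, -1))
            hi = bisect_right(rows, (ra + d_ra, float("inf")))
            for ra_b, j in rows[lo:hi]:
                if ang_sep_deg(ra, dec, ra_b, cat_b[j][1]) <= radius_deg:
                    matches.append((i, j))
    return matches

cat_a = [(10.0, 20.0), (50.0, -30.0)]
cat_b = [(10.0001, 20.0001), (10.5, 20.0), (50.0, -30.0002)]
print(zone_crossmatch(cat_a, cat_b, 1.0 / 3600))  # 1 arcsecond radius
```

The point of the design is that every object is compared against a handful of candidates rather than the whole other catalog, and that zones can be processed independently, which suits running several databases in parallel.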
31 Trends
- CMB Surveys (pixels): 1990 COBE Boomerang 10, CBI 50, WMAP 1 Million 2008 Planck 10 Million
- Angular Galaxy Surveys (obj): 1970 Lick 1M, 1990 APM 2M, 2005 SDSS 200M, 2008 VISTA 1000M, 2012 LSST 3000M
- Time Domain: QUEST, SDSS Extension survey, Dark Energy Camera, PanStarrs, SNAP, LSST
- Galaxy Redshift Surveys (obj): 1986 CfA LCRS dF SDSS
- Petabytes/year by the end of the decade
32 Analysis of Large Data Sets
- Data reside in multiple databases
- Applications access data dynamically
- Typical data set sizes in millions to billions
- Typical dimensionality 3-10
- Attributes are correlated in a non-linear fashion
- One cannot assume a global metric
- Data have non-negligible error
- Current examples:
  - 2-3 point angular correlations of about 10 million galaxies
  - 3D power spectrum of about 1 million galaxy distances
  - Image shear due to gravitational lensing in 2.5 Terapixels
  - Compare 1 million galaxy spectra to 100M model spectra
33 Next-Generation Data Analysis
- Looking for
  - Needles in haystacks: the Higgs particle
  - Haystacks: dark matter, dark energy
- Needles are easier than haystacks
- Optimal statistics have poor scaling
  - Correlation functions are N^2, likelihood techniques N^3
  - For large data sets main errors are not statistical
- As data and computers grow with Moore's Law, we can only keep up with N log N
- A way out?
  - Discard notion of optimal (data is fuzzy, answers are approximate)
  - Don't assume infinite computational resources or memory
- Requires combination of statistics & computer science
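The scaling argument on this slide can be checked numerically. The Python sketch below assumes, for illustration only, that both the data size N and the available compute speed double in each period, and it starts from an arbitrary N of one million. It then compares how the wall-clock time of N log N, N^2 and N^3 algorithms changes after ten such periods.

```python
import math

def relative_runtime(cost, periods, n0=1e6):
    """Runtime after `periods` doublings of data and compute, relative to today."""
    n = n0 * 2 ** periods      # data grows by 2x per period
    speedup = 2 ** periods     # compute grows by 2x per period (assumption)
    return cost(n) / speedup / cost(n0)

COSTS = {
    "N log N": lambda n: n * math.log(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

for name, cost in COSTS.items():
    print(f"{name:8s} {relative_runtime(cost, 10):.3g}")
```

Under these assumptions the N log N runtime grows only by the ratio of the logarithms, while N^2 grows by a factor of 2 per period and N^3 by a factor of 4 per period, which is the sense in which only N log N keeps up.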
34 Cosmological Simulations
- Galaxy Formation Simulations
  - Collaboration with V. Springel, S. White and G. Lemson
  - Loading the Millennium galaxies/halos into a database, build Peano-Hilbert curve spatial index
  - 20 queries: focused on merger history of halos
  - Semi-analytic star formation: linkage with population synthesis models
- Statistical studies (with Jie Wang)
  - 50 separate realizations of the large scale structure in Millennium
  - Analyzing non-gaussian signatures on large scales (PDF, 4th moments)
  - Currently loading these into the DB for analysis of nonlinearities on large scales
- Next: build fast ray tracing algorithms along light cone inside the DB
35 Simulation of Turbulence
- Collaboration with C. Meneveau, S. Chen, G. Eyink, R. Burns and E. Vishniac
- Take a 1024^3 simulation and store 1024 timesteps in a database
- Use hierarchical Peano-Hilbert curve index within each time step, atoms are 64^3 cells
- Database is tens of TB => build a cluster of cheap servers
- Currently 100TB in place, loader works
- Implementing low-level analysis tools inside DB
  - Interpolation schemes done, FFTW ported inside Yukon
- Cheap disks => redundant data storage in dual space
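The Peano-Hilbert indexing mentioned on this slide maps grid cells to a one-dimensional key so that cells close in the key are also close in space. The Python sketch below uses the widely published two-dimensional Hilbert curve construction to keep the example short. The simulation index described in the talk is three-dimensional, so this is an illustration of the idea and not the index used there. The last lines work out how many 64^3 atoms tile one 1024^3 time step.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n x n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Order the cells of an 8 x 8 grid along the curve.
n = 8
cells = sorted(((x, y) for x in range(n) for y in range(n)),
               key=lambda c: hilbert_index(n, c[0], c[1]))
print(cells[:8])

# Atoms per time step for the sizes quoted on the slide.
atoms_per_side = 1024 // 64
atoms_per_timestep = atoms_per_side ** 3
print(atoms_per_side, atoms_per_timestep)
```

Storing atoms in this key order means that a spatially compact query touches a small number of contiguous key ranges, which is what makes such an index attractive inside a database.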
36 Wireless Sensor Networks
- Collaboration with K. Szlavecz, A. Terzis, J. Gray, S. Ozer
- Building 200 node network to measure soil moisture for environmental monitoring
- Expect 200 million measurements/yr
- Deriving from the SkyServer template we were able to build an end-to-end system in less than two weeks
- Built an OLAP datacube, conditional sums along multiple dimensional axes
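The datacube mentioned on this slide precomputes sums for every combination of dimension values, with each dimension either fixed or rolled up. A minimal in-memory Python sketch follows. The field names (`node`, `depth_cm`, `month`, `moisture`) and the sample values are hypothetical and not taken from the project, and a production cube would live in the database rather than in a dictionary.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"  # marker for a dimension that has been summed over

def build_cube(rows, dims, measure):
    """Map each (dim values or ALL) key to [sum, count] of the measure."""
    cube = defaultdict(lambda: [0.0, 0])
    for row in rows:
        # Every subset of dimensions kept fixed yields one cell to update.
        for k in range(len(dims) + 1):
            for kept in combinations(dims, k):
                key = tuple(row[d] if d in kept else ALL for d in dims)
                cell = cube[key]
                cell[0] += row[measure]
                cell[1] += 1
    return cube

rows = [
    {"node": "n1", "depth_cm": 10, "month": "Jul", "moisture": 0.25},
    {"node": "n1", "depth_cm": 20, "month": "Jul", "moisture": 0.5},
    {"node": "n2", "depth_cm": 10, "month": "Aug", "moisture": 0.75},
]
cube = build_cube(rows, ("node", "depth_cm", "month"), "moisture")

print(cube[(ALL, ALL, ALL)])   # grand total and count
print(cube[("n1", ALL, ALL)])  # all measurements from node n1
print(cube[(ALL, 10, ALL)])    # all measurements at 10 cm depth
```

Keeping a count next to each sum lets conditional averages be read off the same cells.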
37 Summary
- Data growing exponentially
- Analyzing so much data requires a new model
- More data coming: Petabytes/year by 2010
  - Need scalable solutions
  - Move analysis to the data
  - Spatial and temporal features essential
  - 20 queries!
- Data explosion is coming from inexpensive sensors
- Same thing happening in all sciences
  - High energy physics, genomics, cancer research, medical imaging, oceanography, remote sensing, ...
- eScience: an emerging new branch of science
Astrophysics with Terabytes. Alex Szalay The Johns Hopkins University Jim Gray Microsoft Research
Astrophysics with Terabytes Alex Szalay The Johns Hopkins University Jim Gray Microsoft Research Living in an Exponential World Astronomers have a few hundred TB now 1 pixel (byte) / sq arc second ~ 4TB
More informationThe SDSS SkyServer and beyond. Alex Szalay
The SDSS SkyServer and beyond Alex Szalay Historical Background The Sloan Digital Sky Survey (SDSS) The Cosmic Genome Project 5 color images of ¼ of the sky Pictures of 300 million celestial objects Distances
More informationHow Simulations and Databases Play Nicely. Alex Szalay, JHU Gerard Lemson, MPA
How Simulations and Databases Play Nicely Alex Szalay, JHU Gerard Lemson, MPA An Exponential World Scientific data doubles every year caused by successive generations of inexpensive sensors + exponentially
More informationSharing SDSS Data with the World
The Sloan Digital Sky Survey Sharing SDSS Data with the World Ani Thakar Jordan Raddick Center for Astrophysical Sciences, The Johns Hopkins University Outline Sharing Astronomy Data SDSS Overview Data
More informationVIRTUAL OBSERVATORY TECHNOLOGIES
VIRTUAL OBSERVATORY TECHNOLOGIES / The Johns Hopkins University Moore s Law, Big Data! 2 Outline 3 SQL for Big Data Computing where the bytes are Database and GPU integration CUDA from SQL Data intensive
More informationWeb Services for the Virtual Observatory
Web Services for the Virtual Observatory Alexander S. Szalay, Johns Hopkins University Tamás Budavári, Johns Hopkins University Tanu Malik, Johns Hopkins University Jim Gray, Microsoft Research Ani Thakar,
More informationFrequently Asked Questions
Frequently Asked Questions Questions about SkyServer 1. What is SkyServer? 2. What is the Sloan Digital Sky Survey? 3. How do I get around the site? 4. What can I see on SkyServer? 5. Where in the sky
More informationData-Intensive Science Using GPUs. Alex Szalay, JHU
Data-Intensive Science Using GPUs Alex Szalay, JHU Data in HPC Simulations HPC is an instrument in its own right Largest simulations approach petabytes from supernovae to turbulence, biology and brain
More informationExtending the SDSS Batch Query System to the National Virtual Observatory Grid
Extending the SDSS Batch Query System to the National Virtual Observatory Grid María A. Nieto-Santisteban, William O'Mullane Nolan Li Tamás Budavári Alexander S. Szalay Aniruddha R. Thakar Johns Hopkins
More informationdan.fay@microsoft.com Scientific Data Intensive Computing Workshop 2004 Visualizing and Experiencing E 3 Data + Information: Provide a unique experience to reduce time to insight and knowledge through
More informationIntroduction to Grid Computing
Milestone 2 Include the names of the papers You only have a page be selective about what you include Be specific; summarize the authors contributions, not just what the paper is about. You might be able
More informationHarnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets
Page 1 of 5 1 Year 1 Proposal Harnessing Grid Resources to Enable the Dynamic Analysis of Large Astronomy Datasets Year 1 Progress Report & Year 2 Proposal In order to setup the context for this progress
More informationExtreme Data-Intensive Scientific Computing on GPUs. Alex Szalay JHU
Extreme Data-Intensive Scientific Computing on GPUs Alex Szalay JHU An Exponential World 1000 Scientific data doubles every year 100 caused by successive generations of inexpensive sensors + exponentially
More informationExtreme Data-Intensive. Alex Szalay The Johns Hopkins University
Extreme Data-Intensive Computing Alex Szalay The Johns Hopkins University Living in an Exponential World Scientific data doubles every year caused by successive generations of inexpensive sensors + exponentially
More informationSummary of Data Management Principles
Large Synoptic Survey Telescope (LSST) Summary of Data Management Principles Steven M. Kahn LPM-151 Latest Revision: June 30, 2015 Change Record Version Date Description Owner name 1 6/30/2015 Initial
More informationDatabases in the Cloud
Databases in the Cloud Ani Thakar Alex Szalay Nolan Li Center for Astrophysical Sciences and Institute for Data Intensive Engineering and Science (IDIES) The Johns Hopkins University Cloudy with a chance
More informationExploiting Virtual Observatory and Information Technology: Techniques for Astronomy
Exploiting Virtual Observatory and Information Technology: Techniques for Astronomy Nicholas Walton AstroGrid Project Scientist Institute of Astronomy, The University of Cambridge Lecture #3 Goal: Applications
More informationSTREAMING ALGORITHMS. Tamás Budavári / Johns Hopkins University ANALYSIS OF ASTRONOMY IMAGES & CATALOGS 10/26/2015
STREAMING ALGORITHMS ANALYSIS OF ASTRONOMY IMAGES & CATALOGS 10/26/2015 / Johns Hopkins University Astronomy Changed! Always been data-driven But we used to know the sources by heart! Today large collections
More informationHigh-Energy Physics Data-Storage Challenges
High-Energy Physics Data-Storage Challenges Richard P. Mount SLAC SC2003 Experimental HENP Understanding the quantum world requires: Repeated measurement billions of collisions Large (500 2000 physicist)
More informationTime Domain Alerts from LSST & ZTF
Time Domain Alerts from LSST & ZTF Eric Bellm University of Washington LSST DM Alert Production Science Lead ZTF Project Scientist -> ZTF Survey Scientist for the LSST Data Management Team and the ZTF
More informationThe Mission of IDIES
2018 Alex Szalay The Mission of IDIES Intellectual leadership in the Science of Big Data, together with MINDS Incubator for data intensive discoveries through disruptive assistance, increase JHU agility
More informationCollaborative data-driven science. Collaborative data-driven science. Mike Rippin
Collaborative data-driven science Mike Rippin Background and History of SciServer Major Objectives Current System SciServer Compute Now SciServer Compute Future Q&A 2 Collaborative data-driven science
More informationAdaptive selfcalibration for Allen Telescope Array imaging
Adaptive selfcalibration for Allen Telescope Array imaging Garrett Keating, William C. Barott & Melvyn Wright Radio Astronomy laboratory, University of California, Berkeley, CA, 94720 ABSTRACT Planned
More informationdan.fay@microsoft.com http://research.microsoft.com A Tidal Wave of Scientific Data Experimental Science Theoretical Science Newton s Laws, Maxwell s Equations Computational Science Simulation of complex
More informationYogesh Simmhan. escience Group Microsoft Research
External Research Yogesh Simmhan Group Microsoft Research Catharine van Ingen, Roger Barga, Microsoft Research Alex Szalay, Johns Hopkins University Jim Heasley, University of Hawaii Science is producing
More informationSome Reflections on Advanced Geocomputations and the Data Deluge
Some Reflections on Advanced Geocomputations and the Data Deluge J. A. Rod Blais Dept. of Geomatics Engineering Pacific Institute for the Mathematical Sciences University of Calgary, Calgary, AB www.ucalgary.ca/~blais
More informationThe IPAC Research Archives. Steve Groom IPAC / Caltech
The IPAC Research Archives Steve Groom IPAC / Caltech IPAC overview The Infrared Processing and Analysis Center (IPAC) at Caltech is dedicated to science operations, data archives, and community support
More informationSDSS Dataset and SkyServer Workloads
SDSS Dataset and SkyServer Workloads Overview Understanding the SDSS dataset composition and typical usage patterns is important for identifying strategies to optimize the performance of the AstroPortal
More informationIntroduction to Data Management CSE 344
Introduction to Data Management CSE 344 Lecture 25: Parallel Databases CSE 344 - Winter 2013 1 Announcements Webquiz due tonight last WQ! J HW7 due on Wednesday HW8 will be posted soon Will take more hours
More informationData publication and discovery with Globus
Data publication and discovery with Globus Questions and comments to outreach@globus.org The Globus data publication and discovery services make it easy for institutions and projects to establish collections,
More informationATA DRIVEN GLOBAL VISION CLOUD PLATFORM STRATEG N POWERFUL RELEVANT PERFORMANCE SOLUTION CLO IRTUAL BIG DATA SOLUTION ROI FLEXIBLE DATA DRIVEN V
ATA DRIVEN GLOBAL VISION CLOUD PLATFORM STRATEG N POWERFUL RELEVANT PERFORMANCE SOLUTION CLO IRTUAL BIG DATA SOLUTION ROI FLEXIBLE DATA DRIVEN V WHITE PAPER Create the Data Center of the Future Accelerate
More informationDistributed Archive System for the Cherenkov Telescope Array
Distributed Archive System for the Cherenkov Telescope Array RIA-653549 Eva Sciacca, S. Gallozzi, A. Antonelli, A. Costa INAF, Astrophysical Observatory of Catania INAF, Astronomical Observatory of Rome
More informationThe Virtual Observatory and the IVOA
The Virtual Observatory and the IVOA The Virtual Observatory Emergence of the Virtual Observatory concept by 2000 Concerns about the data avalanche, with in mind in particular very large surveys such as
More information2013 AWS Worldwide Public Sector Summit Washington, D.C.
2013 AWS Worldwide Public Sector Summit Washington, D.C. EMR for Fun and for Profit Ben Butler Sr. Manager, Big Data butlerb@amazon.com @bensbutler Overview 1. What is big data? 2. What is AWS Elastic
More informationModern Data Warehouse The New Approach to Azure BI
Modern Data Warehouse The New Approach to Azure BI History On-Premise SQL Server Big Data Solutions Technical Barriers Modern Analytics Platform On-Premise SQL Server Big Data Solutions Modern Analytics
More informationGiovanni Lamanna LAPP - Laboratoire d'annecy-le-vieux de Physique des Particules, Université de Savoie, CNRS/IN2P3, Annecy-le-Vieux, France
Giovanni Lamanna LAPP - Laboratoire d'annecy-le-vieux de Physique des Particules, Université de Savoie, CNRS/IN2P3, Annecy-le-Vieux, France ERF, Big data & Open data Brussels, 7-8 May 2014 EU-T0, Data
More informationBatch is back: CasJobs, serving multi-tb data on the Web
Batch is back: CasJobs, serving multi-tb data on the Web William O Mullane, Nolan Li, María Nieto-Santisteban, Alex Szalay, Ani Thakar The Johns Hopkins University Jim Gray Microsoft Research October 2005
More informationCosmic Peta-Scale Data Analysis at IN2P3
Cosmic Peta-Scale Data Analysis at IN2P3 Fabrice Jammes Scalable Data Systems Expert LSST Database and Data Access Software Developer Yvan Calas Senior research engineer LSST deputy project leader at CC-IN2P3
More informationThe Virtual Observatory in Australia Connecting to International Initiatives. Peter Lamb. CSIRO Mathematical & Information Sciences
The Virtual Observatory in Australia Connecting to International Initiatives Peter Lamb CSIRO Mathematical & Information Sciences The Grid & escience Convergence of high-performance computing, huge data
More informationCenter for Advanced Computing Research
Center for Advanced Computing Research DANSE Kickoff Meeting Mark Stalzer stalzer@caltech.edu August 15, 2006 CACR Mission and Partners Creating advanced computing methods to accelerate scientific discovery
More informationBig Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing
Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela
More informationVisualization & the CASA Viewer
Visualization & the Viewer Juergen Ott & the team Atacama Large Millimeter/submillimeter Array Expanded Very Large Array Robert C. Byrd Green Bank Telescope Very Long Baseline Array Visualization Goals:
More informationWorkload-Aware Data Partitioning in CommunityDriven Data Grids
Workload-Aware Data Partitioning in CommunityDriven Data Grids Tobias Scholl, Bernhard Bauer, Jessica Müller, Benjamin Gufler, Angelika Reiser, and Alfons Kemper Department of Computer Science, Germany
More informationASKAP Pipeline processing and simulations. Dr Matthew Whiting ASKAP Computing, CSIRO May 5th, 2010
ASKAP Pipeline processing and simulations Dr Matthew Whiting ASKAP Computing, CSIRO May 5th, 2010 ASKAP Computing Team Members Team members Marsfield: Tim Cornwell, Ben Humphreys, Juan Carlos Guzman, Malte
More informationTHE NATIONAL DATA SERVICE(S) & NDS CONSORTIUM A Call to Action for Accelerating Discovery Through Data Services we can Build Ed Seidel
THE NATIONAL DATA SERVICE(S) & NDS CONSORTIUM A Call to Action for Accelerating Discovery Through Data Services we can Build Ed Seidel National Center for Supercomputing Applications University of Illinois
More informationExperiences Running OGSA-DQP Queries Against a Heterogeneous Distributed Scientific Database
2009 15th International Conference on Parallel and Distributed Systems Experiences Running OGSA-DQP Queries Against a Heterogeneous Distributed Scientific Database Helen X Xiang Computer Science, University
More informationData Processing for SUBARU Telescope using GRID
Data Processing for SUBARU Telescope using GRID Y. Shirasaki M. Tanaka S. Kawanomoto S. Honda M. Ohishi Y. Mizumoto National Astronomical Observatory of Japan, 2-21-1, Osawa, Mitaka, Tokyo 181-8588, Japan
More informationBig Data - Some Words BIG DATA 8/31/2017. Introduction
BIG DATA Introduction Big Data - Some Words Connectivity Social Medias Share information Interactivity People Business Data Data mining Text mining Business Intelligence 1 What is Big Data Big Data means
More informationEfficient classification of billions of points into complex geographic regions using hierarchical triangular mesh
Efficient classification of billions of points into complex geographic regions using hierarchical triangular mesh Dániel Kondor 1, László Dobos 1, István Csabai 1, András Bodor 1, Gábor Vattay 1, Tamás
More informationData Intensive Computing SUBTITLE WITH TWO LINES OF TEXT IF NECESSARY PASIG June, 2009
Data Intensive Computing SUBTITLE WITH TWO LINES OF TEXT IF NECESSARY PASIG June, 2009 Presenter s Name Simon CW See Title & and Division HPC Cloud Computing Sun Microsystems Technology Center Sun Microsystems,
More informationScalable and Practical Probability Density Estimators for Scientific Anomaly Detection
Scalable PDEs p.1/107 Scalable and Practical Probability Density Estimators for Scientific Anomaly Detection Dan Pelleg Andrew Moore (chair) Manuela Veloso Geoff Gordon Nir Friedman, the Hebrew University
More information(1) Department of Physics University Federico II, Via Cinthia 24, I Napoli, Italy (2) INAF Astronomical Observatory of Capodimonte, Via
(1) Department of Physics University Federico II, Via Cinthia 24, I-80126 Napoli, Italy (2) INAF Astronomical Observatory of Capodimonte, Via Moiariello 16, I-80131 Napoli, Italy To measure the distance
More informationBig Computing and the Mitchell Institute for Fundamental Physics and Astronomy. David Toback