Network Support for Data Intensive Science
Eli Dart, Network Engineer
ESnet Network Engineering Group
ARN2 Workshop
Washington, DC
April 18, 2013
Overview
- Drivers
- Sociology
- Path Forward
4/19/13 2
Exponential Drivers
- We live in a world governed by exponentials
  - Genomics data set growth (~5x/year)
  - Moore's Law (~2x/18 months)
    - Sensors/Detectors
    - Computing
    - Network devices
- There is an approximate balance of sorts
  - Challenge: data growth
  - Response: data transport, storage, analysis
- The balance isn't perfect
- If the response part flattens out, it's time to get worried
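To make the two quoted rates comparable, the sketch below converts each to a per-year growth factor and computes how quickly the gap between them compounds. It uses only the rough figures on the slide (~5x/year and ~2x/18 months). The function name `annual_factor` and the 1000x threshold are illustrative assumptions, not part of the talk, and the comparison ignores every other part of the "response" side (money, people, protocols).

```python
import math

def annual_factor(multiple: float, period_months: float) -> float:
    """Convert 'multiple x per period_months' into an equivalent per-year growth factor."""
    return multiple ** (12.0 / period_months)

# Rough rates quoted on the slide.
data_growth = annual_factor(5.0, 12.0)    # genomics data sets: ~5x per year
moore_growth = annual_factor(2.0, 18.0)   # Moore's Law: ~2x per 18 months

# Factor by which data growth outpaces Moore's-Law growth each year.
yearly_gap = data_growth / moore_growth

# Years until the cumulative gap reaches 1000x (threshold chosen for illustration only).
years_to_1000x = math.log(1000.0) / math.log(yearly_gap)

print(f"Moore's Law per year: {moore_growth:.2f}x")
print(f"Data growth per year: {data_growth:.2f}x")
print(f"Yearly gap:           {yearly_gap:.2f}x")
print(f"Years to 1000x gap:   {years_to_1000x:.1f}")
```

Under these assumptions the gap itself grows exponentially, which is one way to read "the balance isn't perfect".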
Non-Exponential Components Matter
- Several important parts of the science ecosystem are not exponential
  - Money
  - People (including policies, which are rules among people)
  - Protocols
- Some of this is accounted for in the balance (e.g. money)
- People and protocols change very slowly
- We are currently in a place where the non-exponential components need to change, too
Paradigm Shift
- The transition to data intensive science is about more than data and science
- People need to change how they think
  - Learning new things takes time
  - This is true of science collaborations as well as the individual scientists
  - A few are able to adapt on their own
  - Most need to import expertise
- Patterns have emerged from contact with a wide variety of collaborations
Rough User Grouping By Data Set Size
[Figure: data scale (vertical axis, 10GB to 100PB) plotted against collaboration scale (horizontal axis). Labeled groups: small collaboration scale, e.g. light and neutron sources; medium collaboration scale, e.g. HPC codes; large collaboration scale, e.g. LHC. Annotation: a few large collaborations have internal software and networking organizations.]
Rough User Grouping Discussion (1)
- The chart is a crude generalization
  - It is not meant to describe specific collaborations, but to illustrate some common aspects of many collaborations
  - Data sets are constantly growing (the lines smear to the right)
- Small data instrument science
  - Light sources, microscopy, nanoscience centers, etc.
  - Typically small number of scientists per collaboration, many, many collaborations
  - Individual collaborations typically rely on site support and grad students
  - This group typically has difficulty moving data via the network
  - Science DMZs and Data Transfer Nodes (especially if deployed with Globus Online) are starting to help
Rough User Grouping Discussion (2)
- Supercomputer simulation science
  - Climate, fusion, bioinformatics, computational astrophysics, etc.
  - Larger collaborations, often multi-site
  - Reliant on supercomputer center staff for help with network issues, or on grad students
  - This group typically has difficulty transferring data via the network
    - Many users still want to use HPSS directly (often performs poorly)
    - Data Transfer Nodes are starting to help, especially when deployed with Globus Online
- Large data instrument science (HEP, NP)
  - Very large collaborations: multi-institution, multi-nation-state
  - Collaborations have their own software and networking shops
  - Typically able to use the network well, in some cases expert
Networks Depend On Others
- We all know that networking is end2end
- What this really means is that we are interdependent
  - I can't succeed if you fail
  - You can't succeed if I don't do my part
  - Our users must succeed for us to be viewed as successful
- The Science DMZ model helps
  - People are fixing the edges
  - There is more to do
- We must proactively help our constituents succeed
Differentiation
- How does a science network differentiate itself?
  - Make the impossible possible
  - Make the difficult routine
- The commodity world is advancing relentlessly (see exponentials)
  - However, the capability-class niche is probably not worth their time
  - As long as the U.S. is interested in scientific leadership, science networks will be necessary
  - I believe this dynamic holds both for regional and national networks
- We know our constituents
  - We can innovate if we maintain the flexibility to do so
  - We can build capability-class solutions for specific purposes
    - They don't need to scale to 100M users
    - If they scale to the right 10 experiments, there's a Nobel Prize
- However, this won't happen by itself: we must shape our own destiny
Questions?
Thanks!
Eli Dart - dart@es.net
http://www.es.net/
http://fasterdata.es.net/