Jožef Stefan Institute Scalability / Data / Tasks Meeting Scalability Requirements with Large Data and Complex Tasks: Adapting Existing Technologies and Best Practices in Slovenia Jan Jona Javoršek Jožef Stefan Institute jona.javorsek@ijs.si SLING Slovenian Initiative for National Grid http://www.ijs.si/ http://www.sling.si/
Historical: Zuse Z 23, CONVEX C3860, CDC Cyber 74
SLING connected centres: Arctur* 1024; Arnes 4400; Atos* 3000; CIPKeBiP 990; SiGNET 4200; UNG 120; R4* 1800; NSC 1800. 8 sites; > 18,000 cores (> 11,000 ARC-active); > 1 PB disk; > 4 million jobs / year; HPC, GPGPU, chroot; > 80% SLO capacity. Candidates: Meteo 2200; CI 2000; ME 1050
SLING users: Arnes NREN users; Cluster owners*; Projects*; Individual researchers; University professors; Student groups (*not always ARC)
Use Cases: Particle Physics (ATLAS, Pierre Auger); Theoretical Physics; Meteo/Geo Modelling; Fluid Dynamics; Reactor Physics Simulations (image: Pierre Auger Observatory)
Use Cases: Life Sciences, mostly computational (bio-)chemistry and genomics; IJS users (biology, chemistry, knowledge technologies); Collaboration with EMBL; Diagnostic genomics; ELIXIR
Use Cases: Knowledge technologies; Modelling for different fields; Genetic algorithms; Big/Web data analysis; Advanced computational linguistic models; CLARIN.si
Steam explosion moment
Power distribution for the Krško NPP reactor: parallel Monte Carlo simulation of neutron transport, F-8 department
Innovation? Batch system, virtualisation, network?
ARC and LRMS (batch system)
ARC Computing Element
ARC user accounts
Mix'n'match... CVMFS, Salt, CERN Agile model, Keystone, OpenCL, Ceph, Globus, NorduGrid ARC, gLite, Cinder, PKI, VOMS, Torque, dCache, OpenMP, SLURM, CUDA, OpenStack, gftp, Glance, OpenNebula, oVirt, Puppet, science portals, VRC
Software Deployment and Virtualization: Admin install; Environment Modules; Compile job; Run Time Environments; Install job; CHROOTs; Shared disk; Shared image; Containers; Docker; Shifter
Storage: Basic support; Short-term / local storage; Medium-term storage; Long-term storage
User-Facing Issues: Batch / ARC interface / PKI / VOMS; Software installations and use; Submission delays, error reporting and debugging; MPI scalability difficulties; Understanding of job and cluster topology; GPGPU use
Groups and Projects: Job and task management (scalability); Data management (task managers); Storage and throughput (hardware and cluster setup); Opportunistic resource use; Resource optimization (innovative job models)
ATLAS as an example: ~100 distributed sites; 250k cores used all the time; 200 PB of storage space; 1M jobs/day; 2 PB of data transferred per day between computing sites. Sites include: WLCG Grid sites, HPCs, clouds, volunteer computing
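The ATLAS figures above can be turned into average rates with a little arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal petabytes (10^15 bytes), a perfectly even load over 24 hours, and one core per job, none of which is stated on the slide. The variable names are illustrative.

```python
# Back-of-the-envelope rates derived from the ATLAS figures on the slide.
# Assumptions (not from the slide): decimal PB, uniform load over the day,
# and one core per job.
SECONDS_PER_DAY = 24 * 60 * 60
PB = 10**15  # bytes in a decimal petabyte

jobs_per_day = 1_000_000
cores_busy = 250_000
bytes_moved_per_day = 2 * PB

# Average job completion rate needed to sustain 1M jobs/day.
jobs_per_second = jobs_per_day / SECONDS_PER_DAY

# If 250k cores are always busy, each job gets this many core-hours on average.
core_hours_per_job = cores_busy * 24 / jobs_per_day

# Average aggregate network throughput between sites, in gigabits per second.
avg_gbit_per_s = bytes_moved_per_day * 8 / SECONDS_PER_DAY / 1e9

print(f"{jobs_per_second:.1f} jobs/s")
print(f"{core_hours_per_job:.1f} core-hours per job")
print(f"{avg_gbit_per_s:.0f} Gb/s average inter-site traffic")
```

Under these assumptions the system has to complete roughly a dozen jobs every second and sustain well over 100 Gb/s of aggregate inter-site traffic; real peaks will be higher than the averages.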
act: ARC Control Tower. Components: act Submitter, Status checker, Fetcher (app verification), Cleaner. Diagram labels: External job provider; App config; App engine; ARC config; ARC engine; Site 1, Site 2, Site 3 (each with an ARC CE and a cluster); App table; ARC table; DB (Oracle/MySQL)
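A minimal sketch of how the four components named above (Submitter, Status checker, Fetcher, Cleaner) might pass a job from one stage to the next. It is not the actual aCT code or database schema: the `Job` class, the state names, and the function names are all hypothetical, and a plain Python list stands in for the ARC table in the database.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    site: str
    state: str = "tosubmit"  # hypothetical state name, not aCT's schema


def submitter(jobs):
    """Hand every new job to its site's ARC CE (simulated by a state change)."""
    for job in jobs:
        if job.state == "tosubmit":
            job.state = "submitted"


def status_checker(jobs, finished_ids):
    """Mark submitted jobs that the remote site reports as finished."""
    for job in jobs:
        if job.state == "submitted" and job.job_id in finished_ids:
            job.state = "finished"


def fetcher(jobs):
    """Retrieve the output of finished jobs (simulated by a state change)."""
    for job in jobs:
        if job.state == "finished":
            job.state = "fetched"


def cleaner(jobs):
    """Drop jobs whose output has been fetched; return the remaining ones."""
    return [job for job in jobs if job.state != "fetched"]


def run_cycle(jobs, finished_ids):
    """Run one pass of all four components over the job table."""
    submitter(jobs)
    status_checker(jobs, finished_ids)
    fetcher(jobs)
    return cleaner(jobs)


if __name__ == "__main__":
    table = [Job(1, "Site 1"), Job(2, "Site 2")]
    remaining = run_cycle(table, finished_ids={1})
    for job in remaining:
        print(job.job_id, job.site, job.state)
```

Keeping each component as an independent pass over a shared table means the stages only communicate through job state, so in principle any of them can be restarted or scaled on its own.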
Opportunistic Resource Use: Grid clusters; HPC clusters; Private computers; Public (commercial) cloud; Microjobs
ATLAS scaling. 2010: Planned data distribution; Jobs go to data; Multi-hop data flows; Poor T2 networking across regions; ~20 AOD copies distributed worldwide. 2013: Planned & dynamic data distribution; Jobs go to data & data to free sites; Direct data flows for most T2s; Many T2s connected to 10 Gb/s links; 4 AOD copies distributed worldwide
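The shift from "jobs go to data" to "jobs go to data & data to free sites" can be illustrated with a toy brokering function. This is a sketch under stated assumptions, not ATLAS's actual brokering logic: `choose_site`, the `dynamic` flag, and the site and dataset names are invented for illustration.

```python
def choose_site(dataset, replicas, free_slots, dynamic=False):
    """Pick a site for a job that reads `dataset`.

    replicas:   dict mapping dataset name -> list of sites holding a copy
    free_slots: dict mapping site name -> number of free job slots
    dynamic:    False models "jobs go to data" only; True also allows
                sending the data to a site that has free slots.
    Returns (site, needs_transfer), or (None, False) if the job must wait.
    """
    # First preference in both models: a site that already holds the data.
    holders = [s for s in replicas.get(dataset, []) if free_slots.get(s, 0) > 0]
    if holders:
        return max(holders, key=lambda s: free_slots[s]), False

    # Dynamic model only: fall back to any free site and transfer the data.
    if dynamic:
        candidates = [s for s, n in free_slots.items() if n > 0]
        if candidates:
            return max(candidates, key=lambda s: free_slots[s]), True

    return None, False


if __name__ == "__main__":
    replicas = {"AOD.example": ["T1_A"]}      # hypothetical dataset and sites
    free_slots = {"T1_A": 0, "T2_B": 5}
    print(choose_site("AOD.example", replicas, free_slots))
    print(choose_site("AOD.example", replicas, free_slots, dynamic=True))
```

With the static policy the job waits until the only site holding the data has a free slot; with the dynamic policy it runs at once on the free site at the cost of one transfer, which is one reason fewer pre-placed copies can suffice when links are fast.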
Social Component: Accessibility beyond large projects; Long-term funding; Perception of public clouds; "Not invented here" syndrome; Users with no Unix experience; Sustainability pressure
People Involved: Andrej Filipčič, JSI; Barbara Krašovec, Arnes, JSI; Dejan Lesjak, JSI; Janez Srakar, JSI; Jan Jona Javoršek, JSI; + 4 site administrators. National Initiative: http://www.sling.si/
Thanks! Questions?
New Computing Centre: 200 m²; slightly off-site; New network installation; Water cooling; Not enough power on-site yet; Housing Pikolit, NSC, parts of others; Interesting issues on cost sharing...
New Cluster: Grid + HPC; GPGPU: 16 x K80; NorduGrid ARC + SLURM; Considering EGI. Users: IJS departments; related research; supported EU infrastructures. NSC Cluster in Numbers: ~1800 cores; ~35 TB scratch; ~35 TB storage; ~8 TB RAM
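The NSC figures above can be read per core, which is often how users size their jobs. A rough sketch, assuming decimal units (1 TB = 1000 GB) and an even split of resources across cores, which the slide does not state:

```python
# Per-core view of the approximate NSC cluster figures from the slide.
# Assumptions (not from the slide): decimal units and an even split per core.
cores = 1800
ram_tb = 8
scratch_tb = 35

GB_PER_TB = 1000

ram_gb_per_core = ram_tb * GB_PER_TB / cores
scratch_gb_per_core = scratch_tb * GB_PER_TB / cores

print(f"~{ram_gb_per_core:.1f} GB RAM per core")
print(f"~{scratch_gb_per_core:.1f} GB scratch per core")
```

Since the slide's numbers are themselves approximate, the results are only indicative: a few gigabytes of memory and a few tens of gigabytes of scratch space per core.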