Powering the Road to National HPC Leadership
1 Powering the Road to National HPC Leadership
Jack C. Wells, Director of Science, Oak Ridge Leadership Computing Facility / Oak Ridge National Laboratory
Join the Conversation #OpenPOWERSummit
2 Powering the Road to National HPC Leadership
Jack C. Wells, Director of Science, Oak Ridge Leadership Computing Facility, Oak Ridge National Laboratory
2018 OpenPOWER Summit, Las Vegas, 19 March 2018
ORNL is managed by UT-Battelle for the US Department of Energy. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Some of the work presented here is from the Total and Oak Ridge National Laboratory collaboration, which is done under CRADA agreement NFE-… Some of the experiments were supported by an allocation of advanced computing resources provided by the National Science Foundation. The computations were performed on Nautilus at the National Institute for Computational Sciences.
3 A"Little"About"ORNL Oak$Ridge$National$ Laboratory$is$the$ largest$us$ Department$of$ Energy$(DOE)$open$ science$laboratory$ Oak Ridge, Tennessee
4 What is a Leadership Computing Facility (LCF)?
- Collaborative DOE Office of Science user-facility program at ORNL and ANL
- Mission: provide the computational and data resources required to solve the most challenging problems
- Two centers, two architectures, to address the diverse and growing computational needs of the scientific community
- Highly competitive user allocation programs (INCITE, ALCC)
- Projects receive 10x to 100x more resource than at other generally available centers
- LCF centers partner with users to enable science & engineering breakthroughs (Liaisons, Catalysts)
5 ORNL has systematically delivered a series of leadership-class systems
On scope. On budget. Within schedule.
Titan, five years old in October 2017, continues to deliver world-class science research in support of our user community. We will operate Titan through 2019, when it will be decommissioned.
[Timeline figure, OLCF-1 through OLCF-3: a …-fold improvement in 8 years]
- 2004: Cray X1E Phoenix, 18.5 TF
- 2005: Cray XT3 Jaguar, 25 TF
- 2006: Cray XT3 Jaguar, 54 TF
- 2007: Cray XT4 Jaguar, 62 TF
- 2008: Cray XT4 Jaguar, 263 TF
- 2008: Cray XT5 Jaguar, 1 PF
- 2009: Cray XT5 Jaguar, 2.5 PF
- 2012: Cray XK7 Titan, 27 PF
6 We are building on this record of success to enable exascale in 2021
[Roadmap figure, OLCF-4 to OLCF-5: a 500-fold improvement in 9 years]
- 2012: Cray XK7 Titan, 27 PF
- 2018: IBM Summit, 200 PF
- 2021: Frontier, ~1 EF
7 Coming in 2018: Summit will replace Titan as the OLCF's leadership supercomputer
Summit, slated to be more powerful than any other existing supercomputer, is the Department of Energy's Oak Ridge National Laboratory's newest supercomputer for open science.
8 Summit Overview
Compute node: 2 x IBM POWER9 (22 cores, 4 threads/core, NVLink), 6 x NVIDIA GV100 (7 TF, NVLink), coherent shared memory, 512 GB DRAM (DDR4), 96 GB HBM2 (3D-stacked), 1600 GB NVMe-compatible PCIe SSD, 25 GB/s EDR IB (2 ports)
Compute rack: 18 compute servers, 39.7 TB memory/rack, 55 kW max power/rack; warm water (70°F) for direct-cooled components, RDHX for air-cooled components
Compute system: 256 compute racks, 4,608 compute nodes, 10.2 PB total memory, Mellanox EDR IB fabric, 200 PFLOPS, ~13 MW
GPFS file system: 250 PB storage, 2.5 TB/s read, 2.5 TB/s write
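The system-level figures follow directly from the per-node ones. Here is a quick back-of-envelope check (a Python sketch; all inputs are the slide's numbers, with GB/PB treated as decimal units):

```python
# Back-of-envelope check of Summit's headline numbers from per-node specs.
# All inputs come from the slide above.

NODES = 4_608
NODES_PER_RACK = 18
GPUS_PER_NODE = 6
GPU_TFLOPS = 7.0                        # double-precision peak per GV100
DDR4_GB, HBM2_GB, NVME_GB = 512, 96, 1_600

racks = NODES // NODES_PER_RACK
gpu_peak_pf = NODES * GPUS_PER_NODE * GPU_TFLOPS / 1_000
total_mem_pb = NODES * (DDR4_GB + HBM2_GB + NVME_GB) / 1e6

print(f"racks:    {racks}")             # 256, as quoted
print(f"GPU peak: {gpu_peak_pf:.0f} PF")  # ~194 PF; the quoted 200 PF
                                          # also counts the POWER9 CPUs
print(f"memory:   {total_mem_pb:.1f} PB") # ~10.2 PB (DDR4 + HBM2 + NVMe)
```

Note that the 10.2 PB "total memory" figure only works out if the per-node NVMe SSD is counted alongside DDR4 and HBM2.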
9 Summit Node Overview
[Node diagram] Per-node totals: 42 TF (6 x 7 TF GPUs), 96 GB HBM2 (6 x 16 GB), 512 GB DRAM (2 x 16 x 16 GB DDR4), 25 GB/s network (2 x 12.5 GB/s EDR IB ports), 83 MMsg/s
Link speeds: HBM2 900 GB/s per GPU; DRAM 135 GB/s per POWER9 socket; NVLink 50 GB/s between each POWER9 and its GPUs and between GPUs; X-Bus (SMP) 64 GB/s between the two POWER9s; PCIe Gen4 16 GB/s from each POWER9 toward the NIC; NVMe 6.0 GB/s read, 2.2 GB/s write
HBM and DRAM speeds are aggregate (read + write). All other speeds (X-Bus, NVLink, PCIe, IB) are bidirectional.
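The takeaway from the diagram is how steeply bandwidth falls off with distance from the GPU. A minimal sketch of that hierarchy, using the quoted link speeds (the transfer-time helper is illustrative only and ignores latency and protocol overhead):

```python
# Summit node bandwidth hierarchy, figures as quoted on the slide above.
# Illustrates why data placement matters: the same buffer costs ~70x
# more to move over one IB port than to stream out of HBM2.

LINK_GBPS = {
    "HBM2 (per GPU, aggregate)":        900.0,
    "DDR4 (per socket, aggregate)":     135.0,
    "X-Bus (P9 <-> P9)":                 64.0,
    "NVLink (P9<->GPU, GPU<->GPU)":      50.0,
    "EDR IB (per port)":                 12.5,
    "NVMe read":                          6.0,
}

def transfer_ms(gigabytes: float, link: str) -> float:
    """Idealized time to move `gigabytes` over `link`, in milliseconds."""
    return gigabytes / LINK_GBPS[link] * 1_000

# Time to move one GPU's worth of HBM2 (16 GB) over each link:
for link in LINK_GBPS:
    print(f"{link:32s} 16 GB in {transfer_ms(16, link):7.1f} ms")
```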
10 Coming in 2018: Summit will replace Titan as the OLCF's leadership supercomputer
Feature | Titan | Summit
Application performance | baseline | 5-10x Titan
Number of nodes | 18,688 | 4,608
Node performance | 1.4 TF | 42 TF
Memory per node | 32 GB DDR3 + 6 GB GDDR5 | 512 GB DDR4 + 96 GB HBM2
NV memory per node | 0 | 1600 GB
Total system memory | 710 TB | >10 PB (DDR4 + HBM2 + non-volatile)
System interconnect | Gemini (6.4 GB/s) | dual-rail EDR IB (25 GB/s)
Interconnect topology | 3D torus | non-blocking fat tree
Bi-section bandwidth | 15.6 TB/s | … TB/s
Processors | 1 AMD Opteron + 1 NVIDIA Kepler | 2 IBM POWER9 + 6 NVIDIA Volta
File system | 32 PB, 1 TB/s, Lustre | 250 PB, 2.5 TB/s, GPFS
Power consumption | 9 MW | 13 MW
In short: many fewer nodes, but much more powerful nodes; much more memory per node and in total; a faster interconnect; much higher bandwidth between CPUs and GPUs; and a much larger and faster file system.
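A quick ratio check on the table makes the "fewer but more powerful nodes" point concrete (a sketch; inputs are the table's figures, and peak-FLOP ratios only bound, not predict, application speedup):

```python
# Titan -> Summit ratios, computed from the comparison table above.

titan  = {"nodes": 18_688, "node_tf": 1.4}
summit = {"nodes":  4_608, "node_tf": 42.0}

node_ratio   = summit["nodes"] / titan["nodes"]      # ~0.25x the nodes
perf_ratio   = summit["node_tf"] / titan["node_tf"]  # 30x per node
system_ratio = node_ratio * perf_ratio               # ~7.4x system peak

print(f"nodes: {node_ratio:.2f}x, per-node: {perf_ratio:.0f}x, "
      f"system peak: {system_ratio:.1f}x")  # cf. the 5-10x application target
```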
11 What is CORAL? The program through which Summit & Sierra are procured
Several DOE labs have strong supercomputing programs and facilities. To bring the next generation of leading supercomputers to these labs, DOE created CORAL (the Collaboration of Oak Ridge, Argonne, and Livermore) to jointly procure these systems and, in so doing, align strategy and resources across the DOE enterprise.
The grouping of DOE labs was based on common acquisition timings; the collaboration is a win-win for all parties.
"Summit" system and "Sierra" system: OpenPOWER technologies (IBM POWER CPUs, NVIDIA Tesla GPUs, Mellanox EDR 100 Gb/s InfiniBand), paving the road to exascale performance.
12 OLCF Program to Ready Application Developers and Users
We are preparing users through:
- Application readiness and early science through the Center for Accelerated Application Readiness (CAAR)
- Training and web-based documentation
- Early access on SummitDev and the Summit Phase I system (already accepted)
- Access for the broader user base on the final, accepted Phase II system
Goals: early science achievements; demonstrated application readiness; prepared INCITE & ALCC proposals; a Summit hardened for full-user operations.
13 Summit Early Science Program (ESP)
We put out a Call for Proposals in December 2017, resulting in 62 Letters of Intent (LOI) received by year's end:
- 27 are from PIs at universities
- 32 are from PIs at national laboratories or research institutions (DOE, NASA)
- 14 are CAAR project-related LOIs
- 27 have had past INCITE allocations
- 9 have had past ALCC allocations
- 15 have connections to the US DOE Exascale Computing Project
- 9 are AI- or deep learning-related
Proposals are due at the beginning of June. ESP users will gain full access to Summit for early science later this year.
14 Summit will be the world's smartest supercomputer for open science
But what makes a supercomputer smart? Summit provides unprecedented opportunities for the integration of artificial intelligence (AI) and scientific discovery. Here's why:
GPU Brawn: Summit links more than 27,000 deep-learning-optimized NVIDIA GPUs with the potential to deliver exascale-level performance (a billion billion calculations per second) for AI applications.
High-Speed Data Movement: NVLink high-bandwidth technology built into all of Summit's processors supplies the next-generation "information superhighways" needed to quickly train deep learning algorithms for challenging science problems.
Memory Where It Matters: Summit's sizable local memory gives AI researchers a convenient launching point for data-intensive tasks, an asset that allows for faster AI training and greater algorithmic accuracy.
[Photo: one of Summit's 4,600 IBM AC922 nodes. Each node contains six NVIDIA Volta GPUs and two IBM POWER9 CPUs, giving scientists new opportunities to automate, accelerate, and drive understanding using artificial intelligence techniques.]
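For scale: 4,608 nodes x 6 GPUs yields the "more than 27,000" figure, and the exascale-level claim rests on reduced-precision Tensor Core throughput. A sketch of the arithmetic, assuming NVIDIA's commonly quoted ~125 TF mixed-precision peak per V100 (an assumption on our part; the slide gives no per-GPU AI figure):

```python
# Where "exascale-level performance for AI" comes from. The 125 TF/GPU
# Tensor Core figure is NVIDIA's commonly quoted V100 mixed-precision
# peak -- an assumption, not a number stated on this slide.

NODES, GPUS_PER_NODE = 4_608, 6
FP64_TF_PER_GPU = 7.0        # from the system overview slide
TENSOR_TF_PER_GPU = 125.0    # assumed V100 FP16 Tensor Core peak

gpus = NODES * GPUS_PER_NODE
print(f"GPUs: {gpus:,}")                                    # 27,648
print(f"FP64 peak:   {gpus * FP64_TF_PER_GPU / 1e6:.2f} EF")    # ~0.19 EF
print(f"Tensor peak: {gpus * TENSOR_TF_PER_GPU / 1e6:.2f} EF")  # ~3.5 EF
```

Under that assumption the machine's mixed-precision peak lands in the low exaflops, which is the sense in which "exascale-level for AI" is meant.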
15 Summit will be the world's smartest supercomputer for open science
But what can a smart supercomputer do? Science challenges for a smart supercomputer:
Identifying Next-Generation Materials: By training AI algorithms to predict material properties from experimental data, longstanding questions about material behavior at atomic scales could be answered, leading to better batteries, more resilient building materials, and more efficient semiconductors.
Predicting Fusion Energy: Predictive AI software is already helping scientists anticipate disruptions to the volatile plasmas inside experimental reactors. Summit's arrival allows researchers to take this work to the next level and further integrate AI with fusion technology.
Deciphering High-Energy Physics Data: With AI supercomputing, physicists can lean on machines to identify important pieces of information in data too massive for any single human to handle, data that could change our understanding of the universe.
Combating Cancer: Through the development of scalable deep neural networks, scientists at the US Department of Energy and the National Cancer Institute are making strides in improving cancer diagnosis and treatment.
16 Summit is still under construction
We expect to accept the machine in the summer of 2018, allow early users on this year, and allocate time to our first users through the INCITE program in January 2019. We are continuing node and file-storage installation and software testing.
17 Questions? Jack Wells