Labs Software and Tools for HPE's The Machine Project. Scalable Tools Workshop, Aug 1 - Aug 4, 2016, Lake Tahoe. Milind Chabbi
Traditional Computing Paradigm: [figure: CPUs, each with its own attached DRAM] CPU-centric computing 2
CPU-Centric Terminology: last branch record; line fill buffer; "core-originated cacheable demand requests that refer to L3"; STALLED-CYCLES-FRONTEND; IDLE-CYCLES-FRONTEND? 3
Where is the problem? CPU or Memory subsystem? 4 Figure credit: http://lobojosden.blogspot.com/2010/02/800-pound-gorilla.html
The Machine: [figure: many SOCs sharing a large pool of non-volatile memory] Memory-driven computing 5
Electrons for computation, photons for communication, ions for storage 6
Features: massive pool of non-volatile memory (fabric-attached memory, FAM); many many-core SOCs; fast optical interconnect; system-wide load-store access. Prototypes: non-coherent memory, coherent SOCs; each SOC will run its own OS instance. 7
Software Portfolio: Linux for The Machine. Atlas: NVM programming to enable legacy lock-based multi-threaded code to utilize NVM, ensuring persistent consistency in the event of process crashes. Lock-free programming on FAM. Large-scale graph analytics and machine learning. Spark: in-memory MapReduce. Application scaling: massive-scale sorting. 8
Sort Benchmark (http://sortbenchmark.org/). Kinds: Gray sort: 100 TB of data, uniform random (Daytona) 100-byte records with 10-byte keys, or skewed (Indy); Minute sort; Joule sort. 9
2015 Gray Sort winner: FuxiSort by Alibaba Group Inc. 100 TB in 377 seconds. 3,134 nodes x 2 Xeon E5-2630 @ 2.30 GHz, 96 GB memory, 12x2 TB SATA HDD, 10 Gb/s Ethernet; 243 nodes x 2 Xeon E5-2650v2 @ 2.60 GHz, 128 GB memory, 12x2 TB SATA HDD, 10 Gb/s Ethernet. Total resources: (3,134 x 2 x 6 x 2 + 243 x 2 x 8 x 2) = 82,992 hardware threads, 324 TB DRAM. Efficiency: 100 TB / 377 sec / 82,992 hw-threads = 3.2 MB/sec/hw-thread. 10
Sort on HPE Integrity Superdome X: 8 blades, 16 sockets, 15-core Intel E7-8880 v2 (Ivy Bridge); 16 sockets x 15 cores x 2-way SMT = 480 hw-threads; 24 TB DRAM (new machines come with 48 TB). A simulated 16-SOC machine. 11
Sorting Strategy: [figure: per-SOC flow; each SOC generates raw data in its DRAM, samples the raw data to pick partition boundaries (A, B, ..., K, L), then repeatedly reads one partition into DRAM, sorts it, and writes the sorted run back] 12
Pipelined Sort: [figure: sequential FAM read -> DRAM sort -> FAM write vs. a pipelined schedule in which the read, sort, and write stages of successive chunks overlap] 13
Which Local Sorting Algorithm? Options: Parallel quicksort: T1/T∞ = 30 for 1B keys, so it can't scale beyond 30 cores. Parallel sample sort: great, except for TLB capacity. Parallel merge sort: T1/T∞ in the millions for 1B keys, but temporary buffer allocation and deallocation can cause memory management issues. Bitonic sorting: great scaling but poor absolute performance. We use GNU parallel sort: a spaghetti code of parallel merge sort with some adaptive algorithms, and the best absolute performance on the machine we tested. 14
Sort4TM v0.1 on Superdome X Achieved 1.15 MB/sec/hw-thread for a 10 TB sort 15
Sort4TM v0.1 on Superdome X: achieved 1.15 MB/sec/hw-thread for a 10 TB sort. [figure: a timeline trace of Sort4TM via HPCToolkit] Sort threads were idle, waiting for work in the runtime library. Cause: a solo DRAM-to-FAM data-mover thread delaying the other sort worker threads. Solution: dedicate more data movers. 16
Sort4TM v0.1 on Superdome X: achieved 1.15 MB/sec/hw-thread for a 10 TB sort. [figure: thread timeline in which the DRAM-sort threads sit idle behind a single FAM->DRAM reader and DRAM->FAM writer] Problem: idleness; a solo DRAM-to-FAM data mover delays the other workers. Solution: dedicate more data movers to load balance. 17
Sort4TM v0.1 on Superdome X: achieved 1.15 MB/sec/hw-thread for a 10 TB sort. [figure: thread timeline after the fix, with multiple FAM->DRAM readers and DRAM->FAM writers keeping the DRAM-sort threads busy] Problem: idleness; a solo DRAM-to-FAM data mover delays the other workers. Solution: dedicate more data movers to load balance. 18
Sort4TM v0.2 on Superdome X: multiple writers; achieved 100 GB in ~9 seconds, i.e., 22 MB/sec/hw-thread. Observed load imbalance across SOCs. [figure: a timeline trace of Sort4TM via HPCToolkit] 19
Sort4TM v0.3 on Superdome X: now the CPUs are at least working, but quite inefficiently! 40% of time in memcmp (comparing two keys); locks were used inefficiently (15% of time waiting on locks). Solution: replaced byte-by-byte memcmp with a fast comparator (30% speedup); replaced naive library locks (point-to-point) with user-land barriers (collective) (10% speedup). 20
Redundancies in Large Code Bases
Redundancies arise and go unnoticed in large code bases; Sort4TM is a 20K-line code base! An example redundancy: dead initialization of a large buffer array.
    class KEY {
        KEY() { memset( ... ); memset( ... ); memset( ... ); }   // zeroes every field
    };
    for (int i = 0; i < N; i++) {
        KEY *drambuffer = new KEY[numItems];    // constructor memsets the whole buffer...
        Copy(drambuffer, famItems, numItems);   // ...which Copy immediately overwrites
        // work on items
        delete [] drambuffer;
    }
21
Tool to Pinpoint Wasteful Memory Accesses
Redundancies arise and go unnoticed in large code bases; Sort4TM is a 20K-line code base! An example redundancy: dead initialization of a large buffer array.
    class KEY {
        KEY() { memset( ... ); memset( ... ); memset( ... ); }   // the redundancy: dead writes
    };
    for (int i = 0; i < N; i++) {
        KEY *drambuffer = new KEY[numItems];
        Copy(drambuffer, famItems, numItems);
        // work on items
        delete [] drambuffer;
    }
DeadSpy: a tool to pinpoint program inefficiencies [Chabbi and Mellor-Crummey, CGO'12] 22
Eliminating Redundancy
Redundancies arise and go unnoticed in large code bases; Sort4TM is a 20K-line code base! An example redundancy: dead initialization of a large buffer array. The fix: hoist the allocation (and its one-time initialization) out of the loop.
    KEY *drambuffer = new KEY[maxItems];    // allocate once, outside the loop
    for (int i = 0; i < N; i++) {
        Copy(drambuffer, famItems, numItems);
        // work on items
    }
    delete [] drambuffer;
23
Ineffective Pipeline Parallelism: [figure: expected schedule with Read, Sort, and Write fully overlapped vs. reality, where the Sort stage slows down whenever it overlaps Read or Write] 24
Investigation
Is the cache getting polluted? No: data movement is already non-temporal for Read and Sort.
Are we sacrificing concurrency during sorting? No: increasing the number of sorters barely helped beyond a point.
Threw all tools at our disposal: VTune, HPCToolkit, perf, internal bandwidth monitoring.
Findings: low IPC; DRAM, QPI, and ASIC bandwidths are well below saturation; the Sort phase has high L1 and L2 hit rates; very high branch misprediction.
Inference: the Sort phase has heavy data-dependent computation; a few DRAM-bound loads are so critical (latency-sensitive) that the slightest overlap with a bandwidth-bound workload severely degrades sorting. 25
Semantic Gap Between Tools and Programs
The programmer has little clue about the causes of losses arising from microarchitectural limitations; the low-level information is arcane, and no single tool gives a holistic view.
VTune is heavyweight (in addition to being heavy on money): we never managed to run its HPC analysis to completion, and its memory analysis is at least 3x slower and often ran out of /tmp disk space.
HPCToolkit leaves everything to the user and offers little automatic analysis of the various counters: if I already know which arcane counter to profile, perhaps I already know the problem.
It is unclear what kind of analysis can immediately point to the root cause; it is not even clear what the root cause is.
Shouldn't processors have bandwidth provisioning to better support latency-bound accesses? 26
Workaround for Architectural Limitations: leave the sort alone. [figure: schedule in which the Read and Write stages of adjacent chunks overlap one another, but the Sort stage never overlaps data movement] Perhaps other sorting algorithms may do better, but we already investigated many! 27
Resulting Sort4TM: 10 TB in 193 seconds on 480 hw-threads of DragonHawk; sort rate 108 MB/sec/hw-thread, a 93x improvement over the initial 1.15 MB/sec/hw-thread; 20 TB @ 107 MB/sec/hw-thread. Scales well, but will not scale linearly. (FuxiSort from Alibaba: 3.2 MB/sec/hw-thread.) 28
Road Ahead: performance monitoring capabilities for various data movements (DRAM<->FAM, FAM<->FAM); a tool ecosystem for other architectures; tools under the emulation environment; more applications to exploit a large-memory machine; formal verification of parallel algorithms. Each application poses its own unique challenges when scaled to the next level, e.g., each random access in a large hash table causes TLB misses. 29