Software and Tools for HPE's The Machine Project


Labs
Software and Tools for HPE's The Machine Project
Scalable Tools Workshop, Aug 1 - Aug 4, 2016, Lake Tahoe
Milind Chabbi

Traditional Computing Paradigm
[Figure: CPUs, each attached to its own DRAM. Caption: CPU-centric computing]

CPU-Centric Terminology
- Last branch record
- Line fill buffer
- Core-originated cacheable demand requests that refer to L3
- STALLED-CYCLES-FRONTEND
- IDLE-CYCLES-FRONTEND?

Where is the problem? CPU or memory subsystem?
Figure credit: http://lobojosden.blogspot.com/2010/02/800-pound-gorilla.html

The Machine
[Figure: SOCs surrounding a large pool of non-volatile memory. Caption: Memory-driven computing]

- Electrons for computation
- Photons for communication
- Ions for storage

Features
- Massive pool of non-volatile memory (FAM)
- Many many-core SOCs
- Fast optical interconnect
- System-wide load-store access

Prototypes
- Non-coherent memory, coherent SOCs
- Each SOC will run its own instance of the OS

Software Portfolio
- Linux for The Machine
- Atlas: NVM programming
  - Enables legacy lock-based multi-threaded code to utilize NVM
  - Ensures persistent consistency in the event of process crashes
- Lock-free programming on FAM
- Large-scale graph analytics and machine learning
- Spark: in-memory MapReduce
- Application scaling: massive-scale sorting

Sort Benchmark
- Sort Benchmark: http://sortbenchmark.org/
- Kinds:
  - Gray sort: 100 TB of data
    - Uniform random (Daytona): 100-byte records with 10-byte keys
    - Skewed (Indy)
  - Minute sort
  - Joule sort

2015 Gray Sort Winner: FuxiSort by Alibaba Group Inc.
- 100 TB in 377 seconds
- 3,134 nodes x 2 Xeon E5-2630 @ 2.30 GHz, 96 GB memory, 12 x 2 TB SATA HD, 10 Gb/s Ethernet
- 243 nodes x 2 Xeon E5-2650v2 @ 2.60 GHz, 128 GB memory, 12 x 2 TB SATA HD, 10 Gb/s Ethernet
- Total resources:
  - (3,134 x 2 x 6 x 2 + 243 x 2 x 8 x 2) = 82,992 hardware threads
  - 324 TB DRAM
- Efficiency: 100 TB / 377 sec / 82,992 hw-threads = 3.2 MB/sec/hw-thread
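The efficiency figure above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the helper names are made up for this example, and it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), which the slide does not state.

```cpp
#include <cassert>
#include <cstdint>

// Hardware threads in a homogeneous group of nodes:
// nodes x sockets per node x cores per socket x SMT ways per core.
// (Hypothetical helper, named for this example only.)
constexpr std::uint64_t hwThreads(std::uint64_t nodes, std::uint64_t sockets,
                                  std::uint64_t cores, std::uint64_t smt) {
    return nodes * sockets * cores * smt;
}

// Sort rate in MB/sec per hardware thread.
// Assumption: decimal units, so 1 TB = 1e6 MB.
double mbPerSecPerThread(double terabytes, double seconds,
                         std::uint64_t threads) {
    assert(seconds > 0.0 && threads > 0);
    return terabytes * 1e6 / seconds / static_cast<double>(threads);
}

// FuxiSort cluster as listed on the slide: two node groups.
constexpr std::uint64_t kFuxiThreads =
    hwThreads(3134, 2, 6, 2) + hwThreads(243, 2, 8, 2);
```

Calling `mbPerSecPerThread(100, 377, kFuxiThreads)` gives roughly 3.2, matching the slide. The same helper can be applied to the Sort4TM numbers later in the deck.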

Sort on HPE Integrity Superdome X
- 8 blades, 16 sockets, 15-core Intel E7-8880 v2 (Ivy Bridge)
- 16 x 15 x 2-way = 480 hw threads
- 24 TB DRAM
  - New machines come with 48 TB
- A simulated 16-SOC machine

Sorting Strategy
[Figure: raw data is generated, sampled, and partitioned (A_, B_, ..., K_, L_). Each SOC reads its partitions into DRAM, sorts them, and writes the sorted output (A, B, ..., K, L).]
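The sample-then-partition strategy in the figure can be sketched as follows. This is a minimal single-threaded illustration, not Sort4TM's code: keys are plain 64-bit integers, all function names are hypothetical, and an ordinary `std::sort` stands in for the per-SOC DRAM sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Keys = std::vector<std::uint64_t>;

// Choose numParts-1 splitters from a sample of the raw data.
Keys chooseSplitters(Keys sample, std::size_t numParts) {
    assert(numParts >= 1 && sample.size() >= numParts);
    std::sort(sample.begin(), sample.end());
    Keys splitters;
    for (std::size_t p = 1; p < numParts; ++p)
        splitters.push_back(sample[p * sample.size() / numParts]);
    return splitters;
}

// Partition index of a key = number of splitters that are <= key.
std::size_t partitionOf(std::uint64_t key, const Keys& splitters) {
    return static_cast<std::size_t>(
        std::upper_bound(splitters.begin(), splitters.end(), key) -
        splitters.begin());
}

// Scatter keys into partitions, sort each partition independently,
// then concatenate. Partitions are disjoint key ranges, so the
// concatenation is globally sorted.
Keys partitionedSort(const Keys& keys, const Keys& splitters) {
    std::vector<Keys> parts(splitters.size() + 1);
    for (std::uint64_t k : keys) parts[partitionOf(k, splitters)].push_back(k);
    Keys out;
    out.reserve(keys.size());
    for (Keys& part : parts) {
        std::sort(part.begin(), part.end());  // stands in for one SOC's sort
        out.insert(out.end(), part.begin(), part.end());
    }
    return out;
}
```

Because each partition covers a disjoint key range, the per-partition sorts need no coordination with one another, which is what lets each SOC work on its own data.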

Pipelined Sort
[Figure: FAM -> Read -> DRAM -> Sort -> Write -> FAM, run as one serial sequence vs. as overlapping Read / Sort / Write stages on successive chunks.]
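One way to picture the overlapped pipeline is the sketch below, which reads chunk i+1 and writes chunk i-1 while chunk i is being sorted. It is an assumption-laden illustration rather than the Sort4TM implementation: FAM is modelled as ordinary in-memory vectors, `pipelinedSort` is a hypothetical name, and `std::async` tasks stand in for dedicated data-mover threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

using Chunk = std::vector<int>;

// Sort every chunk, overlapping Read(i+1) and Write(i-1) with Sort(i).
// 'fam' plays the role of the input in FAM; the result plays the output.
std::vector<Chunk> pipelinedSort(const std::vector<Chunk>& fam) {
    std::vector<Chunk> out(fam.size());
    if (fam.empty()) return out;

    // Read stage for chunk 0 (a copy models the FAM->DRAM transfer).
    std::future<Chunk> nextRead =
        std::async(std::launch::async, [&fam] { return fam[0]; });
    std::future<void> prevWrite;

    for (std::size_t i = 0; i < fam.size(); ++i) {
        Chunk cur = nextRead.get();
        if (i + 1 < fam.size()) {
            const std::size_t next = i + 1;
            nextRead = std::async(std::launch::async,
                                  [&fam, next] { return fam[next]; });
        }
        std::sort(cur.begin(), cur.end());  // Sort stage for chunk i

        // Keep at most one write in flight, then start Write(i).
        if (prevWrite.valid()) prevWrite.get();
        prevWrite = std::async(
            std::launch::async,
            [&out, i, c = std::move(cur)]() mutable { out[i] = std::move(c); });
    }
    assert(prevWrite.valid());
    prevWrite.get();
    return out;
}
```

The sketch only shows the intended overlap of stages; it says nothing about whether the overlap pays off on real hardware, which is the question the later slides investigate.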

Which Local Sorting Algorithm?
Options:
- Parallel quick sort
  - For 1B keys: T1/Tinf = 30
  - Can't scale beyond 30 cores
- Parallel sample sort
  - Great, except for TLB capacity
- Parallel merge sort
  - For 1B keys: T1/Tinf = millions
  - Temporary buffer allocation and deallocation can cause memory management issues
- Bitonic sorting
  - Great scaling but poor absolute performance
We use GNU parallel sort
- A spaghetti code of parallel merge sort with some adaptive algorithms
- Best absolute performance on the machine we tested

Sort4TM v0.1 on Superdome X
- Achieved 1.15 MB/sec/hw-thread for a 10 TB sort
- Problem: idleness. Sort threads sit idle, waiting for work in the runtime library
- Cause: a solo DRAM-to-FAM data mover thread delays the other sort worker threads
- Solution: dedicate more data movers to balance the load
[Figure: timeline trace of Sort4TM via HPCToolkit (threads vs. time) showing FAM->DRAM read, DRAM sort, DRAM->FAM write, and idle phases, first with a single data mover and then with multiple data movers.]

Sort4TM v0.2 on Superdome X
- Multiple writers: achieved 100 GB in ~9 seconds
- 22 MB/sec/hw-thread
- Load imbalance across SOCs
[Figure: timeline trace of Sort4TM via HPCToolkit]

Sort4TM v0.3 on Superdome X
- Now the CPUs are at least working, but quite inefficiently!
  - 40% of the time in memcmp (comparing two keys)
  - Locks were used inefficiently (15% of the time in lock waiting)
- Solution:
  - Replaced byte-by-byte memcmp with a fast comparator: 30% speedup
  - Replaced naive library locks (P2P) with user-land barriers (collective): 10% speedup
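The slides do not show the fast comparator itself. One common approach for fixed-size keys, sketched here as an assumption, is to compare the 10-byte key as one 64-bit and one 16-bit big-endian integer instead of calling a general-purpose `memcmp`. The names `loadBE64`, `loadBE16`, and `compareKey10` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kKeyBytes = 10;  // key size from the Sort Benchmark slide

// Load 8 bytes as a big-endian integer, so that unsigned integer order
// equals lexicographic byte order (the order memcmp defines).
inline std::uint64_t loadBE64(const unsigned char* p) {
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i) v = (v << 8) | p[i];
    return v;
}

// Same idea for the remaining 2 bytes of the key.
inline std::uint32_t loadBE16(const unsigned char* p) {
    return (static_cast<std::uint32_t>(p[0]) << 8) | p[1];
}

// Returns a negative, zero, or positive value with the same sign as
// memcmp(a, b, kKeyBytes). Both pointers must reference at least 10 bytes.
inline int compareKey10(const void* a, const void* b) {
    assert(a != nullptr && b != nullptr);
    const auto* x = static_cast<const unsigned char*>(a);
    const auto* y = static_cast<const unsigned char*>(b);
    const std::uint64_t hx = loadBE64(x), hy = loadBE64(y);
    if (hx != hy) return hx < hy ? -1 : 1;
    const std::uint32_t lx = loadBE16(x + 8), ly = loadBE16(y + 8);
    return (lx > ly) - (lx < ly);
}
```

Fixing the key length at compile time removes the per-call length handling of a general `memcmp` and lets the compiler inline the comparison into the sort's inner loop. Whether this reproduces the 30% speedup reported on the slide depends on the compiler and the data.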

Redundancies in Large Code Bases
- Redundancies arise and go unnoticed in large code bases
- Sort4TM is a 20K code base!
- An example redundancy: dead initialization of a large buffer array

    class KEY {
    public:
        KEY() {
            memset( );  // arguments elided on the slide
            memset( );
            memset( );
        }
    };

    for (int i = 0; i < N; i++) {
        KEY* dramBuffer = new KEY[numItems];   // constructor runs for every element
        Copy(dramBuffer, famItems, numItems);  // and is immediately overwritten
        // work on items
        delete[] dramBuffer;
    }

Tool to Pinpoint Wasteful Memory Accesses
- The redundancy: each KEY constructor initializes memory that Copy() overwrites before it is ever read

    for (int i = 0; i < N; i++) {
        KEY* dramBuffer = new KEY[numItems];   // redundant: runs KEY() per element
        Copy(dramBuffer, famItems, numItems);
        // work on items
        delete[] dramBuffer;
    }

    class KEY {
    public:
        KEY() {
            memset( );  // arguments elided on the slide
            memset( );
            memset( );
        }
    };

- DeadSpy: a tool to pinpoint program inefficiencies [Chabbi and Mellor-Crummey, CGO '12]

Eliminating Redundancy
- Fix: allocate the buffer once, sized for the largest iteration, and reuse it

    KEY* dramBuffer = new KEY[maxItems];       // hoisted out of the loop
    for (int i = 0; i < N; i++) {
        Copy(dramBuffer, famItems, numItems);
        // work on items
    }
    delete[] dramBuffer;
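The effect of hoisting the allocation can be made visible by counting constructor calls. The sketch below uses a hypothetical `Key` type and a global counter purely for illustration. It is not the Sort4TM `KEY` class, whose `memset` arguments are not shown on the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the slide's KEY class. The counter exists
// only so the example can report how often the constructor runs.
static std::size_t g_ctorCalls = 0;

struct Key {
    unsigned char bytes[10];
    Key() {
        std::memset(bytes, 0, sizeof bytes);  // the "dead" initialization
        ++g_ctorCalls;
    }
};

// Original pattern: a fresh buffer, and numItems constructor calls,
// on every iteration, all overwritten by the copy that follows.
std::size_t ctorCallsPerIterationAlloc(std::size_t iterations,
                                       std::size_t numItems,
                                       const Key* famItems) {
    assert(famItems != nullptr);
    g_ctorCalls = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        Key* dramBuffer = new Key[numItems];
        std::copy(famItems, famItems + numItems, dramBuffer);
        // work on items
        delete[] dramBuffer;
    }
    return g_ctorCalls;
}

// Fixed pattern: one allocation reused across all iterations.
std::size_t ctorCallsHoistedAlloc(std::size_t iterations,
                                  std::size_t numItems,
                                  const Key* famItems) {
    assert(famItems != nullptr);
    g_ctorCalls = 0;
    Key* dramBuffer = new Key[numItems];
    for (std::size_t i = 0; i < iterations; ++i) {
        std::copy(famItems, famItems + numItems, dramBuffer);
        // work on items
    }
    delete[] dramBuffer;
    return g_ctorCalls;
}
```

With 3 iterations over 4 items, the first function runs the constructor 12 times and the second only 4. The hoisted version still pays for one round of initialization, but that cost no longer grows with the number of iterations.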

Ineffective Pipeline Parallelism
[Figure: Read / Sort / Write stage timelines, comparing the expected pipelined overlap with what was observed in reality.]

Investigation
- Is the cache getting polluted?
  - No: data movement is already non-temporal for Read and Sort
- Are we sacrificing concurrency during sorting?
  - No: increasing the number of sorters barely helped beyond a point
- Threw all the tools at our disposal at it
  - VTune, HPCToolkit, perf, internal bandwidth monitoring
- Findings:
  - Low IPC
  - DRAM BW, QPI BW, and ASIC BW are well below saturation
  - Sort phase has high L1 and L2 hit rates
  - Very high branch misprediction
- Inference:
  - Sort phase has heavy data-dependent computation
  - A few DRAM-bound loads are so critical (latency sensitive) that the slightest overlap with a bandwidth-bound workload severely degrades sorting

Semantic Gap Between Tools and Programs
- The programmer has little clue about the causes of losses arising from microarchitectural limitations
- Low-level information is arcane
- No single tool gives a holistic view
- VTune is heavyweight (in addition to being heavy on money)
  - Never managed to run the HPC analysis to completion
  - Memory analysis is at least 3x slower and often ran out of /tmp disk space
- HPCToolkit leaves everything to the user and offers little automatic analysis of the various counters
  - If I already know which arcane counter to profile, perhaps I already know the problem
- Unclear what kind of analysis can immediately point to the root cause
  - Not even clear what the root cause is
- Shouldn't processors have bandwidth provisioning to better support latency-bound accesses?

Workaround for Architectural Limitations
- Leave the sort alone
[Figure: Read / Sort / Write timeline in which the Sort stage is left alone, without other stages overlapping it.]
- Perhaps other sorting algorithms may do better!
  - But we already investigated many!

Resulting Sort4TM
- 10 TB in 193 seconds on 480 hw-threads of DragonHawk
  - Sort rate: 108 MB/sec/hw-thread
  - 93x improvement over the initial 1.15 MB/sec/hw-thread
- 20 TB @ 107 MB/sec/hw-thread
  - Scales well! But will not scale linearly
- FuxiSort from Alibaba: 3.2 MB/sec/hw-thread
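As a check on these figures, the arithmetic can be written out. The sketch assumes decimal units (1 TB = 10^6 MB), which the slides do not state, and the ratios use the rounded 108 MB/sec/hw-thread value from the slide.

```latex
\[
\frac{10\ \text{TB}}{193\ \text{s} \times 480\ \text{hw-threads}}
  = \frac{10 \times 10^{6}\ \text{MB}}{92{,}640\ \text{thread-seconds}}
  \approx 107.9\ \text{MB/sec/hw-thread}
\]
\[
\frac{108}{1.15} \approx 93.9
\qquad\text{and}\qquad
\frac{108}{3.2} \approx 33.8
\]
```

The first ratio corresponds to the 93x improvement quoted on the slide. The second is the per-hardware-thread ratio to the FuxiSort figure, which the slide lists for comparison without stating a ratio.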

Road Ahead
- Performance monitoring capabilities for various data movements:
  - DRAM<->FAM data movement
  - FAM<->FAM
- Tool ecosystem for other architectures
- Tools under an emulation environment
- More applications to exploit a large-memory machine
- Formal verification of parallel algorithms
- Each application poses its own unique challenge when scaled to the next level
  - Each random access in a large hash table causes TLB misses

Where is the problem? CPU or memory subsystem?
Figure credit: http://lobojosden.blogspot.com/2010/02/800-pound-gorilla.html