Exploring Hardware Support For Scaling Irregular Applications on Multi-node Multi-core Architectures


Exploring Hardware Support For Scaling Irregular Applications on Multi-node Multi-core Architectures

MARCO CERIANI, SIMONE SECCHI, ANTONINO TUMEO, ORESTE VILLA, GIANLUCA PALERMO

Politecnico di Milano - DEI, 20133 Milano, Italy. {mceriani,gpalermo}@elet.polimi.it
Università degli Studi di Cagliari - DIEE, 09123 Cagliari, Italy. simone.secchi@diee.unica.it
Pacific Northwest National Laboratory, Richland, WA. antonino.tumeo@pnnl.gov

CARL 2013, December 6, 2013

New generation of irregular HPC applications: Big Science, Bioinformatics, Community Detection, Complex Networks, Semantic Databases, Knowledge Discovery, Language Understanding, Pattern Recognition

Characteristics of Emerging Irregular Applications

- Use pointer- or linked-list-based data structures: graphs, unbalanced trees, unstructured grids
- Fine-grained data accesses
- Very large datasets: far larger than what is currently available on single cluster nodes, and very difficult to partition without generating load imbalance
- Very poor spatial and temporal locality: unpredictable network and memory accesses; memory- and network-bandwidth limited
- Large amounts of parallelism (e.g., each vertex, each edge in the graph), but irregularity in control: if (vertex == x) z; else k

Objective

- We aim at designing a full-system architecture for irregular applications, starting from off-the-shelf cores
- Big datasets imply a multi-node architecture
- We do it by:
  - Introducing custom hardware and software components that optimize the architecture for executing multi-node irregular applications
  - Employing an FPGA prototype to validate the approach

Supporting Irregular Applications

- Fast context switching: tolerates latencies
- Fine-grained global address space: removes partitioning requirements, simplifies code development
- Hardware synchronization: increases performance with synchronization-intensive workloads

Why a prototype?

- Hardware components designed at the register-transfer level:
  - Stronger validation than a simulator
  - Enables capturing primary performance issues
  - Exposes hardware implementation challenges
- Higher speed than a simulation infrastructure:
  - Allows faster iterations between hardware and software
  - Software layer can be co-developed and evaluated with the hardware

Node Architecture Overview

- MicroBlaze processors
  - Connected to private scratchpads
  - All access a shared external DDR3 memory
- Internal interconnection: AXI
- External interconnection: Aurora
- Three custom hardware components:
  - GMAS: Global Memory Access Scheduler
  - GNI: Global Network Interface
  - GSYNC: Global SYNChronization module
- Support for lightweight software multithreading

Programming model

- Global address space: a shared-memory programming model on top of a distributed-memory machine
- The developer allocates and frees memory areas in the global address space using standard memory allocation primitives
- The Application Programming Interface (API) provides:
  - Extended malloc and free primitives that support allocation in both the shared global memory space and the node-local memory space
  - POSIX-like thread management: thread creation, join, yield
  - Synchronization routines: lock, spinning lock, unlock, barrier
- Applications are developed with a Single Program Multiple Data (SPMD) approach: each thread executes the same code on different elements of the dataset (see the sketch below)
- In the current prototype, thread contexts are stored in the private scratchpads and do not migrate: potential load imbalance, but faster context switching
- Alternative approach: storing contexts in the global address space and prefetching them into the scratchpads
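As an illustration, here is a minimal sketch of what an SPMD kernel written against such an API might look like. The slides describe only the API categories, not the actual symbols, so every identifier below (gas_malloc, rt_thread_create, rt_lock, and so on) is a hypothetical placeholder.

    /* Interface sketch, not the prototype's actual API: the slides only name
     * the primitive categories (extended malloc/free, POSIX-like threads,
     * lock/unlock/barrier). All identifiers here are assumed. */
    #include <stddef.h>
    #include <stdint.h>

    void *gas_malloc(size_t bytes);        /* allocate in the global address space */
    void *node_malloc(size_t bytes);       /* allocate in node-local memory */
    void  gas_free(void *p);

    int   rt_thread_create(void (*fn)(void *), void *arg);
    void  rt_thread_join(int tid);
    void  rt_thread_yield(void);

    void  rt_lock(uint64_t gaddr);         /* spinning lock, retried in software */
    void  rt_unlock(uint64_t gaddr);
    void  rt_barrier(void);

    struct task { uint64_t *data; size_t begin, end; };

    /* SPMD worker: every thread runs the same code on its own slice. */
    void worker(void *arg) {
        struct task *t = (struct task *)arg;
        for (size_t i = t->begin; i < t->end; i++) {
            rt_lock((uint64_t)&t->data[i]);   /* fine-grained, per-element lock */
            t->data[i] += 1;                  /* may be a remote reference */
            rt_unlock((uint64_t)&t->data[i]);
        }
        rt_barrier();                         /* wait for all SPMD threads */
    }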

Quad-Board Prototyping Platform

- 4 Xilinx Virtex-6 ML605 boards with Virtex-6 LX240T devices
- Xilinx ISE Embedded Design Suite 13.4
- Prototyped a quad-node system

GMAS

- One per core
- Forwards memory operations from the cores to the memories
- Enables scrambled global address space support
- Hosts Load/Store Queues (LSQs) for long-latency memory operations
- Provides thread IDs to the core
- Provides the interface to the GSYNC

GMAS Operation

- When a core emits a memory operation, the GMAS descrambles it and verifies its destination
- If it is local (local memories, local part of the global address space), it is directly forwarded to the destination memory
- If it is remote:
  - The request is sent to the GNI
  - The information related to the memory operation is saved in the LSQ block, and the pending bit is set
  - A canary value is sent to the core, setting the redo bit
  - An interrupt is triggered, starting a context switch
- When the reply to the remote reference comes back, the pending bit is reset, allowing the source thread to be scheduled again
- When the thread is scheduled, it re-executes the memory operation and the redo bit is reset (see the sketch below)
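Modeled in C, the core-side view of this protocol might look as follows. The helpers issue_load, redo_bit_set, and thread_yield are assumed names used purely for exposition; they are not the prototype's actual interface.

    #include <stdint.h>

    /* Assumed helpers modeling the hardware/ISR behavior described above. */
    extern uint32_t issue_load(uint64_t gaddr);  /* GMAS descrambles and routes */
    extern int      redo_bit_set(void);          /* set when a canary came back */
    extern void     thread_yield(void);          /* context switch via the ISR */

    /* A global load as seen by one software thread: local accesses return the
     * real value immediately; remote accesses return a canary, the thread is
     * switched out while the LSQ pending bit is set, and the load is
     * re-executed once the reply has arrived. */
    uint32_t global_load(uint64_t gaddr) {
        uint32_t value = issue_load(gaddr);
        while (redo_bit_set()) {       /* remote reference still outstanding */
            thread_yield();            /* scheduler skips us until pending clears */
            value = issue_load(gaddr); /* re-execute; redo bit resets on success */
        }
        return value;
    }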

GNI

- One GNI per node
- Interfaces AXI with the network (Aurora)
- Translates the internal network protocol to the external network protocol and vice versa
- Each packet contains a header with the source node and the original AXI transaction (a possible layout is sketched below)
- The destination GNI translates the incoming transaction, executes the memory operation, and sends back the result
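For concreteness, one possible packet layout. The slides only state that a packet carries a header with the source node plus the original AXI transaction; the field names and widths below are illustrative assumptions, not the actual RTL.

    #include <stdint.h>

    /* Assumed GNI packet layout; fields are guesses for exposition. */
    struct gni_header {
        uint16_t src_node;     /* node that issued the request */
        uint16_t flags;        /* e.g., request vs. reply */
    };

    struct axi_txn {
        uint64_t addr;         /* global address of the access */
        uint32_t data;         /* write payload, or returned data in a reply */
        uint8_t  is_write;     /* read or write transaction */
        uint8_t  size;         /* transfer size in bytes */
    };

    struct gni_packet {
        struct gni_header hdr;
        struct axi_txn    txn;
    };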

GSYNC

- One GSYNC per node
- Implements a lock table of configurable size
- Each GSYNC stores the locks for the addresses on its own node
- Direct mapping: multiple addresses share the same lock (aliasing)
- When a core writes to the lock register of the GMAS:
  - A load is sent to the GSYNC addressing the related lock bit
  - The GSYNC handles the load as a bit swap and returns the current value in the slot
  - Locks not taken are retried in software (see the sketch below)
- When a core writes to the unlock register of the GMAS, a store with value 0 is sent to the GSYNC addressing the related lock bit
- Remote GSYNCs are accessed through the GNI as normal remote memory operations
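A minimal sketch of the lock/unlock path, assuming a simple modulo mapping onto the lock table (the real hardware mapping is not shown in the slides) and assumed helper names gsync_swap, gsync_clear, and thread_yield.

    #include <stdint.h>

    #define LOCK_TABLE_ENTRIES 8196u   /* size used in the experimental setup */

    /* Direct mapping: many addresses alias onto the same lock slot. */
    static inline uint32_t lock_slot(uint64_t gaddr) {
        return (uint32_t)(gaddr % LOCK_TABLE_ENTRIES);
    }

    /* Assumed helpers: gsync_swap models the load that the GSYNC interprets
     * as an atomic bit swap; gsync_clear models the store of 0 that releases
     * the lock. */
    extern uint32_t gsync_swap(uint32_t slot);   /* returns the previous bit */
    extern void     gsync_clear(uint32_t slot);
    extern void     thread_yield(void);

    void gas_lock(uint64_t gaddr) {
        uint32_t s = lock_slot(gaddr);
        while (gsync_swap(s) != 0)    /* bit already set: lock held elsewhere */
            thread_yield();           /* retry in software after yielding */
    }

    void gas_unlock(uint64_t gaddr) {
        gsync_clear(lock_slot(gaddr));
    }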

Experimental setup

- 4 nodes; from 1 to 32 MicroBlazes per node; from 1 to 4 threads per MicroBlaze
- 512 MB per node: 32 MB as local memory, the rest exposed in the global address space, for a total of 1920 MB
- Scrambling: 8 bytes; GSYNC lock table: 8196 entries
- Bandwidth: 1.5 Gbps (500 Mbit/s per channel), 1/3 overhead for headers (1 Gbps effective)
- Frequency: 100 MHz
- Delays:
  - Context switch: 232 cycles (41 ISR launch, 65 save context, 20 launch scheduler, 50 load context, 24 interrupt reset, 50 exit ISR)
  - Round trip for a remote memory reference: 403 cycles
- Applications: pointer chasing and Breadth-First Search (BFS); a generic pointer-chasing kernel is sketched below
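To make the benchmark's access pattern concrete, here is a generic pointer-chasing kernel. This is an assumed, simplified version for illustration, not the exact benchmark code used on the prototype.

    #include <stddef.h>
    #include <stdint.h>

    /* Pointer chasing: every hop is a data-dependent load, so the latency of
     * a remote reference cannot be hidden by prefetching -- it can only be
     * tolerated by switching to another thread. */
    struct node {
        struct node *next;     /* may point into any of the 4 nodes' memory */
        uint64_t payload;
    };

    uint64_t chase(struct node *head, size_t hops) {
        uint64_t sum = 0;
        for (size_t i = 0; i < hops && head != NULL; i++) {
            sum += head->payload;  /* each dereference may be a remote access */
            head = head->next;
        }
        return sum;
    }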

Area of the Hardware Components

(Figure: area occupied by the custom components, relative to a Virtex-6 LX240T)

Experimental results - Pointer Chasing

- Bandwidth utilization increases with the number of cores
- Bandwidth utilization also increases with the number of threads; however, the system saturates at 3 threads
- Utilization decreases with 3 and 4 threads at 32 cores with respect to 16 cores, because of higher contention on the internal interconnection

Experimental results - BFS

- 100,000 vertices, 80 neighbors on average, 3,998,706 traversed edges
- Throughput increases with the number of cores; the biggest increase is from 4 to 8 cores
- Increasing the number of threads from 1 to 3 increases performance
- However, with 4 threads performance decreases: increased contention on the GSYNC for the locks (BFS is synchronization intensive)

Conclusions

- Presented the set of hardware and software components that enable efficient execution of irregular applications on a manycore multinode system, starting from off-the-shelf cores:
  - Support for a global address space and long-latency remote memory operations (GMAS)
  - Fine-grained hardware synchronization (GSYNC)
  - Integrated network interface (GNI)
  - Fast software multithreading (with hardware-supported scheduling)
- Introduced an FPGA prototype of the proposed design
- Validated the prototype with two typical irregular kernels
- Scaling in bandwidth utilization and performance when increasing cores and threads

Thank you for your attention! Questions?

antonino.tumeo@pnnl.gov