Exploring Hardware Support For Scaling Irregular Applications on Multi-node Multi-core Architectures
MARCO CERIANI, SIMONE SECCHI, ANTONINO TUMEO, ORESTE VILLA, GIANLUCA PALERMO
Politecnico di Milano - DEI, 20133, Milano, Italy. {mceriani,gpalermo}@elet.polimi.it
Universita degli Studi di Cagliari - DIEE, 09123, Cagliari, Italy. simone.secchi@diee.unica.it
Pacific Northwest National Laboratory, Richland, WA. antonino.tumeo@pnnl.gov
CARL 2013, December 6, 2013
New generation of irregular HPC applications
Big Science, Bioinformatics, Community Detection, Complex Networks, Semantic Databases, Knowledge Discovery, Language Understanding, Pattern Recognition
Characteristics of Emerging Irregular Applications
- Use pointer- or linked-list-based data structures: graphs, unbalanced trees, unstructured grids
- Fine-grained data accesses
- Very large datasets: far more than what is currently available on single cluster nodes; very difficult to partition without generating load imbalance
- Very poor spatial and temporal locality: unpredictable network and memory accesses; memory- and network-bandwidth limited
- Large amounts of parallelism (e.g., one task per vertex or edge of the graph), but irregular control flow, e.g. if (vertex == x) z; else k; (see the sketch below)
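To make the access pattern concrete, here is a minimal C sketch, not taken from the slides, of the kind of pointer-based graph walk these applications perform; the vertex layout and function name are assumptions chosen only for illustration.

    /*
     * Illustrative sketch: every step dereferences a pointer whose target depends
     * on data loaded in the previous step, so addresses cannot be predicted or
     * prefetched, and control flow diverges per vertex.
     */
    #include <stddef.h>

    struct vertex {
        int            key;
        size_t         degree;
        struct vertex **neighbors;   /* adjacency list: pointers scattered in memory */
    };

    /* Follow pointers out to a bounded depth; every access is data dependent. */
    long count_matching(struct vertex *v, int x, int depth)
    {
        if (v == NULL || depth == 0)
            return 0;
        long hits = (v->key == x) ? 1 : 0;          /* irregular control flow */
        for (size_t i = 0; i < v->degree; i++)      /* irregular data accesses */
            hits += count_matching(v->neighbors[i], x, depth - 1);
        return hits;
    }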
Objective
- We aim to design a full-system architecture for irregular applications starting from off-the-shelf cores
- Big datasets imply a multi-node architecture
- We do it by:
  - Introducing custom hardware and software components that optimize the architecture for executing multi-node irregular applications
  - Employing an FPGA prototype to validate the approach
Supporting Irregular Applications
Three mechanisms: fine-grain global address space, fast context switching, hardware synchronization.
- Fast context switching: tolerates latencies
- Fine-grain global address space: removes partitioning requirements and simplifies code development
- Hardware synchronization: increases performance with synchronization-intensive workloads
Why a prototype?
- Hardware components designed at the register-transfer level
- Stronger validation than a simulator:
  - Enables capturing primary performance issues
  - Exposes hardware implementation challenges
- Higher speed than a simulation infrastructure:
  - Allows faster iterations between hardware and software
  - The software layer can be co-developed and evaluated with the hardware
Node Architecture Overview
- MicroBlaze processors, each connected to a private scratchpad
- All cores access a shared external DDR3 memory
- Internal interconnection: AXI; external interconnection: Aurora
- Three custom hardware components:
  - GMAS: Global Memory Access Scheduler
  - GNI: Global Network Interface
  - GSYNC: Global SYNChronization module
- Support for lightweight software multithreading
Programming model
- Global address space: shared-memory programming model on top of a distributed-memory machine
- The developer allocates and frees memory areas in the global address space using standard memory allocation primitives
- The Application Programming Interface (API) provides:
  - Extended malloc and free primitives that support allocation in the shared global memory space and in the node-local memory space
  - POSIX-like thread management: thread creation, join, yield
  - Synchronization routines: lock, spinning lock, unlock, barrier
- Applications are developed with a Single Program Multiple Data (SPMD) approach: each thread executes the same code on different elements of the dataset (a usage sketch follows)
- In the current prototype, thread contexts are stored in the private scratchpads and do not migrate: potential load imbalance, but faster context switching
- Alternative approach: storing contexts in the global address space and prefetching them into the scratchpads
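The slides do not give the exact names of the API routines, so the C sketch below uses invented stand-ins (gmalloc, gfree, gthread_create, gthread_join_all, glock, gunlock, gbarrier) purely to show how an SPMD kernel written against this kind of API would look.

    /* Hypothetical SPMD usage sketch; all g* functions are assumed stand-ins for the
     * prototype's extended malloc/free, POSIX-like thread, and synchronization APIs. */
    #include <stddef.h>

    extern void *gmalloc(size_t bytes);           /* allocation in the global address space */
    extern void  gfree(void *p);
    extern void  gthread_create(void (*fn)(long), long arg);
    extern void  gthread_join_all(void);
    extern void  glock(void *addr);
    extern void  gunlock(void *addr);
    extern void  gbarrier(void);

    #define NTHREADS 128
    #define NITEMS   (64 * 1024)

    static long *histogram;                       /* shared data in the global address space */
    static long *data;

    /* Every thread runs the same code on a different slice of the dataset (SPMD). */
    static void worker(long tid)
    {
        for (long i = tid; i < NITEMS; i += NTHREADS) {
            long bucket = data[i] % 64;
            glock(&histogram[bucket]);            /* fine-grained lock handled by the GSYNC */
            histogram[bucket]++;
            gunlock(&histogram[bucket]);
        }
        gbarrier();                               /* all threads meet here before the join */
    }

    int main(void)
    {
        data      = gmalloc(NITEMS * sizeof(long));
        histogram = gmalloc(64 * sizeof(long));
        for (long i = 0; i < NITEMS; i++)
            data[i] = i;                          /* trivial initialization */
        for (long t = 0; t < NTHREADS; t++)
            gthread_create(worker, t);
        gthread_join_all();
        gfree(histogram);
        gfree(data);
        return 0;
    }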
Quad-Board Prototyping Platform
- 4 Xilinx Virtex-6 ML605 boards (Virtex-6 LX240T devices)
- Xilinx ISE Embedded Design Suite 13.4
- Prototyped a quad-node system
GMAS
- One per core
- Forwards memory operations from the core to the memories
- Enables scrambled global address space support
- Hosts load/store queues (LSQs) for long-latency memory operations
- Provides thread IDs to the core
- Provides the interface to the GSYNC
GMAS Operation
- When a core issues a memory operation, the GMAS descrambles the address and verifies its destination
- If the destination is local (local memories, local portion of the global address space), the operation is forwarded directly to the destination memory
- If the destination is remote:
  - The request is sent to the GNI
  - The information describing the memory operation is saved in the LSQ block and the pending bit is set
  - A canary value is returned to the core and the redo bit is set
  - An interrupt is triggered, starting a context switch
- When the reply to the remote reference comes back, the pending bit is cleared, allowing the source thread to be scheduled again
- When the thread is rescheduled, it re-executes the memory operation and the redo bit is cleared
(A simplified sketch of this flow follows.)
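The following C-style sketch is one interpretation of the remote-load flow described above, not the actual RTL; the types, field widths, and helper names (descramble, is_local, local_read, gni_send_request, raise_context_switch_interrupt) are assumptions made for illustration.

    #include <stdint.h>

    #define CANARY 0xDEADBEEFu                 /* placeholder value returned while the reply is pending */

    struct lsq_entry { uint32_t addr, value; int pending; };

    struct gmas {
        struct lsq_entry lsq[4];               /* one slot per hardware thread context */
        int redo[4];                           /* redo bit per thread */
    };

    /* Hypothetical helpers assumed to exist elsewhere in the model. */
    extern uint32_t descramble(uint32_t a);
    extern int      is_local(uint32_t a);
    extern uint32_t local_read(uint32_t a);
    extern void     gni_send_request(uint32_t a, int tid);
    extern void     raise_context_switch_interrupt(void);

    uint32_t gmas_load(struct gmas *g, uint32_t scrambled_addr, int tid)
    {
        uint32_t addr = descramble(scrambled_addr);
        if (is_local(addr))
            return local_read(addr);           /* local: forwarded directly to the memory */

        if (g->redo[tid] && !g->lsq[tid].pending) {
            g->redo[tid] = 0;                  /* re-execution after the remote reply arrived */
            return g->lsq[tid].value;
        }

        /* First issue of a remote reference: record it in the LSQ, answer with a
         * canary, and trigger a context switch so another thread can run. */
        g->lsq[tid] = (struct lsq_entry){ .addr = addr, .pending = 1 };
        g->redo[tid] = 1;
        gni_send_request(addr, tid);
        raise_context_switch_interrupt();
        return CANARY;
    }

    void gmas_on_reply(struct gmas *g, int tid, uint32_t value)
    {
        g->lsq[tid].value   = value;
        g->lsq[tid].pending = 0;               /* source thread becomes schedulable again */
    }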
GNI
- One GNI per node
- Interfaces AXI with the network (Aurora)
- Translates the internal network protocol to the external network protocol and vice versa
- A packet contains a header with the source node plus the original AXI transaction (an illustrative layout follows)
- The destination GNI translates the incoming transaction, executes the memory operation, and sends back the result
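The slides only state that a packet carries a source-node header and the original AXI transaction; the concrete fields and widths in this C struct are therefore assumptions sketched for illustration.

    #include <stdint.h>

    struct gni_packet {
        uint8_t  src_node;        /* node that issued the memory operation */
        uint8_t  is_write;        /* AXI transaction type */
        uint16_t txn_id;          /* AXI transaction ID, used to match the reply to the waiting request */
        uint64_t global_addr;     /* target address in the global address space */
        uint64_t payload;         /* store data on the way out, load data on the way back */
    };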
GSYNC
- One GSYNC per node
- Implements a lock table of configurable size
- Each GSYNC stores the locks for the addresses hosted on its own node
- Direct mapping: multiple addresses may share the same lock (aliasing)
- When a core writes to the lock register of the GMAS:
  - A load is sent to the GSYNC addressing the related lock bit
  - The GSYNC handles the load as a bit swap and returns the value previously held in the slot
  - Locks that are not acquired are retried in software
- When a core writes to the unlock register of the GMAS, a store of value 0 is sent to the GSYNC addressing the related lock bit
- Remote GSYNCs are accessed through the GNI as normal remote memory operations
(A sketch of the lock/unlock path follows.)
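A minimal C sketch of the software side of this path, under the assumptions that the GMAS lock and unlock registers are memory mapped at made-up addresses and that a hypothetical helper returns the value the GSYNC sent back; the key point is that writing an address to the lock register becomes an atomic bit test-and-set at the owning GSYNC.

    #include <stdint.h>

    /* Hypothetical memory-mapped GMAS registers (addresses invented for illustration). */
    #define GMAS_LOCK_REG   ((volatile uint32_t *)0xC0000000u)
    #define GMAS_UNLOCK_REG ((volatile uint32_t *)0xC0000004u)

    extern uint32_t gmas_lock_result(void);    /* assumed helper: previous lock bit returned by the GSYNC */

    void spin_lock(uint32_t global_addr)
    {
        for (;;) {
            *GMAS_LOCK_REG = global_addr;      /* GMAS sends a load to the owning GSYNC; the
                                                  GSYNC swaps the lock bit to 1 and returns
                                                  the value previously stored in the slot */
            if (gmas_lock_result() == 0)
                return;                        /* lock was free: we now own it */
            /* lock already taken: retry in software */
        }
    }

    void unlock(uint32_t global_addr)
    {
        *GMAS_UNLOCK_REG = global_addr;        /* GMAS sends a store of 0 to the lock bit */
    }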
Experimental setup
- 4 nodes
- From 1 to 32 MicroBlazes per node
- From 1 to 4 threads per MicroBlaze
- 512 MB per node: 32 MB as local memory, the rest exposed in the global address space, for a total of 1920 MB
- Scrambling: 8 bytes; GSYNC lock table: 8196 entries
- Bandwidth: 1.5 Gbps (500 Mbit/s per channel), with 1/3 overhead for headers (1 Gbps effective)
- Frequency: 100 MHz
- Delays:
  - Context switch: 232 cycles (41 ISR launch, 65 save context, 20 launch scheduler, 50 load context, 24 interrupt reset, 50 exit ISR)
  - Round trip for a remote memory reference: 403 cycles
- Applications: pointer chasing, Breadth-First Search (BFS)
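As a cross-check on these figures: with 32 MB kept node-local on each of the 4 nodes, the globally exposed memory is 4 x (512 - 32) MB = 1920 MB, matching the total above; likewise, removing the 1/3 header overhead from the 1.5 Gbps link leaves the quoted 1 Gbps of effective bandwidth.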
Area of the Hardware Components
- Area reported with respect to a Virtex-6 LX240T
Experimental results - Pointer Chasing
- Bandwidth utilization increases with the number of cores
- Bandwidth utilization also increases with the number of threads; however, the system saturates with 3 threads
- Utilization decreases with 3 and 4 threads at 32 cores with respect to 16 cores, because of higher contention on the internal interconnection
Experimental results - BFS
- 100,000 vertices, 80 neighbors on average, 3,998,706 traversed edges
- Throughput increases with the number of cores; the biggest increase is from 4 to 8 cores
- Increasing the number of threads from 1 to 3 increases performance
- However, with 4 threads performance decreases, due to increased contention on the GSYNC for the locks (BFS is synchronization intensive; see the sketch below)
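To illustrate why BFS stresses the GSYNC, here is a minimal C sketch, not the actual benchmark code, of a level-synchronous BFS expansion step: each newly inspected vertex is protected by a fine-grained lock, so lock traffic grows with the number of traversed edges. The CSR graph layout is an assumption, and glock/gunlock are the same hypothetical routines used in the earlier SPMD sketch.

    extern void glock(void *addr);
    extern void gunlock(void *addr);

    struct graph { int *row; int *adj; };        /* assumed CSR-style adjacency */

    /* Expand vertex u at the given depth; frontier is assumed thread-private,
     * so only the per-vertex level updates need locking. */
    void bfs_expand(struct graph *g, int u, int *level, int depth,
                    int *frontier, int *frontier_len)
    {
        for (int e = g->row[u]; e < g->row[u + 1]; e++) {
            int v = g->adj[e];                   /* irregular, possibly remote access */
            glock(&level[v]);                    /* one lock acquisition per inspected vertex */
            if (level[v] < 0) {                  /* not visited yet */
                level[v] = depth;
                frontier[(*frontier_len)++] = v; /* add to the next frontier */
            }
            gunlock(&level[v]);
        }
    }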
Conclusions
- Presented the set of hardware and software components that enable efficient execution of irregular applications on a many-core, multi-node system, starting from off-the-shelf cores:
  - Support for a global address space and long-latency remote memory operations (GMAS)
  - Fine-grained hardware synchronization (GSYNC)
  - Integrated network interface (GNI)
  - Fast software multithreading (with hardware-supported scheduling)
- Introduced an FPGA prototype of the proposed design
- Validated the prototype with two typical irregular kernels
- Bandwidth utilization and performance scale when increasing cores and threads
Thank you for your attention! Questions?
antonino.tumeo@pnnl.gov