Simulation of the Communication Time for a Space-Time Adaptive Processing Algorithm on a Parallel Embedded System


Jack M. West and John K. Antonio
Department of Computer Science, P.O. Box, Texas Tech University, Lubbock, TX
{west, antonio}@ttu.edu

Extended Abstract

The focus of this work is the parallelization and performance improvement of a class of radar signal processing techniques known as space-time adaptive processing (STAP). STAP refers to an extension of adaptive antenna signal processing methods that operate on a set of radar returns gathered from multiple elements of an antenna array over a specified time interval. Because the signal returns are composed of range, pulse, and antenna-element samples, a three-dimensional (3-D) cube naturally represents STAP data. Typical STAP data cubes demand processing rates of giga floating-point operations per second (Gflops). The real-time deadlines imposed on STAP applications restrict processing to parallel computers composed of numerous interconnected compute nodes (CNs). A CN has one or more processors connected to a block of shared memory. Developing a solution to any problem on a parallel system is generally not a trivial task. The overall performance of many parallel systems is highly dependent upon network contention, and both the mapping of data and the scheduling of communications affect the contention a parallel architecture experiences. The primary goals of many applications implemented on parallel architectures are to reduce latency and minimize interprocessor communication (IPC) time while maximizing throughput, and STAP processing environments must meet all of these objectives. In most STAP implementations, there are three phases of computation, one for each dimension of the data cube (i.e., range, pulse, and channel).
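To make the three-phase structure concrete, the toy sketch below assigns each cell of a small data cube to a compute node for two successive phases (each phase keeping a different dimension whole) and counts how many cells change owner between the phases, which is the traffic a redistribution must carry. The cube dimensions, node count, and block mappings are invented for illustration and are not taken from the paper.

```python
# Toy illustration (dimensions, node count, and mappings are assumed):
# block-partition a range x pulse x channel cube over N compute nodes,
# with a different partitioning for each processing phase.
R, P, C = 8, 8, 4   # range, pulse, channel samples
N = 4               # compute nodes

def owner_range_phase(r, p, c):
    """Phase 1 keeps whole range vectors: block-split over the pulse dim."""
    return p * N // P

def owner_pulse_phase(r, p, c):
    """Phase 2 keeps whole pulse vectors: block-split over the range dim."""
    return r * N // R

# The redistribution must move every cell whose owner changes between phases.
moved = sum(owner_range_phase(r, p, c) != owner_pulse_phase(r, p, c)
            for r in range(R) for p in range(P) for c in range(C))
print(moved, R * P * C)   # 192 of 256 cells must cross the network
```

With this toy block mapping, (N-1)/N of the cube crosses the network between phases, which is why minimizing the time of this redistribution matters so much.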
To reduce computational latency, the processing at each phase must be distributed over multiple CNs using a single-program multiple-data (SPMD) approach. Additionally, prior to each processing phase, the data set must be partitioned in a fashion that attempts to distribute the computational load equally over the available CNs. Because each of the three phases processes a different dimension of the data cube, the data must be redistributed to form contiguous vectors of the next dimension prior to the next processing phase. This redistribution of data, or distributed corner-turn, requires IPC. Minimizing the time required for interprocessor communication helps maximize STAP processing efficiency. Driven by the need to solve complex real-time applications, such as STAP algorithms, that require tremendous computational bandwidth, commercial-off-the-shelf

(COTS) embedded high-performance computing systems that emphasize upward scalability have emerged in the parallel processing arena. In a message-passing parallel system, CNs are connected to each other via a common data communication fabric, or interconnection network. For the purposes of discussion and illustration, assume that a crossbar with six bidirectional channels is the building block for the interconnection network. Each of the six input/output channels is bidirectional but may only be driven in one direction at a time. The versatility of the six-port crossbar allows the interconnect to be configured into a number of different network topologies, including two-dimensional (2-D) and three-dimensional (3-D) meshes, 2-D and 3-D rings, grids, and Clos networks. However, the most common configuration is a fat-tree, in which the crossbars are connected in a parent-child fashion. In a fat-tree configuration, which is the configuration assumed in this paper, each crossbar has two parent ports and four child ports. The fat-tree architecture alleviates the communication bottlenecks at high levels of the tree (present in conventional tree architectures) by increasing the number of effective parallel paths between CNs. Unfortunately, the addition of multiple paths between CNs increases the complexity of the communication pattern in applications, such as STAP, that involve data redistribution phases. Additional complexity emerges when each CN is composed of more than one processor, or compute element (CE), within the shared-memory address space of the CN. In a system with one CE per CN, the communication pattern during distributed corner-turn phases is very regular and well understood (i.e., a matrix transpose operation implemented in parallel). However, the overall complexity of both the mapping and scheduling of communications increases in systems where the CNs contain more than one CE, for two reasons. First, the communication pattern can be less regular.
Second, the message sizes are not uniform. Two major challenges of implementing STAP algorithms on embedded high-performance systems are determining the best method for distributing the 3-D data set across CNs (i.e., the mapping strategy) and scheduling the communication prior to each phase of computation. At each of the three phases of processing, data access is either vector-oriented along one data cube dimension or plane-oriented along a combination of two data cube dimensions. During the processing at each phase, the contiguous vectors along the dimension of interest are distributed among the CNs for processing in parallel. Additionally, each CE may be responsible for processing one or more vectors of data during each phase. Before processing of the next phase can take place, the data must be redistributed among the available CNs to form contiguous vectors of the next dimension. Determining the optimal schedule of data transfers during phases of data repartitioning on a parallel system is a formidable task. The combination of these two factors, data mapping and communication scheduling, provides the key motivation for this work. One approach to data set distribution in STAP applications is to partition the data cube into sub-cube bars (see Fig. 1). Each sub-cube bar is composed of partial data samples from two dimensions while preserving one whole dimension of the data cube. After performing the necessary computations on the current whole dimension, the data vectors must be redistributed to form contiguous sub-cube bars of the next dimension to be processed. With a sub-cube bar partitioning scheme, IPC between processing stages is isolated to clusters of CNs rather than spanning the entire system

(i.e., the required data exchanges occur only between CNs in the same logical row or column). To illustrate the impact of mapping, consider the two examples shown in Fig. 2 and Fig. 3. For these two examples, assume that the parallel system is composed of four CNs, each having three CEs, connected via one six-port crossbar (see Fig. 4). The number on each sub-cube bar indicates the processor to which the sub-cube bar is initially distributed for processing. Fig. 2 illustrates a mapping scheme in which the sub-cube bars are raster-numbered along the pulse dimension; in contrast, the sub-cube bars in Fig. 3 are raster-numbered along the channel dimension. As the two examples illustrate, the initial mapping of the data prior to pulse compression affects the number of communications required during the data redistribution phase prior to Doppler filtering. In the case where the data cube is raster-numbered along the pulse dimension, six messages must be transferred through the interconnection network. Under the mapping scheme of Fig. 3, the number of required data transfers increases to twelve, and the total message size grows as well. Even for this small example, the initial mapping of the sub-cube bars greatly affects the communication overhead incurred during phases of data repartitioning. To illustrate the impact of scheduling communications during data repartitioning phases, consider the problem depicted in Fig. 5, which is the same problem as shown in Fig. 2. The left-hand portion of the figure shows the location of the STAP data cube on the given processors after pulse compression. The data cube on the right-hand side of the figure shows the sub-cube bars after repartitioning. The coloring scheme indicates the destination of the data prior to the next processing phase.
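The effect of the initial numbering on redistribution traffic can be sketched generically: group the cells that must move by (source CN, destination CN) pair, with one message per pair. The cross-section grid, the CE-to-CN packing, and the target mapping below are assumptions chosen for illustration, so the counts do not reproduce the numbers in Figs. 2 and 3.

```python
from collections import defaultdict

def corner_turn_traffic(cells, src_owner, dst_owner):
    """Count corner-turn messages: one message per (source CN, dest CN)
    pair that must exchange cells, plus the total cells moved."""
    msgs = defaultdict(int)
    for cell in cells:
        s, d = src_owner(cell), dst_owner(cell)
        if s != d:
            msgs[(s, d)] += 1
    return len(msgs), sum(msgs.values())

# Hypothetical geometry: a 4 x 3 cross-section of sub-cube bars on
# 4 CNs with 3 CEs each (CE k resides on CN k // 3), echoing the text's
# example system; the grid shape and target mapping are assumptions.
cells = [(i, j) for i in range(4) for j in range(3)]
dst = lambda c: c[0]                            # after the turn, CN i holds row i
raster_rows = lambda c: (c[0] * 3 + c[1]) // 3  # raster-number along rows
raster_cols = lambda c: (c[1] * 4 + c[0]) // 3  # raster-number along columns

print(corner_turn_traffic(cells, raster_rows, dst))  # (0, 0): no IPC at all
print(corner_turn_traffic(cells, raster_cols, dst))  # (8, 8): eight messages
```

In this toy instance, one raster numbering happens to align with the post-turn layout and needs no IPC, while the other forces eight inter-CN messages; the same kind of gap is what Figs. 2 and 3 illustrate.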
If any part of a sub-cube bar is colored differently from its current processor in the left-hand data cube, that data must be transferred to the correspondingly colored destination node. In this example, the repartitioning phase involves transferring six data sets through the interconnection network. If the six messages were communicated sequentially (i.e., with no parallel communication) through the network, the completion time (T_c) would be the sum of the lengths of the messages. If two or more messages could be sent through the network concurrently, then the value of T_c would be reduced below that total. How the six messages are scheduled through the interconnection network greatly affects overall performance, even for this small system consisting of only one crossbar. Fig. 6 shows the six messages, labeled A through F, in the outgoing first-in-first-out (FIFO) message queues of the source CNs. Each message's destination is indicated by its color code, and the number in parentheses by each message label represents the relative size of the message. The minimum achievable communication time is determined by the CN with the largest total communication time over all of its outgoing and incoming messages. For this example, the minimum possible communication time is the sum of all outgoing and incoming messages on the CNs having two messages, which equals fourteen message units. The actual communication time, T_c, that results from the given message queue orderings (i.e., the schedule) exceeds this bound. However, changing the ordering of the messages in the outgoing queues yields an optimal schedule of

messages. The message queues in Fig. 7 are identical to those in Fig. 6, except that the positions of messages C and F have been swapped in the outgoing queue. Swapping the order of the messages on the green CN increases the number of messages that can be communicated in parallel; with this new ordering of queued messages, the actual completion time achieves the optimal completion time of fourteen units. The purpose of this example is to illustrate that the order (i.e., the schedule) in which messages are queued for transmission determines how much (if any) concurrent communication can occur. The method used to decompose and map the data onto the CNs also affects the potential for concurrent communication. The current research involves the design and implementation of a network simulator that models the effects of data mapping and communication scheduling on the performance of a STAP algorithm on an embedded high-performance computing platform. The purpose of the simulator is not to solve the data mapping and scheduling problems optimally, but to simulate different data mappings and schedules and report the resulting performance. Thus, the simulator models the effects of how the data is mapped onto the CNs of an embedded parallel system, where each CN may comprise more than one CE, and of how the data transfers are scheduled. The network simulator is designed in an object-oriented paradigm and implemented in Java using Borland's JBuilder Professional. Java was chosen over other programming languages for several reasons. First, Java code is portable, which allows the simulator to run on various platforms regardless of architecture and operating system. Second, Java can be used to create both applications (i.e., programs that execute on a local computer) and applets (i.e., applications designed to be transmitted over the Internet and executed by a Java-compatible web browser).
Third, Java source code is written entirely in an object-oriented paradigm, which is well suited to the simulator's design. Fourth, Java provides built-in support for multithreaded programming. Finally, Java development tools, like Borland's JBuilder, provide a set of tools in the Abstract Window Toolkit (AWT) for visually designing and creating graphical user interfaces (GUIs) for applications or applets. The simulator's functionality is accessible through a friendly GUI. The main user interface lets the user enter the values of the three dimensions of a given STAP data cube and the number of CNs to allocate to processing the data cube using an element-space post-Doppler heuristic and a sub-cube bar partitioning scheme. After providing this problem definition, the user selects an initial mapping from a set of predefined mappings (e.g., raster-numbering along the pulse dimension, raster-numbering along the channel dimension, etc.), a random mapping, or a user-defined custom mapping. Furthermore, the user selects the ordering of the messages in the outgoing queues from a predefined set of scheduling algorithms (e.g., shortest messages first, longest messages first, random, custom, etc.). Given this input, the network simulator simulates the defined problem and produces timing results for both phases of data repartitioning. The level of detail modeled could be described as a medium- to fine-grained simulation of the interconnection network. The simulator assumes the network is circuit switched, and the contention resolution scheme is based on a port-number tie-breaking mechanism

to avoid deadlocks. In addition, the simulator incorporates a novel and efficient method of evaluating messages blocked within the interconnection network and queued messages waiting for transfer. The simulator can be used as a tool for collecting and analyzing how the performance of a system is affected by changing the mapping and scheduling. If mapping and/or scheduling choices prove to have a significant impact on performance, the simulator will serve as a basis for future research on determining optimal mappings and communication schedules.

Acknowledgements

This work was supported by Rome Laboratory under Grant No. F00---00 and by the Defense Advanced Research Projects Agency (DARPA) under Contract No. F00---0.

Fig. 1. A data cube partitioning by sub-cube bars (panels: Pulse Compression Partitioning, Re-partitioning, Doppler Filtering). This method of partitioning was first described in [1].

[1] M. F. Skalabrin and T. H. Einstein, "STAP Processing on a Multicomputer: Distribution of 3-D Data Sets and Processor Allocation for Optimum Interprocessor Communication," Proceedings of the Adaptive Sensor Array Processing (ASAP) Workshop, March.
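The predefined queue-ordering policies mentioned above (shortest messages first, longest messages first, random, custom) amount to reordering each outgoing queue before simulation. A minimal sketch, with policy names paraphrased from the text rather than taken from the simulator's actual API:

```python
import random

def order_queue(msgs, policy, seed=0):
    """Reorder a list of (dest CN, size) messages under one of the
    queue-ordering policies described in the text."""
    if policy == "shortest_first":
        return sorted(msgs, key=lambda m: m[1])
    if policy == "longest_first":
        return sorted(msgs, key=lambda m: m[1], reverse=True)
    if policy == "random":
        rng = random.Random(seed)   # seeded so simulation runs are repeatable
        shuffled = list(msgs)
        rng.shuffle(shuffled)
        return shuffled
    return list(msgs)               # "custom": keep the user's own ordering

q = [(1, 10), (3, 1), (2, 4)]
print(order_queue(q, "shortest_first"))   # [(3, 1), (2, 4), (1, 10)]
print(order_queue(q, "longest_first"))    # [(1, 10), (2, 4), (3, 1)]
```

Each policy produces a different FIFO ordering, and as the scheduling example shows, that ordering alone can change the achieved communication time.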

Fig. 2. Data set re-partitioning prior to Doppler filtering, with raster-numbering along the pulse dimension (required data transfers and total message size shown in the figure).

Fig. 3. Data set re-partitioning prior to Doppler filtering, with raster-numbering along the channel dimension (required data transfers and total message size shown in the figure).

Fig. 4. An example configuration of a four-CN, twelve-CE interconnection network built from a six-port crossbar.

Fig. 5. An example of sub-cube bar re-partitioning prior to Doppler filtering (pulse compression through Doppler filtering; required IPC data transfers and total message size shown in the figure).
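Fig. 4 shows a single crossbar serving four CNs; larger systems stack crossbars into the fat tree described earlier. The sketch below counts crossbars per level for a given CN count. The port counts (four child, two parent) come from the text, but the wiring rule of aggregating all parent links upward is an assumption; a real fat tree of six-port crossbars may be wired differently.

```python
def fat_tree_levels(n_cns, child_ports=4, parent_ports=2):
    """Crossbars needed per level of a fat tree built from six-port
    crossbars (4 child + 2 parent ports), listed bottom level first."""
    levels = []
    links = n_cns                 # links entering the current level from below
    while True:
        switches = -(-links // child_ports)   # ceil(links / child_ports)
        levels.append(switches)
        if switches == 1:
            break                 # a single root crossbar caps the tree
        links = switches * parent_ports       # uplinks feed the next level
    return levels

print(fat_tree_levels(4))    # [1]: one crossbar suffices, as in Fig. 4
print(fat_tree_levels(16))   # [4, 2, 1]: three levels of crossbars
```

The parent links double the paths available between subtrees, which is the mechanism the text credits for relieving root-level bottlenecks.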

Fig. 6. A suboptimal communication scheduling example (outgoing FIFO message queues A through F at the six-port crossbar; the minimum communication time is set by the CN whose total outgoing plus incoming message size is largest, fourteen message cycles, and the schedule shown fails to achieve it).

Fig. 7. An optimal communication scheduling example (the same queues with messages C and F swapped, achieving the minimum communication time of fourteen message cycles).
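The queue-ordering effect illustrated in Figs. 6 and 7 can be reproduced in a toy model of the half-duplex channels described in the text: a message occupies both its source and destination CN ports for its whole duration, queue heads start as soon as both ports are free, and ties go to the lower-numbered source CN. The lower bound is the busiest CN's total incoming-plus-outgoing traffic, as in the text. The three-message instance and its sizes are invented for illustration, not the A-F messages of the figures.

```python
def simulate(queues, n_nodes):
    """FIFO schedule on half-duplex CN ports: the head of each outgoing
    queue starts once both endpoint ports are free (lower source CN id
    wins ties). queues maps source CN -> list of (dest CN, size)."""
    queues = {src: list(q) for src, q in queues.items()}
    busy_until = [0] * n_nodes
    t = 0
    while any(queues.values()):
        for src in sorted(queues):
            if queues[src]:
                dst, size = queues[src][0]
                if busy_until[src] <= t and busy_until[dst] <= t:
                    queues[src].pop(0)
                    busy_until[src] = busy_until[dst] = t + size
        # advance to the next time any port frees up
        t = min((b for b in busy_until if b > t), default=t + 1)
    return max(busy_until)

def lower_bound(queues, n_nodes):
    """Busiest port's total traffic: no schedule can finish sooner."""
    load = [0] * n_nodes
    for src, q in queues.items():
        for dst, size in q:
            load[src] += size
            load[dst] += size
    return max(load)

fifo    = {0: [(3, 10)], 2: [(3, 1), (1, 10)]}   # short message heads CN 2's queue
swapped = {0: [(3, 10)], 2: [(1, 10), (3, 1)]}   # the same two messages swapped
print(lower_bound(fifo, 4), simulate(fifo, 4), simulate(swapped, 4))
# -> 11 21 11
```

In the first ordering, the short message to CN 3 blocks the head of CN 2's queue behind the long transfer already using CN 3's port, so the long message to CN 1 cannot start: head-of-line blocking nearly doubles T_c. Swapping the two queue entries achieves the lower bound, exactly the phenomenon of Figs. 6 and 7.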