ManArray Processor Interconnection Network: An Introduction
Gerald G. Pechanek (1), Stamatis Vassiliadis (2), Nikos Pitsianis (1, 3)

(1) Billions of Operations Per Second (BOPS) Inc., Chapel Hill, NC, USA, gpechanek@bops.com
(2) Delft University of Technology, Department of Electrical Engineering, Delft, The Netherlands, stamatis@einstein.et.tudelft.nl
(3) Duke University, Department of Computer Science, Chapel Hill, NC, USA, nikos@bops.com

Abstract. The present paper introduces the new interconnection network of the BOPS ManArray family of available core products. To form a ManArray network, the processing elements are completely connected within clusters and communicate with members of only two other clusters, thereby reducing signal fan-out and wiring density. With this simple network, single-step communications between a hypercube node and its complement node, single-step transpose operations, and a diameter of 2 are achieved.

1 Introduction

As chip densities continue to improve, demand increases for the low-cost integration of high-performance parallel processing systems. Audio, video, and communication signal processing, for example, are three areas that require very high performance and are in demand for low-cost consumer products. High-performance multi-processor array systems, such as the mesh, torus, and hypercube [1, 2, 3, 4], have some characteristics that, we feel, limit their use for System-On-a-Chip products. Consequently, the ManArray family of processor cores was developed for high-volume commercial applications with improved network connectivity, a simple implementation scheme, and a simple programming model. This paper introduces the ManArray processor platform as a strong contender to be a ubiquitous, high-volume signal processor in commercial applications.

2 Background

In this section we briefly describe some characteristics of conventional networks that we felt were too limiting for the intended products.
A crossbar switch interconnection network is generally known to be expensive to implement, since for N processors it has a cost of $O(N^2)$. Even on single-chip systems with relatively small N, the wiring, fan-out, and logic delay limit its acceptability as a pervasive, scalable approach to single-chip multiprocessing. The torus has limited connectivity between processing elements (PEs), which can cause high communication latency. Although new approaches to arrays have been proposed in [5, 6, 7, 8], the irregularity of their PE combinations causes problems in the implementation and in the generality of connections.

P. Amestoy et al. (Eds.): Euro-Par 99, LNCS 1685, Springer-Verlag Berlin Heidelberg 1999
The hypercube [9, 10] reduces the communication latency from $O(n)$ on an $n \times n$ torus to $O(\log n)$, the distance between two binary complement nodes. But even $O(\log n)$ can represent a high latency on large networks. Reducing the longest path between complement PEs has been deemed difficult and costly. In the next section we present the ManArray network, which alleviates the previously stated concerns by improving the connectivity among the PEs with low implementation expense and low interconnection wiring requirements.

3 ManArray Network

The ManArray network achieves the goals of providing higher connectivity than a mesh, torus, or hypercube network, a simple switch implementation for multiple array sizes, and a simple programming model. First we explain how we create the ManArray organization of PEs.

Consider, by way of example, a 2D 4 x 4 torus and the corresponding embedded 4D hypercube, written as a 4 x 4 table with the hypercube node labels; see Fig. 1A. In Fig. 1A, the $PE_{i,j}$ cluster nodes are labeled in Gray code as $PE_{G(i),G(j)}$, where $G(x)$ is the Gray code of $x$. Along the rows and the columns of this table, the distance between adjacent elements is one. If columns 2, 3, and 4 are rotated one position up, then the distance between corresponding elements of the first and the second column becomes two. Repeating the same rotation with columns 3 and 4, and then with column 4, makes the distance between the elements of a column and the corresponding elements of the adjacent columns two. The resulting 4D ManArray table is shown in Fig. 1B. It is important to note that each row of the table contains a grouping of 4 nodes that includes two pairs of diametrically opposite hypercube nodes. In higher dimensional tori, and thus hypercubes, the grouping of diametrically opposite nodes is achieved by the same rotation along each new dimension except the last one.
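The rotation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the ManArray tools: the function names (`gray`, `torus_table`, `rotate_columns`, `hamming`) are hypothetical, rows and columns are 0-indexed, and the three successive rotations are read as "column c ends up rotated upward by c positions". The sketch checks the two properties claimed in the text: a distance of two between corresponding elements of adjacent columns, and two complement pairs in every row.

```python
def gray(x):
    """Binary-reflected Gray code of x."""
    return x ^ (x >> 1)


def torus_table(n=4, bits=2):
    """n x n table whose entry (i, j) is the hypercube label G(i) G(j)."""
    return [[(gray(i) << bits) | gray(j) for j in range(n)] for i in range(n)]


def rotate_columns(table):
    """Rotate column c upward by c positions (assumed reading of the text)."""
    n = len(table)
    return [[table[(r + c) % n][c] for c in range(n)] for r in range(n)]


def hamming(a, b):
    """Number of differing bits, i.e. the hypercube distance."""
    return bin(a ^ b).count("1")


table_1a = torus_table()             # torus table with hypercube labels
table_1b = rotate_columns(table_1a)  # table after the column rotations

for row in table_1b:
    # Corresponding elements of adjacent columns are at distance two.
    assert all(hamming(row[c], row[c + 1]) == 2 for c in range(3))
    # Each row holds two pairs of complement (diametrically opposite) nodes.
    assert row[0] ^ row[2] == 0b1111 and row[1] ^ row[3] == 0b1111

print([format(v, "04b") for v in table_1b[0]])
# ['0000', '0101', '1111', '1010']
```

Under these assumptions the first row groups nodes 0000 and 1111 together with 0101 and 1010, which is one instance of the "two pairs of diametrically opposite hypercube nodes" per row.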
[Figure 1A: 4 x 4 torus table of PE-i,j entries with hypercube node labels]

[Figure 1B: 4D ManArray table obtained after the column rotations]

Using these groupings, it can be shown that the complexity of the ManArray network is small although its connectivity is high. To demonstrate this, we show how this organization of PEs is interconnected with a simple cluster switch network. A 4x4 ManArray with torus and hypercube node IDs, shown in Fig. 2, consists of four 2x2 clusters. The cluster switch for the upper left-hand 2x2 is shown partitioned into four groups, each consisting of a 4-input and a 3-input multiplexer. Each of these groups is associated with a particular PE, as indicated by the dotted-line arrows. For example, $PE_{0,0}$ is associated with the A group multiplexers a1 and a2. The circled multiplexers are controlled by their associated PE.
The 4x4 ManArray includes connection paths that connect hypercube complements, as shown in Fig. 2. For example, PE 0111 ($PE_{1,2}$) can communicate with PE 1000 ($PE_{3,0}$) as well as with the other members of its cluster. The longest path between hypercube complement PEs, 4 steps for a 4D hypercube, is reduced to 1 step in the ManArray network. The improved connectivity and simplicity of the ManArray network support single-cycle communications and efficient algorithms.

[Figure 2: 4x4 ManArray highlighting cluster switch control]

4 ManArray Network Properties

In this section, we discuss some of the properties of the ManArray network within the problem domain of single-chip parallel processors. For the purposes of this paper we constrain our discussion to network sizes that can be implemented on a single chip. As technology advances, though, the number of PEs in a ManArray processor scales with it, allowing larger array sizes to be developed for future products.

The network diameter is the largest distance between any pair of nodes and captures the worst-case number of steps required for node-to-node communication. The smaller the diameter, the fewer steps are needed to communicate between far-away nodes, so small network diameters are desirable. As the table below shows, the network diameter of a d-dimensional hypercube is d, and with the addition of the complementary node connections it becomes d/2. Note that only the edges connecting complementary nodes are accounted for in the middle column. For this introductory paper, we add the third column labeled "ManArray Network", which indicates the number of edges contained in the structure as well as
the constant network diameter of 2 for current ManArray single-chip implementations.

In this paper, we show by way of example the connectivity of a 4x4 ManArray (Figure 2). Each 4-PE cluster contains 12 uni-directional edges, or 6 bi-directional edges, if the self-connecting edges are excluded. With four clusters, this amounts to 24 bi-directional edges. In ManArray, any PE in a cluster can communicate with any PE in an adjacent cluster. Consequently, there are 16 bi-directional edges between any two adjacent clusters. The ManArray needs only 8 uni-directional connections between the clusters, since that is the maximum number of paths that can be connected between 8 PEs at any one time. By sharing these 8 uni-directional links appropriately with the multiplexers used in the cluster switches, the 16 bi-directional path combinations can be created. The total number of edges in a 4x4 ManArray is 24 + 4*16 = 88, corresponding to d = 4 and k = 2 in the table.

            Hypercube*          Hypercube + complement edges*   ManArray Network
Nodes       $2^d$               $2^d$                           $2^d$
Edges       $d \cdot 2^{d-1}$   $(d+1) \cdot 2^{d-1}$           $2^{2k-1}(4 \cdot 3^{k-1} - 1)$ for $d = 2k$;
                                                                $2^{2k}(8 \cdot 3^{k-1} - 1)$ for $d = 2k+1$
Diameter    $d$                 $d/2$                           2

* A hypercube and a hypercube with complementary edges are proper subgraphs of the ManArray.

The upper bound on k and d depends upon the chosen process technology and the processor cycle time requirements. With the full number of ManArray edges provided, as shown in the third column above, the network diameter is reduced to a constant diameter of 2 for all d, within the design constraints of the process technology.

5 ManArray Processing

Generally speaking, ManArray combines PEs in clusters that also contain a Sequence Processor (SP), uniquely merged into the PE array, and a cluster switch interconnecting the PEs. The SP provides program control, contains the instruction and data address generation units, and dispatches instructions to the processor array.
Each PE contains five execution units (a multiply-accumulate unit, an arithmetic logic unit, a data select unit (DSU), a load unit, and a store unit, supporting various 8/16/32/64-bit packed-data types), a 32x32-bit reconfigurable register file, a VLIW instruction memory unit, and local data memory. The DSU supports shifts, rotates, and single-cycle PE-to-PE communications across the ManArray network. With the indirect VLIW (iVLIW) architecture, the communication operations can be overlapped with the compute operations, thereby providing zero-latency data transfers between PEs. The load and store units provide independent data paths between the local memory in each PE in the array. This allows very high memory bandwidth support for compute-intensive algorithms.
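Returning briefly to the edge counts and diameters tabulated in Section 4, the following Python sketch evaluates the table's formulas and checks the hypercube diameters by breadth-first search for the 4D example. It is an illustrative check only: the function names are hypothetical, the formulas are transcribed from the table, and d >= 2 is assumed for the ManArray edge count.

```python
from collections import deque


def hypercube_edges(d):
    """Edges of a d-dimensional hypercube: d * 2^(d-1)."""
    return d * 2 ** (d - 1)


def hypercube_complement_edges(d):
    """Hypercube edges plus one complement edge per node pair: (d+1) * 2^(d-1)."""
    return (d + 1) * 2 ** (d - 1)


def manarray_edges(d):
    """ManArray edge count from the Section 4 table (assumes d >= 2)."""
    k = d // 2
    if d % 2 == 0:                       # d = 2k
        return 2 ** (2 * k - 1) * (4 * 3 ** (k - 1) - 1)
    return 2 ** (2 * k) * (8 * 3 ** (k - 1) - 1)   # d = 2k + 1


def diameter(d, with_complement):
    """Diameter by BFS from every node of the (optionally augmented) hypercube."""
    mask = (1 << d) - 1
    worst = 0
    for src in range(1 << d):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            nbrs = [v ^ (1 << b) for b in range(d)]   # one-bit neighbours
            if with_complement:
                nbrs.append(v ^ mask)                 # complement node
            for w in nbrs:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        worst = max(worst, max(dist.values()))
    return worst


# 4x4 ManArray example: 24 intra-cluster edges plus 4 * 16 inter-cluster edges.
assert manarray_edges(4) == 24 + 4 * 16 == 88
assert diameter(4, with_complement=False) == 4
assert diameter(4, with_complement=True) == 2
```

For d = 4 the search confirms the diameter of 4 for the plain hypercube and 2 once the complement edges are added, matching the d and d/2 entries of the table.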
6 Conclusions

Using the ManArray network, BOPS has implemented an advanced, scalable family of DSP cores for emerging applications such as broadband communications, digital video, digital audio, imaging, and graphics. The BOPS ManArray (Hardware, Software, and Programming Environment) is the culmination of a thorough examination of DSP requirements, dozens of innovative ground-breaking patents, and hundreds of man-years of development effort. The ManArray elegantly provides three basic levels of parallelism (indirect VLIW, packed-data, and multi-processing), all independent of each other and available to the compiler or programmer on an as-needed basis. These features are combined in a way that allows a 2x2 ManArray processor to produce a radix-4 distributed 256-point FFT in 425 cycles using 32-bit complex numbers (16 bits each for the real and imaginary parts) and an 8x8 IDCT in 34 cycles that meets IEEE standards. Finally, because these emerging markets are primarily System-On-Chip markets, BOPS is providing the ManArray as licensable IP in the form of Cores, SW, and Programming Tools.

REFERENCES

1. R. Cypher and J.L.C. Sanz, "SIMD Architectures and Algorithms for Image Processing and Computer Vision", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 37, No. 12, Dec.
2. K.E. Batcher, "Design of a Massively Parallel Processor", IEEE Transactions on Computers, Vol. C-29, No. 9, Sept.
3. P. N. Swarztrauber, "Multiprocessor FFTs", Parallel Computing 5, Elsevier Science Publishers B. V. (North-Holland).
4. Ian Foster, "Designing and Building Parallel Programs", Addison-Wesley Publishing Company, Inc., 1995.
5. G. G. Pechanek, M. Stojancic, S. Vassiliadis, and C. J. Glossner, "M.F.A.S.T.: A Single Chip Highly Parallel Image Processing Architecture", Proceedings of the IEEE 1995 International Conference on Image Processing, Oct. 1995, Washington, D.C.
6. G.G. Pechanek et al., "A Massively Parallel Diagonal Fold Array Processor", 1993 International Conference on Application Specific Array Processors, Oct. 1993, Venice, Italy.
7. G. G. Pechanek, S. Vassiliadis, and J. G. Delgado-Frias, "Digital Neural Emulators Using Tree Accumulation and Communication Structures", IEEE Transactions on Neural Networks, Vol. 3, No. 6, Nov.
8. G.G. Pechanek et al., "Multiple Fold Clustered Processor Torus Array", Proceedings Fifth NASA Symposium on VLSI Design, Nov. 4-5, 1993, University of New Mexico, Albuquerque, New Mexico.
9. Robert Cypher and Jorge L.C. Sanz, "The SIMD Model of Parallel Computation", Springer-Verlag, New York, 1994.
10. F. Thomas Leighton, "Introduction To Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes", Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1992.
More informationSingle Pass Connected Components Analysis
D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected
More informationRecursive Dual-Net: A New Universal Network for Supercomputers of the Next Generation
Recursive Dual-Net: A New Universal Network for Supercomputers of the Next Generation Yamin Li 1, Shietung Peng 1, and Wanming Chu 2 1 Department of Computer Science Hosei University Tokyo 184-8584 Japan
More informationDynamic Partial Reconfigurable FIR Filter Design
Dynamic Partial Reconfigurable FIR Filter Design Yeong-Jae Oh, Hanho Lee, and Chong-Ho Lee School of Information and Communication Engineering Inha University, Incheon, Korea rokmcno6@gmail.com, {hhlee,
More informationAn Empirical Comparison of Area-Universal and Other Parallel Computing Networks
Loyola University Chicago Loyola ecommons Computer Science: Faculty Publications and Other Works Faculty Publications 9-1996 An Empirical Comparison of Area-Universal and Other Parallel Computing Networks
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationHARDWARE IMPLEMENTATION OF PIPELINE BASED ROUTER DESIGN FOR ON- CHIP NETWORK
DOI: 10.21917/ijct.2012.0092 HARDWARE IMPLEMENTATION OF PIPELINE BASED ROUTER DESIGN FOR ON- CHIP NETWORK U. Saravanakumar 1, R. Rangarajan 2 and K. Rajasekar 3 1,3 Department of Electronics and Communication
More informationEfficient Radix-4 and Radix-8 Butterfly Elements
Efficient Radix4 and Radix8 Butterfly Elements Weidong Li and Lars Wanhammar Electronics Systems, Department of Electrical Engineering Linköping University, SE581 83 Linköping, Sweden Tel.: +46 13 28 {1721,
More informationHigh Performance Datacenter Networks
M & C Morgan & Claypool Publishers High Performance Datacenter Networks Architectures, Algorithms, and Opportunity Dennis Abts John Kim SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE Mark D. Hill, Series
More informationInterconnection Networks
Lecture 18: Interconnection Networks Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Credit: many of these slides were created by Michael Papamichael This lecture is partially
More informationThe Complexity of FFT and Related Butterfly Algorithms on Meshes and Hypermeshes
The Complexity of FFT and Related Butterfly Algorithms on Meshes and Hypermeshes T.H. Szymanski McGill University, Canada Abstract Parallel FFT data-flow graphs based on a Butterfly graph followed by a
More informationOptimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology
Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:
More informationInternational Journal of Engineering and Techniques - Volume 4 Issue 2, April-2018
RESEARCH ARTICLE DESIGN AND ANALYSIS OF RADIX-16 BOOTH PARTIAL PRODUCT GENERATOR FOR 64-BIT BINARY MULTIPLIERS K.Deepthi 1, Dr.T.Lalith Kumar 2 OPEN ACCESS 1 PG Scholar,Dept. Of ECE,Annamacharya Institute
More informationDesign and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA
Design and Implementation of Low Complexity Router for 2D Mesh Topology using FPGA Maheswari Murali * and Seetharaman Gopalakrishnan # * Assistant professor, J. J. College of Engineering and Technology,
More informationHypercubes. (Chapter Nine)
Hypercubes (Chapter Nine) Mesh Shortcomings: Due to its simplicity and regular structure, the mesh is attractive, both theoretically and practically. A problem with the mesh is that movement of data is
More informationVertical-Horizontal Binary Common Sub- Expression Elimination for Reconfigurable Transposed Form FIR Filter
Vertical-Horizontal Binary Common Sub- Expression Elimination for Reconfigurable Transposed Form FIR Filter M. Tirumala 1, Dr. M. Padmaja 2 1 M. Tech in VLSI & ES, Student, 2 Professor, Electronics and
More informationEmbedding Large Complete Binary Trees in Hypercubes with Load Balancing
JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 35, 104 109 (1996) ARTICLE NO. 0073 Embedding Large Complete Binary Trees in Hypercubes with Load Balancing KEMAL EFE Center for Advanced Computer Studies,
More informationDesign methodology for programmable video signal processors. Andrew Wolfe, Wayne Wolf, Santanu Dutta, Jason Fritts
Design methodology for programmable video signal processors Andrew Wolfe, Wayne Wolf, Santanu Dutta, Jason Fritts Princeton University, Department of Electrical Engineering Engineering Quadrangle, Princeton,
More informationLanguage and Compiler Support for Out-of-Core Irregular Applications on Distributed-Memory Multiprocessors
Language and Compiler Support for Out-of-Core Irregular Applications on Distributed-Memory Multiprocessors Peter Brezany 1, Alok Choudhary 2, and Minh Dang 1 1 Institute for Software Technology and Parallel
More informationChap. 2 part 1. CIS*3090 Fall Fall 2016 CIS*3090 Parallel Programming 1
Chap. 2 part 1 CIS*3090 Fall 2016 Fall 2016 CIS*3090 Parallel Programming 1 Provocative question (p30) How much do we need to know about the HW to write good par. prog.? Chap. gives HW background knowledge
More informationRUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch
RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico February 29, 2016 CPD
More informationAbstract. Literature Survey. Introduction. A.Radix-2/8 FFT algorithm for length qx2 m DFTs
Implementation of Split Radix algorithm for length 6 m DFT using VLSI J.Nancy, PG Scholar,PSNA College of Engineering and Technology; S.Bharath,Assistant Professor,PSNA College of Engineering and Technology;J.Wilson,Assistant
More informationSlim Fly: A Cost Effective Low-Diameter Network Topology
TORSTEN HOEFLER, MACIEJ BESTA Slim Fly: A Cost Effective Low-Diameter Network Topology Images belong to their creator! NETWORKS, LIMITS, AND DESIGN SPACE Networks cost 25-30% of a large supercomputer Hard
More informationPerformance Assessment of Wavelength Routing Optical Networks with Regular Degree-Three Topologies of Minimum Diameter
Performance Assessment of Wavelength Routing Optical Networks with Regular Degree-Three Topologies of Minimum Diameter RUI M. F. COELHO 1, JOEL J. P. C. RODRIGUES 2, AND MÁRIO M. FREIRE 2 1 Superior Scholl
More informationCS575 Parallel Processing
CS575 Parallel Processing Lecture three: Interconnection Networks Wim Bohm, CSU Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 license.
More informationDIGITAL SIGNAL PROCESSING AND ITS USAGE
DIGITAL SIGNAL PROCESSING AND ITS USAGE BANOTHU MOHAN RESEARCH SCHOLAR OF OPJS UNIVERSITY ABSTRACT High Performance Computing is not the exclusive domain of computational science. Instead, high computational
More informationSC: Prototypes for Interactive Architecture
SC: Prototypes for Interactive Architecture Henriette Bier 1, Kathleen de Bodt 2, and Jerry Galle 3 1 Delft University of Technology, Berlageweg 1, 2628 CR, Delft, The Netherlands 2 Institute of Architectural
More information