Matrix-Matrix Multiplication Using Systolic Array Architecture in Bluespec


Team SegFault: Chaitanya Peddawad (EEB096), Aman Goel (EEB087), heera B (EEB090)
Oct. 25, 2015

1 Theoretical Background

1.1 Matrix-Matrix Multiplication on Hardware

Computing matrix products is both a central operation in many numerical algorithms and potentially time consuming, making it one of the most well-studied problems in numerical computing. Various algorithms have been devised for computing C = AB, especially for large matrices. Mapping such algorithms to custom or general purpose hardware architecture is always a challenging task. With custom or ASIC hardware, matrix-matrix multiplication can be implemented using the least resources and can be accelerated to a large extent. Mapping the same algorithms onto general purpose hardware, for example a general purpose Xilinx FPGA board, always has inherent trade-offs such as area (power), time (maximum operating clock frequency), latency, hardware utilization efficiency and so on. The realistic way to compare two solutions would be to assign weights to each of these factors and choose a solution among multiple possible Pareto optimal solutions.

1.2 Systolic Array Architecture

Systolic architectures (also referred to as systolic arrays) represent a network of processing elements (PEs) that rhythmically compute and pass data through the system. These PEs regularly pump data in and out such that a regular flow of data is maintained [1], [2]. As a result, systolic systems feature two important properties for VLSI design:

- Modularity: Various functional blocks which make up the larger system have well-defined functions and interfaces. Hence, the concept of modularity enables the parallelisation of the design process.
- Regularity: Hierarchical decomposition of a large system results in not only simple, but also similar blocks, as much as possible.

The systolic array may be used as a coprocessor in combination with a host computer where the data samples received from the host computer pass through the PEs and the final result is returned to the host computer (see Fig. 1).
This operation is analogous to the flow of blood through the heart, thus the name systolic. Typically, all the PEs in a systolic array are uniform and fully pipelined, i.e., all communicating edges among the PEs contain delay elements, and the whole system usually contains only local interconnections [3]. However, some relaxations have been introduced to increase the utility of systolic arrays. These relaxations include use of not only local but also neighbor (near, but not nearest) interconnections, use of data broadcast operations, and use of different PEs in the system, especially at the boundaries. With these relaxations, a family of modular, regular, and efficient data-driven array architectures can be designed for DSP applications, one of which is matrix-matrix multiplication.

CS6230: CAD for VLSI    Systolic Array & Bluespec

[Figure 1: Basic principle of a systolic system. Blocks: Host Processor, The Systolic Array.]

1.3 Systolic Array Design Methodology

We use the systolic architecture design methodology, in which many systolic architectures can be designed for any given regular iterative algorithm using linear mapping or projection techniques. The dependency graph (DG) corresponds to a space representation where no time instance is assigned to any computation. Typically this corresponds to a $t = 0$ plane. The mapping technique transforms a space representation to a space-time representation where each node is mapped to a certain processing element and is scheduled to a certain time instance. The systolic design methodology that we are adopting here maps a 3-dimensional DG to a 1D or 2D systolic architecture. Now we define the basic vectors involved in the systolic array design:

- Projection vector (also called iteration vector), $d = [d_1 \; d_2]^T$: Two nodes that are displaced by $d$ or multiples of $d$ are executed by the same processor.
- Processor space vector, $p^T = [p_1 \; p_2]$: Any node with index $I^T = [i \; j]$ would be executed by processor $p^T I = [p_1 \; p_2] \, [i \; j]^T$.
- Scheduling vector, $s^T = [s_1 \; s_2]$: Any node with index $I$ would be executed at time $s^T I$.
- Hardware Utilization Efficiency, $HUE = 1 / |s^T d|$: This is because two tasks executed by the same processor are spaced $|s^T d|$ time units apart.

These aforementioned vectors must satisfy the feasibility constraints stated below:

- Processor space vector and projection vector must be orthogonal to each other. If points A and B differ by the projection vector, i.e., $I_A - I_B$ is the same as $d$, then they must be executed by the same processor. In other words, $p^T I_A = p^T I_B$. This leads to $p^T (I_A - I_B) = 0 \implies p^T d = 0$.
- If A and B are mapped to the same processor, then they cannot be executed at the same time, i.e., $s^T I_A \neq s^T I_B$, i.e., $s^T d \neq 0$.
- Edge mapping: If an edge $e$ exists in the space representation or DG, then an edge $p^T e$ is introduced in the systolic array with $s^T e$ delays.

Given two matrices A and B, we can denote their product as C = AB, where A, B and C are $n \times n$ matrices.
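The feasibility constraints and the edge-mapping rule above can be checked mechanically. The following is a minimal Python sketch, not part of the original Bluespec design. The function names (`dot`, `check_mapping`, `map_edge`) and the example vectors and edges are hypothetical, chosen only to illustrate the checks on a 2-dimensional index space mapped to a linear array.

```python
def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(x * y for x, y in zip(u, v))


def check_mapping(d, p_rows, s):
    """Check the feasibility constraints for a candidate mapping.

    d      : projection vector
    p_rows : rows of the processor space vector/matrix p^T
             (one row per dimension of the systolic array)
    s      : scheduling vector
    Returns |s^T d|, so that HUE = 1 / |s^T d|.
    """
    for p in p_rows:
        if dot(p, d) != 0:
            raise ValueError("infeasible: p^T d must be 0")
    spacing = abs(dot(s, d))
    if spacing == 0:
        raise ValueError("infeasible: s^T d must be nonzero")
    return spacing


def map_edge(e, p_rows, s):
    """Map a DG edge e to (edge p^T e in the array, number of delays s^T e)."""
    return tuple(dot(p, e) for p in p_rows), dot(s, e)


# Hypothetical example vectors (assumed for illustration only).
d = (1, 0)
p_rows = [(0, 1)]
s = (1, 0)

hue = 1 / check_mapping(d, p_rows, s)

# Hypothetical DG edges, mapped to array edges and delay counts.
edges = {"e1": (1, 0), "e2": (0, 1), "e3": (1, -1)}
mapped = {name: map_edge(e, p_rows, s) for name, e in edges.items()}
```

With these assumed vectors, `hue` evaluates to 1.0, and the edge along `d` maps to a self-loop on the PE with one delay. An edge that maps to zero delays corresponds to a broadcast connection.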
For n = 2, we have

$$\begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}$$

$$\begin{aligned}
c_{11} &= a_{11} b_{11} + a_{12} b_{21} \\
c_{12} &= a_{11} b_{12} + a_{12} b_{22} \\
c_{21} &= a_{21} b_{11} + a_{22} b_{21} \\
c_{22} &= a_{21} b_{12} + a_{22} b_{22}
\end{aligned}$$

These equations can be represented in a space representation as shown in Fig. 2.

[Figure 2: Systolic array architecture of the matrix product computation.]

From the space diagram, we can write the iteration in standard output regular iterative algorithm (RIA) form as follows:

$$\begin{aligned}
a(i, j, k) &= a(i, j-1, k) \\
b(i, j, k) &= b(i-1, j, k) \\
c(i, j, k) &= c(i, j, k-1) + a(i, j, k) \, b(i, j, k)
\end{aligned}$$

With linear mapping, this 3D space representation is mapped onto 2D space to design 2D systolic arrays for matrix-matrix multiplication. With different choices of projection vector ($d$), processor space vector ($p^T$) and scheduling vector ($s^T$) that satisfy the scheduling constraints, we get different edge mappings and hence different systolic array architectures. Some of the possible solutions are derived in Tab. 1. The systolic array architectures Arch 1 and Arch 2, using general processor elements, are drawn in Fig. 3.

[Table 1: Different solutions to systolic array architecture. For each of Arch 1 to Arch 4, the table lists the scheduling vector $s^T$, the processor space vector $p^T$, the projection vector $d$, and the edge mappings $p^T e$ and $s^T e$ for the edges $a(0, 1, 0)$, $b(1, 0, 0)$ and $c(0, 0, 1)$.]
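The RIA recurrences above can be executed directly in software to see how a linear schedule orders the nodes. The sketch below is a Python model, not the Bluespec implementation. It assumes the scheduling vector $s^T = [1 \; 1 \; 1]$ and a mapping of node $(i, j, k)$ to PE $(i, j)$, which is one feasible choice and is not claimed to match a specific column of Tab. 1. The boundary conditions ($a(i, 0, k) = a_{ik}$, $b(0, j, k) = b_{kj}$, $c(i, j, k)$ starting from 0) and the function name `systolic_matmul` are likewise assumptions.

```python
def systolic_matmul(A, B):
    """Evaluate the RIA recurrences node by node in schedule order.

    Assumed mapping: node (i, j, k) runs at time t = i + j + k
    (s^T = [1 1 1]) on PE (i, j), so c stays in place while
    a moves along j and b moves along i.
    Returns (C, steps), where steps is the number of time steps used.
    """
    n = len(A)
    a, b, c = {}, {}, {}
    nodes = [(i, j, k) for i in range(n) for j in range(n) for k in range(n)]
    # Visiting nodes by increasing t guarantees every predecessor is ready.
    nodes.sort(key=lambda node: sum(node))
    for (i, j, k) in nodes:
        # a(i, j, k) = a(i, j-1, k); assumed boundary input a_ik at j = 0
        a[i, j, k] = A[i][k] if j == 0 else a[i, j - 1, k]
        # b(i, j, k) = b(i-1, j, k); assumed boundary input b_kj at i = 0
        b[i, j, k] = B[k][j] if i == 0 else b[i - 1, j, k]
        # c(i, j, k) = c(i, j, k-1) + a(i, j, k) * b(i, j, k)
        prev = 0 if k == 0 else c[i, j, k - 1]
        c[i, j, k] = prev + a[i, j, k] * b[i, j, k]
    C = [[c[i, j, n - 1] for j in range(n)] for i in range(n)]
    steps = max(sum(node) for node in nodes) + 1  # t runs from 0 to 3(n-1)
    return C, steps


C, steps = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Under this assumed schedule the last node $(n-1, n-1, n-1)$ fires at $t = 3(n-1)$, so the whole product takes $3n - 2$ time steps. For the 2 × 2 example this gives 4 steps.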

[Figure 3: Two-dimensional systolic array for matrix-matrix multiplication: (a) Arch 1, (b) Arch 2.]

2 Our Implementation: Design & Evaluation of Different Architectures

We implemented 4 different solutions for matrix-matrix multiplication, ranging from a single processor element to a 2D array of processor elements. The design supports multiple matrix-matrix multiplications in a single stretch, in a continuous fashion. We coded and simulated all four solutions in Bluespec to verify the accuracy of each of the solutions. We further synthesized the solutions to compare the trade-offs each of them faces in hardware, in terms of maximum operating clock frequency (change of critical path), area (hardware utilization) and theoretical throughput. The solutions are stated below:

- Solution 1: Perform matrix-matrix multiplication using a 2D systolic array of processor elements.
- Solution 2: Perform matrix-matrix multiplication using a linear array of processor elements.
- Solution 3: Perform matrix-matrix multiplication using a single processor element.
- Solution 4: Perform matrix-matrix multiplication using a linear bidirectional 2D systolic array of processor elements.

Note that the synthesis results were obtained for 5 × 5 matrices, performing 4 such multiplications one after another in a single run.

2.1 Solution 1

This solution is implemented using a 2D systolic array of processor elements, which is nothing but a systolic array of sequential multiply and accumulate (MAC) units. In one step, $K^2$ MAC units perform $K^2$ multiplications of two numbers $a_{ik}$ and $b_{kj}$, and accumulation if applicable. But since the outputs propagate from one PE to another, the number of steps required for the multiplication of two K × K matrices to complete is $3K - 2$. The architecture is shown in Fig. 4.

Synthesis Results

Slice logic utilization:
- Number of slice registers: 508/126800 = 4%
- Number of slice LUTs: 5632/63400 = 8%
- Number used as logic: 5632/63400 = 8%

Minimum period: 13.008 ns (Maximum frequency: 76.876 MHz).

Critical path: From matnum 6 (FF) to out 66 (FF); Delay = 13.008 ns; Levels of logic = 23

[Figure 4: Architecture for Solution 1 (K = 3): (a) Solution 1, (b) operation of the MAC unit, $c_n = c_{n-1} + a \cdot b$.]

2.2 Solution 2

This solution is implemented using a linear array of K processor elements, which is nothing but an array of sequential multiply and accumulate (MAC) units. In one step, K MAC units perform K multiplications of two numbers $a_{ik}$ and $b_{kj}$, and accumulation if applicable. Hence, for multiplication of two K × K matrices, the number of steps required is $K^2$. The architecture is shown in Fig. 5.

[Figure 5: Architecture for Solution 2 (K = 3).]

Synthesis Results

Slice logic utilization:
- Number of slice registers: 3689/126800 = 2%
- Number of slice LUTs: 16986/63400 = 26%
- Number used as logic: 16986/63400 = 26%

Minimum period: 12.008 ns (Maximum frequency: 83.278 MHz).

Critical path: From matnum 6 (FF) to out 0 (FF); Delay = 12.008 ns; Levels of logic = 5
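The $K^2$ step count of Solution 2 can be illustrated with a small software model. This is a Python sketch rather than the Bluespec design. It assumes that MAC unit $j$ accumulates column $j$ of the current output row while $a_{ik}$ is shared by all units in a step. The report does not state which loop index is spread across the MAC units, so that choice and the name `linear_array_matmul` are assumptions.

```python
def linear_array_matmul(A, B):
    """Model a linear array of K MAC units multiplying two K x K matrices.

    Assumption: in each step, MAC unit j performs
    C[i][j] += A[i][k] * B[k][j] for the current (i, k) pair,
    so all K units work in parallel within one step.
    Returns (C, steps).
    """
    K = len(A)
    C = [[0] * K for _ in range(K)]
    steps = 0
    for i in range(K):
        for k in range(K):
            # One step: the inner loop stands for K MAC units firing together.
            for j in range(K):
                C[i][j] += A[i][k] * B[k][j]
            steps += 1
    return C, steps


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
C, steps = linear_array_matmul(A, identity)
```

The model performs $K^3$ multiplications in total, K per step, which is where the $K^2$ steps come from. For K = 3 the step counter ends at 9.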

2.3 Solution 3

This solution is implemented using only one processor element, which is nothing but a sequential multiply and accumulate (MAC) unit. In one step, the MAC unit performs a single multiplication of two numbers $a_{ik}$ and $b_{kj}$, and accumulation if applicable. Hence, for multiplication of two K × K matrices, the number of steps required is $K^3$. The architecture is shown in Fig. 6.

[Figure 6: Architecture for Solution 3.]

Synthesis Results

Slice logic utilization:
- Number of slice registers: 3333/126800 = 2%
- Number of slice LUTs: 4798/63400 = 7%
- Number used as logic: 4798/63400 = 7%

Minimum period: 10.938 ns (Maximum frequency: 91.424 MHz).

Critical path: From matnum 6 (FF) to out 0 (FF); Delay = 10.938 ns; Levels of logic = 0

2.4 Solution 4

This solution is implemented using K Bidirectional Linear Systolic Arrays (BLSA) of combinational PEs, each of which takes 3 inputs a, b and c and produces the output $c + ab$, as described in [4]. Note that combinational PEs are used here so that the implementation and performance can be compared against the sequential counterpart used in the first 3 solutions. It can be shown that for the multiplication of two K × K matrices to complete, the number of steps required is $3K - 2$. Note that the K columns of the output matrix are computed simultaneously, with one linear array computing one column each. Hence K BLSA structures are repeated regularly and work independently of each other. The architecture is shown in Fig. 7.

[Figure 7: Architecture for Solution 4 (K = 3). Each PE computes $c_{out} = c_{in} + a \cdot b$.]
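The column-wise decomposition used by Solution 4 can be sketched as follows. This Python model only captures the arithmetic: one chain of combinational PEs per output column, each PE computing $c_{out} = c_{in} + a \cdot b$. It does not model the bidirectional data movement, the input skew or the $3K - 2$ step timing of the arrays in [4]. The names `pe`, `blsa_column` and `blsa_matmul` are hypothetical.

```python
def pe(c_in, a, b):
    """Combinational PE: no internal state, output is c_in + a * b."""
    return c_in + a * b


def blsa_column(A, b_col):
    """Compute one column of C = A B with a chain of K PEs (arithmetic only)."""
    K = len(A)
    column = []
    for i in range(K):
        c = 0
        for k in range(K):
            # The partial sum passes through PE k, which adds a_ik * b_kj.
            c = pe(c, A[i][k], b_col[k])
        column.append(c)
    return column


def blsa_matmul(A, B):
    """Use K independent linear arrays, one per column of the output."""
    K = len(A)
    columns = [blsa_column(A, [B[k][j] for k in range(K)]) for j in range(K)]
    # Transpose the list of columns back into row-major form.
    return [[columns[j][i] for j in range(K)] for i in range(K)]


C = blsa_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Because the K column computations share no state, they can run side by side, which matches the statement that the K BLSA structures work independently of each other.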

Synthesis Results

Slice logic utilization:
- Number of slice registers: 1207/126800 = 0.95%
- Number of slice LUTs: 647/63400 = 1%
- Number used as logic: 647/63400 = 1%

Minimum period: 9.374 ns (Maximum frequency: 106.657 MHz).

Critical path: From mas 0 /Mmult ans a MUL ans d (DSP) to reg 0 3 (FF); Delay = 9.374 ns; Levels of logic = 8

2.5 Pareto Curve: Face-off Between Solutions

| Architecture | Slice Reg. Utilization | Logic LUTs Utilization | Clock Frequency | Steps (throughput = 1 / steps) |
|---|---|---|---|---|
| Solution 1 | 4% | 8% | 76.87 MHz | $3K - 2$ |
| Solution 2 | 2% | 26% | 83.27 MHz | $K^2$ |
| Solution 3 | 2% | 7% | 91.42 MHz | $K^3$ |
| Solution 4 | 0.95% | 1% | 106.67 MHz | $3K - 2$ |

Table 2: Trade-offs in different solutions (K = 5)

The trade-offs are clear from Tab. 2. Note that Solution 1 uses custom sequential MAC units while Solution 4 uses default combinational processor elements, hence the significant difference in hardware utilization and clock period. Solutions 1-3 use sequential PEs and are therefore included in the Pareto curve, where we evaluate the solutions based on the trade-off between hardware utilization and throughput, or between clock frequency and throughput. Either way, all three are Pareto optimal. The Pareto curve is drawn in Fig. 8.

[Figure 8: Pareto curve (not to scale). Vertical axis: clock frequency (MHz). Horizontal axis: throughput (1 / no. of steps) for K = 5. Points: Solution 3 at (0.008, 91.4), Solution 2 at (0.04, 83.2), Solution 1 at (0.077, 76.8).]

3 Conclusion

We have explored four different solutions to implement matrix-matrix multiplication. The choice of a solution for a particular application depends on several factors, such as the dimensions of the matrices and the constraints on throughput, resource utilization and makespan demanded by the application. In general, for smaller matrices with a stringent hardware utilization constraint, Solution 3 performs well. On the other hand, to get better throughput at the cost of extra hardware, Solution 1 is a better choice than Solution 3. Solution 2 takes a middle ground between Solution 1 and Solution 3. However, we can see that the best results are obtained when the processing element is implemented using combinational logic, as in Solution 4. Similar results can also be obtained from Solution 1 if the basic PE is combinational. While Solution 4 demands more complicated scheduling of the inputs, Solution 1 requires only a simple sequencing of inputs and can thus lead to interesting results if its basic PE is implemented using combinational logic.

References

[1] H. T. Kung and C. E. Leiserson, "Systolic arrays (for VLSI)," Sparse Matrix Symposium, SIAM, pp. 256-282, 1978.
[2] H. T. Kung, "Why systolic architectures?" IEEE Computers Magazine, vol. 15, pp. 37-45, Jan. 1982.
[3] S. Y. Kung, VLSI Array Processors, Prentice Hall, 1988.
[4] E. I. Milovanovic et al., "Matrix Multiplication on Linear Bidirectional Systolic Arrays," Ser. A: Appl. Math. Inform. and Mech., vol. 2, no. 1, pp. 11-20, 2010.