Iterative Refinement on FPGAs
1 Iterative Refinement on FPGAs
JunKyu Lee
Tennessee Advanced Computing Laboratory, University of Tennessee
July 19th, 2011
This work was partially supported by the National Science Foundation, grant NSF CHE
2 Floating Point Performance
Processors (CPU/GPU): fast, customized static ALUs
FPGAs:
- Slower clock, parallel, application-specific ALUs
- Pipelining
- Precision
1) CPUs/GPUs give good performance for single and double precision
2) Can we exploit FPGA flexibility?
- Arbitrary precision
3 Benefits from Lower Precision ALUs on FPGAs
ALU precision, lower vs. higher: smaller ALUs vs. larger ALUs
Smaller ALUs:
- Shorter wires, shorter pipeline: clock rate
- Number of ALUs in fixed area: parallelism
SPEED UP!!
OK, let us explore precisions on FPGAs. Which applications?
Dense linear system solvers: iterative refinement on FPGAs
Provide high performance according to a prescribed accuracy
4 Dense Linear System Solvers with Arbitrary Accuracy - extended Mixed Precision Iterative Refinement (XMIR)
1. Iterative Refinement Algorithm
2. Implementation of XMIR on FPGAs
3. Performance Comparison with GPGPUs (Xilinx XC6VSX475T vs NVIDIA GTX480)
4. Conclusions
5 Iterative Refinement
Step 1: LU (GEPP, partial pivoting), $LU x^{(1)} = b$; $O(n^3)$, $P_L$
i = 0
Repeat
  i = i + 1
  Step 2: $r^{(i)} = b - A x^{(i)}$; $O(n^2)$, $P_H$
  Step 3: $LU z^{(i)} = P r^{(i)}$; $O(n^2)$, $P_L$
  Step 4: $x^{(i+1)} = x^{(i)} + z^{(i)}$; $O(n)$, $P_H$
Until $\|z^{(i)}\|_2 / \|x^{(i+1)}\|_2 \le \varepsilon$
Steps 2 to 4 (the refinement loop): computationally inexpensive
Note)
- $x^{(1)}$: a zero vector ($n \times 1$)
- $A$: a square matrix ($n \times n$)
- GEPP: Gaussian Elimination with Partial Pivoting
- $P$: permutation matrix from GEPP
- $P_L$: lower precision
- $P_H$: higher precision
- $r$: residual vector
- $x$: solution
- $b$: right side vector in the system equation ($Ax = b$)
- $\varepsilon$: prescribed accuracy
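The four steps on this slide can be sketched in software. The following is a minimal illustration, not the FPGA design from the talk: it assumes Python's standard `decimal` module as a stand-in for arbitrary-precision ALUs (so precision is counted in decimal digits rather than mantissa bits), starts from the zero vector as the slide's note states, and uses a relative-correction stopping test. The function names and the default digit counts are hypothetical.

```python
from decimal import Context, Decimal


def lu_factor(A, ctx):
    """Step 1: GEPP in the arithmetic of ctx. Returns packed LU and the row order."""
    n = len(A)
    LU = [[ctx.create_decimal(v) for v in row] for row in A]  # round A to ctx precision
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: ctx.abs(LU[i][k]))  # partial pivoting
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = ctx.divide(LU[i][k], LU[k][k])  # multiplier, stored in L
            for j in range(k + 1, n):
                LU[i][j] = ctx.subtract(LU[i][j], ctx.multiply(LU[i][k], LU[k][j]))
    return LU, perm


def lu_solve(LU, perm, r, ctx):
    """Step 3: solve L U z = P r by forward and back substitution in ctx."""
    n = len(LU)
    y = [ctx.create_decimal(r[p]) for p in perm]  # apply P and round r
    for i in range(n):  # forward substitution, L has a unit diagonal
        for j in range(i):
            y[i] = ctx.subtract(y[i], ctx.multiply(LU[i][j], y[j]))
    for i in reversed(range(n)):  # back substitution with U
        for j in range(i + 1, n):
            y[i] = ctx.subtract(y[i], ctx.multiply(LU[i][j], y[j]))
        y[i] = ctx.divide(y[i], LU[i][i])
    return y


def norm2(v, ctx):
    total = Decimal(0)
    for vi in v:
        total = ctx.add(total, ctx.multiply(vi, vi))
    return ctx.sqrt(total)


def iterative_refinement(A, b, low_digits=7, high_digits=40,
                         eps=Decimal("1e-30"), max_iter=50):
    """Refine in high precision while reusing one low-precision LU factorization."""
    lo, hi = Context(prec=low_digits), Context(prec=high_digits)
    n = len(A)
    A = [[Decimal(v) for v in row] for row in A]
    b = [Decimal(v) for v in b]
    LU, perm = lu_factor(A, lo)  # Step 1: O(n^3), lower precision
    x = [Decimal(0)] * n  # x^(1) is the zero vector
    for it in range(1, max_iter + 1):
        r = []  # Step 2: r = b - A x, O(n^2), higher precision
        for i in range(n):
            s = b[i]
            for j in range(n):
                s = hi.subtract(s, hi.multiply(A[i][j], x[j]))
            r.append(s)
        z = lu_solve(LU, perm, r, lo)  # Step 3: O(n^2), lower precision
        x = [hi.add(xi, zi) for xi, zi in zip(x, z)]  # Step 4: O(n), higher precision
        if norm2(z, hi) <= hi.multiply(eps, norm2(x, hi)):
            return x, it
    raise RuntimeError("refinement did not converge")


x, iters = iterative_refinement([[4, -2, 1], [3, 6, -4], [2, 1, 8]], [3, 3, 28])
print(iters, [round(xi, 20) for xi in x])
```

The factorization is computed once at 7 digits, yet the refined solution approaches the 40-digit working precision because each pass only needs the low-precision solve to reduce the error by a constant factor.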
6 Direct Method
Matrix decomposition, LU decomposition with Partial Pivoting (LUPP): $A = LU$, $\tfrac{2}{3} n^3$ ops
System to solve: $L U x = b$
Forward substitution: $L y = b$, $n^2$ ops
Back substitution: $U x = y$, $n^2$ ops
Success condition for iterative refinement: $\|x - x^*\| / \|x^*\| = q$ ($q < 1$ for IR), $q = \phi(n)\,\kappa(A)\,\epsilon_0$ (LUPP)
7 MPIR / XMIR
MPIR: single precision for LUPP; double precision for LUPP; double precision refinement; Converge?; DONE
XMIR: arbitrary precision for LUPP; arbitrary precision for LUPP; arbitrary precision refinement; Converge?; DONE
8 Impact of XMIR
Assume the time cost model $T(\epsilon) = \alpha m^{\beta}$.
Performance comparison: let $\gamma = m_L / m_H$.
MPIR: $T = T(\epsilon_D)\left(\tfrac{2}{3} n^3 \gamma^{\beta} + 2 n^2 m\,(1 + \gamma^{\beta})\right)$
XMIR: $T = T(\epsilon_{AH})\left(\tfrac{2}{3} n^3 \gamma^{\beta} + 2 n^2 m\,(1 + \gamma^{\beta})\right)$
If $T(A_L) \ll T(A_H)$
Achievable accuracy: XMIR: arbitrarily high; MPIR: double
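To see how the cost model on this slide behaves, it can be evaluated numerically. The sketch below is an illustration under stated assumptions: the factor written m in the $2 n^2 m$ term is read as the number of refinement iterations (named `iters` here), the values of alpha and beta are made up, and the all-high-precision direct solve used as a baseline is a hypothetical comparison point, not a formula from the slide.

```python
def op_time(mantissa_bits, alpha=1.0, beta=2.0):
    """Cost model T = alpha * m**beta for one operation at a given mantissa width."""
    return alpha * mantissa_bits ** beta


def refinement_time(n, m_low, m_high, iters, alpha=1.0, beta=2.0):
    """T = T_high * (2/3 n^3 gamma^beta + 2 n^2 iters (1 + gamma^beta))."""
    gamma = m_low / m_high
    t_high = op_time(m_high, alpha, beta)
    factorization = (2.0 / 3.0) * n ** 3 * gamma ** beta  # LUPP at low precision
    refinement = 2.0 * n ** 2 * iters * (1.0 + gamma ** beta)  # residual + solves
    return t_high * (factorization + refinement)


def direct_time(n, m_high, alpha=1.0, beta=2.0):
    """Hypothetical baseline: LUPP and both substitutions entirely at high precision."""
    return op_time(m_high, alpha, beta) * ((2.0 / 3.0) * n ** 3 + 2.0 * n ** 2)


# Assumed example: 23-bit factorization, 52-bit refinement, 3 iterations.
ratio = direct_time(1000, 52) / refinement_time(1000, 23, 52, iters=3)
print(f"modelled speedup over an all-high-precision solve: {ratio:.2f}x")
```

Because the $n^3$ term is scaled by $\gamma^{\beta}$, the modelled gain grows as the factorization precision drops, provided the iteration count stays small.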
9 Dense Linear System Solvers with Flexible Precisions
- Direct Method: 1 precision
- Wilkinson's Iterative Refinement (WIR): 2 precisions (Original/Higher); accuracy: original precision
- Mixed Precision IR (MPIR): 2 precisions (S/D(O)); accuracy: double
- Arbitrary Initial Precision MPIR (AMIR): 2 precisions (A/D(O)); accuracy: double
- extended MPIR (XMIR): 2 precisions (A/A(O)); accuracy: arbitrary
10 Implementation (Xilinx ISE/VHDL)
Step 2. Residual calculation: $r = b - Ax$
[Block diagram: Microblaze; BRAM 0 status and BRAM 1 status; matrix-side BRAMs (BRAM 0, BRAM 1) and vector-side BRAMs (BRAM 0, BRAM 1); PE with multiplier, adder, and partial sum register; register file loading b; sign select "MSB <= not(msb)"]
Execution time: $T = (n / \#\mathrm{PEs})\,(n + k + l + r) / f \approx n^2 / (\#\mathrm{PEs} \cdot f)$
2 ops per PE per clock cycle; implementation block size = 1,024
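The execution-time expression on this slide can be turned into a small throughput estimate. This is a sketch only: the slide does not define k, l, and r, so they are treated here as generic overhead cycles, and the PE count and clock frequency in the example are hypothetical values rather than measured ones.

```python
def residual_time(n, num_pes, f_hz, k=0, l=0, r=0):
    """T = (n / #PEs) * (n + k + l + r) / f, in seconds.

    k, l, r are assumed to be per-row overhead cycles (not defined on the slide).
    """
    return (n / num_pes) * (n + k + l + r) / f_hz


def residual_gflops(n, num_pes, f_hz, k=0, l=0, r=0):
    """Throughput, counting one multiply and one add per matrix entry (2 n^2 ops)."""
    return 2.0 * n * n / residual_time(n, num_pes, f_hz, k, l, r) / 1e9


# Hypothetical configuration: 16 PEs at 200 MHz on one 1,024-row block.
print(residual_gflops(1024, 16, 200e6))             # no overhead
print(residual_gflops(1024, 16, 200e6, k=12, l=8))  # with some overhead cycles
```

With no overhead the estimate reduces to 2 × #PEs × f, which is the "2 ops per PE per clock cycle" figure stated on the slide.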
11 Implementation
Step 2. Residual Calculation, place-and-route (PAR) results ("…" marks values lost in the transcription)
Columns: Exp Size | Precision (Mantissa Size) | Pipeline Depth (Add/Mult) | DSP48E | Slice Registers | LUTs | # of BRAM (36Kb) | # of PEs | PAR CLK | GFLOPs

PAR: Xilinx XC5VLX110T
8 | 23 (S) | 12/8 | 2/64 | 1496/69,120 | …/69,120 | 4/… | … | … MHz | 13

PAR: Xilinx XC6VSX475T
8 | 23 (S) | 11/8 | 4/2,… | 1,278/595,200 | 1,748/297,600 | 4/1,… | … | … MHz | 71
… | … | …/12 | 5/2,… | 2,355/595,200 | 2,807/297,600 | 6/1,… | … | … MHz | 35
… | … (D) | 14/15 | 13/2,… | 2,912/595,200 | 3,546/297,600 | 8/1,… | … | … MHz | 21
… | … | …/22 | 16/2,016 | 3,816/595,200 | 4,517/297,600 | 8/1,… | … | … MHz | 23
12 Implementation
Step 3. Triangular System Solver (Block Method), block size = 64
Block lower-triangular system $L y = z$ with $8 \times 8$ blocks:
L11
L21 L22
L31 L32 L33
L41 L42 L43 L44
L51 L52 L53 L54 L55
L61 L62 L63 L64 L65 L66
L71 L72 L73 L74 L75 L76 L77
L81 L82 L83 L84 L85 L86 L87 L88
with unknowns $y_1, \dots, y_8$ and right side $z_1, \dots, z_8$.
Update the z vector as each $y_k$ becomes available:
$z_2 = z_2 - L_{21} y_1$
$z_3 = z_3 - L_{31} y_1 - L_{32} y_2$
...
$z_8 = z_8 - L_{81} y_1 - L_{82} y_2 - \dots - L_{87} y_7$
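The block method on this slide can be sketched in plain Python. This is an illustrative software analogue under simple assumptions (dense lists of floats, hypothetical function names, a tiny block size in the example); it is not the VHDL implementation. Each diagonal block is solved as a small triangular system, and the remaining entries of z are then updated with the newly solved block.

```python
def solve_lower(L, b):
    """Forward substitution for a small dense lower-triangular system."""
    x = []
    for i, row in enumerate(L):
        s = b[i] - sum(row[j] * x[j] for j in range(i))
        x.append(s / row[i])
    return x


def blocked_forward_substitution(L, z, block_size):
    """Solve L y = z block by block, updating z after each diagonal solve."""
    n = len(L)
    z = list(z)  # work on a copy; z is overwritten as on the slide
    y = [0.0] * n
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        diag = [L[i][start:end] for i in range(start, end)]  # diagonal block L_kk
        y[start:end] = solve_lower(diag, z[start:end])       # L_kk y_k = z_k
        for i in range(end, n):                              # z_i <- z_i - L_ik y_k
            z[i] -= sum(L[i][j] * y[j] for j in range(start, end))
    return y


L = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 3.0, 0.0, 0.0],
     [4.0, -1.0, 5.0, 0.0],
     [2.0, 2.0, 1.0, 4.0]]
z = [2.0, 7.0, 17.0, 25.0]
print(blocked_forward_substitution(L, z, block_size=2))
```

The off-diagonal updates are matrix-vector products, so in a blocked design they can reuse the same multiply-add structure as the residual calculation.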
13 Implementation
Step 3. Triangular System Solver (small triangular matrices), block size = 64
$x_0 = \frac{1}{l_{00}}\, b_0$
$x_1 = \frac{1}{l_{11}}\,(b_1 - l_{10} x_0)$
$x_2 = \frac{1}{l_{22}}\,(b_2 - l_{20} x_0 - l_{21} x_1)$
$x_3 = \frac{1}{l_{33}}\,(b_3 - l_{30} x_0 - l_{31} x_1 - l_{32} x_2)$
First iteration: $z_1 = b_1 - l_{10} x_0$, $z_2 = b_2 - l_{20} x_0$, and $z_3 = b_3 - l_{30} x_0$.
In the next iteration, $z_2 = z_2 - l_{21} x_1$ and $z_3 = z_3 - l_{31} x_1$, and so on.
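The update order described on this slide (compute one unknown, then subtract its contribution from every remaining entry) corresponds to a column-oriented forward substitution. A minimal sketch, assuming dense lists of floats and a hypothetical function name:

```python
def column_sweep_solve(L, b):
    """Solve L x = b by eliminating one column of L per iteration."""
    n = len(L)
    z = list(b)  # intermediate vector, overwritten in place
    x = [0.0] * n
    for j in range(n):
        x[j] = z[j] / L[j][j]        # x_j = (1 / l_jj) * z_j
        for i in range(j + 1, n):    # z_i <- z_i - l_ij * x_j for all remaining rows
            z[i] -= L[i][j] * x[j]
    return x


L = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 3.0, 0.0, 0.0],
     [4.0, -1.0, 5.0, 0.0],
     [2.0, 2.0, 1.0, 4.0]]
b = [2.0, 7.0, 17.0, 25.0]
print(column_sweep_solve(L, b))
```

Within one iteration the updates to the remaining entries of z are independent of each other, which is what makes this ordering a natural fit for a pipelined multiply-subtract unit.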
14 Implementation
Step 3. Triangular System Solver (block size = 64)
[Block diagram: arbiter (zf done, zf data); PE (multiply and subtract) fed with Td/zf data; T BRAM holding the triangular matrix; z BRAM holding the b vector => intermediate z vector; Td BRAM holding the diagonal elements => final solution; Div_inter; control signals ext_enable, loc_enable, act (to all the modules); address/data signals addr Td, XADDR, Xdone/xdata, addrx/xdatadly]
Latency from division is hidden
2 operations per clock cycle
15 Performance Comparison
[Chart: GFlops vs. precision (mantissa size); series: FPGA (Xilinx XC6VSX475T) and GPU (NVIDIA GTX480, MAGMA v0.2); data labels: 71, 47, 48, 32, and 35 GFlops]
Data transfer time from the host to both accelerators (GPGPU/FPGA) is excluded.
16 Conclusions
1. XMIR can produce arbitrary accuracy in linear system solvers
2. For applications requiring very high accuracy, the impact of XMIR is maximized
3. XMIR (FPGA): lower precision / beyond double precision; MPIR (GPU): moderately high precision
Future Work
- Hybrid platform (FPGA + GPU): power-aware performance
- Dynamic precision? Update precisions during iteration
Thank you. Any questions?