
Configurable Multiprocessing: An FIR Filter Example
Cmpware, Inc.

Introduction
- Multiple processors on a single device are common
- Thousands of 32-bit RISC CPUs are possible
- Advantages in:
  - Performance
  - Power consumption
  - Programmability
- Q: How best to program such devices?

Configurable Multiprocessing
- Multiple standard processor cores
- Fast point-to-point interconnection
- Balanced communication / computation ratio (1:1 or better)
- Use standard compilers / tools
- Use Cmpware to design and program the multiprocessor

The Cmpware Software
- Uses standard compilers / assemblers / etc.
- Memory-mapped I/O for communication
  - Does not break compilers and other tools
  - Does not break processor IP
  - Provides fast, high-bandwidth communication
  - Provides a simple programming model
- Eclipse-based IDE
- Multiprocessor simulation
- Multiprocessor state and performance display
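A minimal host-side sketch of what memory-mapped communication looks like from plain C. The slides do not give port addresses, so the ports here are backed by ordinary variables; `west_reg`, `east_reg` and `forward_plus()` are hypothetical names introduced only for illustration. On a real device, `west` and `east` would presumably point at fixed addresses taken from the device's memory map.

```c
#include <assert.h>

/* Hypothetical stand-ins for the hardware port registers. On a real
   device these would be fixed addresses from the memory map (not
   given on the slides), not ordinary variables. */
static volatile int west_reg;
static volatile int east_reg;

/* The node code only ever sees two pointers. "volatile" tells the
   compiler that every access is a real load or store that must not
   be cached, merged or reordered away. */
static volatile int *const west = &west_reg;
static volatile int *const east = &east_reg;

/* Read one word from the west port, add a bias, send it east. */
static void forward_plus(int bias)
{
    int word = *west;      /* one load  = one receive */
    *east = word + bias;   /* one store = one send    */
}
```

Because communication is just loads and stores through pointers, an unmodified C compiler can build the node code, which is the point the slide makes about not breaking existing tools.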

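Example: FIR Filter
- Finite Impulse Response (FIR) filter
- Digital Signal Processing (DSP) filter
- Often implemented in hardware and software
- Various methods of parallelizing
[Figure: FIR filter block diagram. Input x(n) feeds a delay line; each tap is multiplied by a coefficient h(0), h(1), h(2), ..., h(n), and the products are summed by a chain of adders to produce y(n).]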
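The block diagram on this slide corresponds to the standard FIR convolution sum. In the form below, N denotes the number of taps; that symbol is an assumption for this note, since the slide labels the coefficients h(0) through h(n).

```latex
y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k)
```

Each term h(k) x(n-k) is one multiplier in the diagram, and the sum is the adder chain that produces y(n).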

The FIR Filter Implementation
- Parameterized on taps and processors
- FIR() and shift() subroutines are standard serial C code
- Parallel code constructs at the highest level, controlling the serial subroutines
- Data is read from the west and sent out to the east, as in the code
- Structurally similar to FIR block diagrams

Example: FIR Filter

    void _start(void) {
        int node;
        int ntaps;

        /* Get the parameters */
        node = *west;
        ntaps = *west;

        /* Send the parameters to the next node */
        *east = (node-1);
        *east = ntaps;

        for (;;) {
            *east = FIR(ntaps, *west);
            *east = shift(ntaps, *west);
        } /* end for() */
    } /* end _start() */

Example: FIR Filter (cont.)

    /*
    ** This method computes the FIR.
    **
    ** @param ntaps The number of taps in the
    **              FIR filter.
    **
    ** @param sum   The partial sum into the FIR.
    **
    ** @return This method returns the FIR result.
    */
    int FIR(int ntaps, int sum) {
        int i;

        for (i=0; i<ntaps; i++)
            sum += h[i] * z[i];

        return (sum);
    } /* end FIR() */

Example: FIR Filter (cont.)

    int shift(int ntaps, int zin) {
        int i, zout;

        /* Save the last value (being shifted out) */
        zout = z[ntaps-1];

        /* Shift the delay line */
        for (i=ntaps-2; i>=0; i--)
            z[i+1] = z[i];

        /* Add in the new shifted in value */
        z[0] = zin;

        return (zout);
    } /* end shift() */
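The slides do not show the declarations of h[] and z[], so the following host-side sketch assumes file-scope arrays. `NTAPS`, the coefficient values and the `node_step()` helper are hypothetical; FIR() and shift() follow the slides. It runs one node-loop iteration at a time so the behaviour can be checked without the communication hardware.

```c
#include <assert.h>

#define NTAPS 4                       /* hypothetical tap count */

static int h[NTAPS] = {1, 2, 3, 4};   /* hypothetical coefficients */
static int z[NTAPS];                  /* delay line, starts at zero */

/* FIR() as on the slides: add this node's taps to the incoming sum. */
static int FIR(int ntaps, int sum)
{
    int i;
    for (i = 0; i < ntaps; i++)
        sum += h[i] * z[i];
    return sum;
}

/* shift() as on the slides: push zin in, return the value pushed out. */
static int shift(int ntaps, int zin)
{
    int i, zout;
    zout = z[ntaps - 1];
    for (i = ntaps - 2; i >= 0; i--)
        z[i + 1] = z[i];
    z[0] = zin;
    return zout;
}

/* One iteration of the node loop: compute with the current delay
   line, then shift in the new sample. Returns the filter output and
   stores the sample shifted out of the delay line in *zout. */
static int node_step(int sum_in, int zin, int *zout)
{
    int y = FIR(NTAPS, sum_in);
    *zout = shift(NTAPS, zin);
    return y;
}
```

Feeding a unit impulse (1 followed by zeros) with sum_in = 0 returns 0 on the first step and then the coefficients 1, 2, 3, 4 in order. The one-step delay arises because FIR() runs before shift() in each iteration.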

FIR Filter Implementation
- Three nodes: data source, processor, consumer
- Single 8-tap FIR
[Figure: Node 2 (data source) -> Node 1 (CMPU running FIR(); shift();) -> Node 0 (results). Each link carries sum and z[n]; the links are labelled with the parameter pairs (8,1) and (8,0).]

FIR Filter Implementation (cont.)
- Four nodes: data source, processor 1, processor 2, consumer
- 4x2 8-tap FIR (4 taps on each of 2 processors)
[Figure: Node 3 (data source) -> Node 2 (FIR(); shift();) -> Node 1 (FIR(); shift();) -> Node 0 (results). Each link carries sum and z[n]; the links are labelled with the parameter pairs (4,2), (4,1) and (4,0).]

FIR Filter Implementation (cont.)
- Ten nodes: data source, 8 processors, consumer
- 8x1 8-tap FIR (1 tap on each of 8 processors)
[Figure: data source -> chain of 8 CMPU nodes, each running FIR(); shift(); -> results node. Each link carries sum and z[n]; the links are labelled with the parameter pairs (1,8), (1,7), ..., (1,1), (1,0).]
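The three configurations above should produce identical output, because each node adds its own taps to the partial sum and passes the sample it shifts out to the next node's delay line. The sketch below checks this on a host machine by simulating the chain sequentially. `struct node`, `chain_init()` and `chain_step()` are hypothetical helpers, and the assumption that the node nearest the data source holds the first coefficients is inferred from the code, not stated on the slides.

```c
#include <assert.h>

#define TOTAL_TAPS 8   /* filter length used on the slides */
#define MAX_NODES  8

/* Per-node state: a slice of the coefficients and a private
   segment of the delay line. */
struct node {
    int ntaps;
    int h[TOTAL_TAPS];
    int z[TOTAL_TAPS];
};

/* Same arithmetic as FIR() on the slides, on per-node state. */
static int node_fir(const struct node *n, int sum)
{
    int i;
    for (i = 0; i < n->ntaps; i++)
        sum += n->h[i] * n->z[i];
    return sum;
}

/* Same arithmetic as shift() on the slides, on per-node state. */
static int node_shift(struct node *n, int zin)
{
    int i, zout = n->z[n->ntaps - 1];
    for (i = n->ntaps - 2; i >= 0; i--)
        n->z[i + 1] = n->z[i];
    n->z[0] = zin;
    return zout;
}

/* Split the coefficients evenly across nprocs nodes, west to east
   (assumed ordering), and clear every delay line. */
static void chain_init(struct node *nodes, int nprocs, const int *coeffs)
{
    int p, i, per;
    assert(nprocs >= 1 && nprocs <= MAX_NODES && TOTAL_TAPS % nprocs == 0);
    per = TOTAL_TAPS / nprocs;
    for (p = 0; p < nprocs; p++) {
        nodes[p].ntaps = per;
        for (i = 0; i < per; i++) {
            nodes[p].h[i] = coeffs[p * per + i];
            nodes[p].z[i] = 0;
        }
    }
}

/* Push one sample through the chain: the data source injects a zero
   partial sum and the sample x; each node forwards both eastward. */
static int chain_step(struct node *nodes, int nprocs, int x)
{
    int p, sum = 0, zval = x;
    for (p = 0; p < nprocs; p++) {
        sum  = node_fir(&nodes[p], sum);
        zval = node_shift(&nodes[p], zval);
    }
    return sum;
}
```

Calling chain_init() with 1, 2 and 8 processors and feeding the same samples to each chain gives the same output sequence, which matches the slides' claim that one piece of code serves every configuration. The sequential loop models only the data dependencies, not the timing of the real parallel hardware.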

The Cmpware Software
- All standard C
- Same code used for various processor configurations
- Can select number of processors dynamically
- Communication synchronizes computation
- Computations only start when inputs are available (pseudo-dataflow)
- Power / performance tradeoffs unavailable in other solutions
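The pseudo-dataflow rule, that a computation starts only when its inputs are available, can be modelled on a host with a small bounded queue per link. This is a sketch under assumptions: the slides do not describe how the hardware makes a node wait, and `struct fifo`, `FIFO_DEPTH` and `node_try_fire()` are hypothetical names.

```c
#include <assert.h>

#define FIFO_DEPTH 4   /* hypothetical link buffer depth */

/* A bounded queue standing in for one point-to-point link. */
struct fifo {
    int data[FIFO_DEPTH];
    int head;    /* index of the oldest word */
    int count;   /* number of words currently queued */
};

/* Returns 1 on success, 0 if the link is full (sender must wait). */
static int fifo_push(struct fifo *f, int v)
{
    if (f->count == FIFO_DEPTH)
        return 0;
    f->data[(f->head + f->count) % FIFO_DEPTH] = v;
    f->count++;
    return 1;
}

/* Removes and returns the oldest word; the caller checks count first. */
static int fifo_pop(struct fifo *f)
{
    int v;
    assert(f->count > 0);
    v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return v;
}

/* Firing rule: the node runs only when an input word is waiting on
   the west link and there is room on the east link. Returns 1 if the
   node fired, 0 if it had to wait. */
static int node_try_fire(struct fifo *west, struct fifo *east, int addend)
{
    if (west->count == 0 || east->count == FIFO_DEPTH)
        return 0;
    fifo_push(east, fifo_pop(west) + addend);
    return 1;
}
```

With this rule no explicit locks or barriers appear in the node code: a node with an empty west link simply does not fire, so the arrival of data is what orders the computation.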

Synchronization
- FIR filter is not simple to parallelize
- CMP communication synchronizes computation
- Hardware solution is to:
  - Use slow N-input adder
  - Use pipelined adder tree
  - Add pipeline delay stages
- All of these HW solutions are very inflexible

FIR Filter Speedup
[Chart: 8-tap FIR filter cycles and 8-tap FIR filter speedup, plotted against the number of processors.]

Using CMP Parallelism
- Simplest parallelism is at the 'task' (subroutine) level
- 1:1 communication / computation ratio permits speedups for very fine-grained parallelism
- Parallelism can be exploited at a very low level (if required)
- Parallelism at several levels is easily exploited
- Excellent control over performance via level of parallelization

The Cmpware Tools
- Eclipse IDE based
- Fast, simple access to multiprocessor data
  - Source-level trace
  - Memory
  - Registers
  - Performance / profiling, and more...
- Pluggable CPU simulator / model generator
- Fully extensible API


Configurable Multiprocessing from Cmpware, Inc.