AnySP: Anytime Anywhere Anyway Signal Processing
|
|
- Leona Patterson
- 6 years ago
- Views:
Transcription
1 1 AnySP: Anytime Anywhere Anyway Signal Processing Mark Woh 1, Sangwon Seo 1, Scott Mahlke 1,Trevor Mudge 1, Chaitali Chakrabarti 2, Krisztian Flautner 3 University of Michigan ACAL 1 Arizona State University 2 ARM, Ltd. 3 1
2 The Old Modern Mobile Mobile Phone Phone 2 Video Recording Future phones are becoming more complex Richer applications require much more requirements Video Editing How do phones handle this now? Higher Data Rates 3D Rendering Advanced Image Processing Photos From
3 Inside Today s Smart Phones 3 Inside the OMAP330 Application Processor Inside the X-Gold 608 (Representation of QCOM) 3 3
4 Power/Performance Requirements for Multiple Systems P e r f o r m a n c e ( G o p s ) Different applications have different power/performance characteristics! SODA ( 65 nm ) Mobile HD Video 3 G Wireless G Wireless SODA ( 90 nm ) We need to design keeping each application in mind! (Not GPP but Domain Specific Processor) TI C 6 X Imagine VIRAM Pentium M IBM Cell Power ( Watts )
5 5 The Applications Is there anything we can learn from the applications themselves? 5 5
6 H.26 Basics 6 Standard Basis Patterns Basis Coefficients Reconstructed Macroblock Modules Entropy decoder Inverse Transform Motion Compensation Intra Prediction Deblocking Filter Inverse Transform Past Frames Algorithms Context-adaptive VLD x integer kernel Half: 6-tap filter Quarter: 6-tap / bilinear filter Eighth: -tap filter Directional spatial prediction In-loop adaptive filter Residual Data Current Frame Power Profiling * Motion Compensation Future Frames Form Prediction T.-A. Liu, T.-M. Lin, S. -Z. Wang, et al. A low-power dual-mode video decoder for mobile applications, IEEE Communications Magazine, volume, issue 8, pp , Aug % 29% 10% 3% Misc Kernels 23% Decoded Block Deblocking Filter Reconstructed Macroblock Decoded Block Current Uncoded Block Intra-Prediction Prediction Based on Neighbors 6 6
7 G Wireless Basics 7 Ant D/A Guard Insertion IFFT Channel Encoder (LDPC) S/P Transmitted Data Modulator Demodulator Ant A/D Symbol Timing / Guard Detect FFT Channel Estimator MIMO Decoder (STBC) Channel Decoder (LDPC) Recovered Data Three kernels make up the majority of the work FFT Extract Data from Signals STBC Combine Data into More Reliable Stream LDPC Error Correction on Data Stream 7 7
8 8 Mobile Signal Processing Algorithm Characteristics G H.26 Algorithm SIMD Scalar Overhead SIMD Width Amount Workload (%) Workload (%) Workload (%) (Elements) of TLP FFT Low STBC High LDPC Low Deblocking Filter Medium Intra- PredicMon SIMD 85 comes 5 at a cost! Medium Inverse Transform Register 80 File 5 Power 15 8 High MoMon CompensaMon High Data Movement/Alignment Cost SIMD architectures have to deal with this! Algorithms have different SIMD widths From very large to very small Though SIMD width varies all algorithms can exploit it Large percentage of work can be SIMDized Larger SIMD width tend to have less TLP 8 8
9 Only the instructions shown in red are MMX computations. All other instructions are simply supporting these computations. 9 Pentium III SIMD code for Discrete Cosine Transform (DCT) lea ebx, DWORD PTR [ebp+128] load/address overhead mov DWORD PTR [esp+28], ebx load/address overhead $B1$2: xor eax, eax address overhead move dx, ecx address overhead lea edi, DWORD PTR [ecx+16] load/address overhead mov DWORD PTR [esp+2], ecx load/address overhead $B1$3: movq mm1, MMWORD PTR [ebp] load overhead pxor mm0, mm0 initialization overhead pmaddwd mm1, MMWORD PTR [eax+esi] True Computation movq mm2, MMWORD PTR [ebp+8] load overhead pmaddwd mm2, MMWORD PTR [eax+esi+8] True Computation add eax, 16address overhead paddw mm1, mm0 True Computation paddw mm2, mm1 True Computation movq mm0, mm2 load related overhead psrlq mm2, 32 SIMD reduction overhead povd ecx, mm0 SIMD load overhead movd ebx, mm2 SIMD load overhead add ecx, ebx SIMD conversion Overhead mov WORD PTR [edx], cx store overhead add edx, 2 address overhead cmp edi, edx branch related overhead jg $B1$3 loop branch overhead $B1$: move cx, DWORD PTR [esp+2] load/address overhead add ebp, 16 address overhead add ecx, 16 address overhead move ax, DWORD PTR [esp+28] load/address overhead cmp eax, ebp branch related overhead jg $B1$2 loop branch overhead 9 Feb 15, 200
10
11
12
13 Traditional SIMD Power Breakdown 13 SCALAR PIPE 9% INTERCONNECT 2% SODA WCDMA MEM 3% CONTROL 38% REG 37% MEM REG ALU+MULT CONTROL INTERCONNECT SCALAR PIPE ALU+MULT 11% Register File Power consumes a lot of power in traditional 32-wide SIMD architecture 13
14 Register File Access 1 100% 90% 80% 70% 60% 50% 0% 30% 20% 10% 0% Register Access Bypass Read Bypass Write Lots of power wasted on unneeded register file access! FFT STBC LDPC Deblocking Filter Intra- PredicMon Inverse Transform Many of the register file access do not have to go back to the main register file 1 1
15 Instruction Pair Frequency 15 InstrucMon Pair Frequency 1 mul9ply- add 26.71% 2 add- add 13.7% 3 shuffle- add 8.5% shib right- add 6.90% 5 subtract- add 6.9% 6 add- shib right 5.76% 7 mul9ply- subtract.00% 8 shib right- subtract 3.75% 9 add- subtract 3.07% 10 Others 20.5% a ) Intra - prediction and Deblocking Filter Combined InstrucMon Pair Frequency 1 shuffle- move 32.07% 2 abs- subtract 8.5% 3 move- subtract 8.5% shuffle- subtract 3.5% 5 add- shuffle 3.5% 6 Others 3.77% b ) LDPC InstrucMon Pair Frequency 1 shuffle- shuffle 16.67% 2 add- mul9ply 16.67% 3 mul9ply- subtract 16.67% mul9ply- add 16.67% 5 subtract- mult 16.67% 6 shuffle- add 16.67% c ) FFT Like the Multiply-Accumulate (MAC) instruction there is opportunity to fuse other instructions A few instruction pairs (3-5) make up the majority of all instruction pairs! 15 15
16 Data Alignment Problem! Intra- PredicMon 16 A B C D A B C D E F G H I J x16 luma MB prediction modes for each x block X I J K L A B C D A B C D A B C D A B C D F G H I E F G H D E F G C D E F A B C D A B C D E F G A B C D E F G a b c d Traditional SIMD machines take too e f g h A B C D B C D E C D E F D E F G A E B F B F C G C G D D D i j k l long or A Bcost C D E F too G H I Jmuch to do A Bthis C D E F G H I J K L m n o p G H I J A B C D G H I J A B C D Good news A B C D small fixed number patterns per kernel A A A A B B B B C C C C D D D D I E D C J F I E K G J A B C D E F G H I J K L D D D F L H K G I J K L E F G H D I J K C D E F H.26 Intra-prediction has 9 different prediction modes Each prediction mode requires a specific permutation 16 16
17 Summary 17 Conclusion about G and H.26 Lots of different sized parallelism From wide to 96 wide to 102 wide SIMD Which means many different SIMD widths need to be supported Very short lived values Lots of potential for instruction fusings Limited set of shuffle patterns required for each kernel 17
18 18 AnySP Design 18 18
19 Traditional SIMD Architectures 19 AGU 32- lane SIMD ALU SIMD Data MEM SIMD RF E X 32- lane SSN W B SIMD S T V SIMD to scalar V T S Scalar Data MEM scalar RF E X 16-bit ALU W B Scalar 32-Wide SIMD with Simple Shuffle Network 19 19
20 AnySP Architecture High Level 20 L1 Program Memory Controller DMA Multi-Bank Local Memory Bank 0 Bank 1 Bank C R O S S B A R 16-bit 8-wide 16 entry SIMD RF-0 16-bit 8-wide 16 entry SIMD RF-1 Multi-SIMD Datapath 16 Banked Memory with 8-wide SIMD FFU-0 8-wide SIMD FFU-1 Swizzle Network 16-bit 8- wide -entry Buffer-0 16-bit 8- wide -entry Buffer-1 Bank 8 Groups 2 of 8-Wide 16-bit 8-wide Flexible Bank Function 16 Units entry 8-wide SIMD 3 SIMD RF-2 FFU-2 Multiple Output Adder Tree 128x128 16bit Swizzle Network SRAM-based Crossbar 16-bit 8- wide -entry Buffer-2 Multi- Output Adder Tree 8 Groups of 8-wide SIMD (6 Total Lanes) To Inter-PE Bus Bank bit 8-wide 8-wide SIMD 16 entry Temporary Buffer and SIMD RF-7 FFU-7 Bypass Network 16-bit 8- wide -entry Buffer-7 Scalar Memory Buffer Scalar Pipeline AGU Group 0 Pipeline Datapath AGU and Scalar Pipeline AGU Group 1 Pipeline AGU Group 7 Pipeline AnySP PE 20 20
21 Multi-Width Support 21 Multi-Bank Local Memory Multi-SIMD Datapath AGU Group 0 AGU Group 1 AGU Group 2 AGU Group 3 Bank 0 Bank 1 Bank 2 Bank 3 Bank C R O S S B A R 16-bit 8-wide 16 entry SIMD RF-0 16-bit 8-wide 16 entry SIMD RF-1 16-bit 8-wide 16 entry SIMD RF-2 16-bit 8-wide 16 entry SIMD RF-3 8-wide SIMD FFU-0 8-wide SIMD FFU-1 8-wide SIMD FFU-2 8-wide SIMD FFU-3 Swizzle Network 16-bit 8- wide -entry Buffer-0 16-bit 8- wide -entry Buffer-1 16-bit 8- wide -entry Buffer-2 16-bit 8- wide -entry Buffer-3 Multi- Output Adder Tree AGU Group 7 Bank bit 8-wide 16 entry SIMD RF-7 8-wide SIMD FFU-7 16-bit 8- wide -entry Buffer-7 Each 8-wide SIMD Group works on different memory Normal 6-Wide SIMD mode all lanes share one AGU locations of the same 8-wide code AGU Offsets 21 21
22 AnySP FFU Datapath 22 Flexible Functional Unit allows us to 1. Exploit Pipeline-parallelism by joining two lanes together 2. Handle register bypass and the temporary buffer 3. Join multiple pipelines to process deeper subgraphs. Fuse Instruction Pairs 22 22
23 23 AnySP Results 23 23
24 Simulation Environment Traditional SIMD architecture comparison SODA at 90nm technology 2 AnySP Synthesized at 90nm TSMC Power, timing, area numbers were extracted Performance and Power for each kernel was generated using synthesized data on in-house simulator G based on a NTT DoCoMo G test setup H.26 CIF@30fps 2 2
25 25 AnySP Speedup vs SIMD-based Architecture N o r m a l i z e d S p e e d u p Baseline 6 - Wide Multi - SIMD Swizzle Network Flexible Functional Unit Buffer + Bypass 0. 0 FFT 102 pt Radix - 2 FFT 102 pt Radix - STBC LDPC H. 26 Intra Prediction H. 26 Deblocking Filter H. 26 Inverse Transform H. 26 Motion Compensation For all benchmarks we perform more than 2x better than a SIMD-based architecture 25 25
26 AnySP Energy-Delay vs SIMD-based Architecture 26 N o r m a l i z e d E n e r g y - D e l a y FFT 102 pt Radix - 2 FFT 102 pt Radix - SIMD - based Architecture AnySP STBC LDPC H. 26 Intra Prediction H. 26 Deblocking Filter H. 26 Inverse Transform H. 26 Motion Compenstation More importantly energy efficiency is much better! 26 26
27 AnySP Power Breakdown 27 PE System Total Est. Components SIMD Data Mem ( 32 KB ) SIMD Register File ( 16 x 102 bit ) SIMD ALUs, Multipliers, and SSN SIMD Pipeline + Clock + Routing SIMD Buffer ( 128 B ) SIMD Adder Tree Intra - processor Interconnect Scalar / AGU Pipeline & Misc. ARM ( Cortex - M 3 ) Global Scratchpad Memory ( 128 KB ) Inter - processor Bus with DMA 90 nm ( MHz ) Units Area mm 2 65 nm ( MHz ) nm ( MHz ) Area Area % % % % % % < 1 % % % % % % % G + H. 26 Decoder Power Power mw % % % % % % < 1 % % % < 1 % < 1 % 1. 5 < 1 % % We estimate that both H.26 and G wireless can be done in under 1 Watt at 5nm 27 27
28 Conclusion & Future Work Conclusion We have presented an example architecture that could possibly meet the requirements of 100Mbps G and HD video on the same platform Under the power budget and meeting the performance at 5nm 28 Future and Ongoing Work Application-specific language Larger class of algorithms for AnySP Better utilization of resources for non-parallel kernels Speedup sequential parts University 28 of Michigan - ACAL 28
Low Power DSP Architectures
Overview 1 AnySP SODA++ Increase the Application Domain of the Wireless baseband architecture Diet-SODA SODA- - Taking the sugar out of SODA 1 1 2 Low Power DSP Architectures Trevor Mudge, Bredt Professor
More informationAnySP: Anytime Anywhere Anyway Signal Processing
AnySP: Anytime Anywhere Anyway Signal Processing Mark Woh, Sangwon Seo, Scott Mahlke, Trevor Mudge, Chaitali Chakrabarti 2 and Krisztian Flautner 3 Advanced Computer Architecture Laboratory 2 Department
More informationAn Ultra Low Power SIMD Processor for Wireless Devices
An Ultra Low Power SIMD Processor for Wireless Devices Mark Woh 1, Sangwon Seo 1, Chaitali Chakrabarti 2, Scott Mahlke 1 and Trevor Mudge 1 1 Advanced Computer Architecture Laboratory 2 School of Electrical,
More informationANYSP: ANYTIME ANYWHERE ANYWAY SIGNAL PROCESSING
... ANYSP: ANYTIME ANYWHERE ANYWAY SIGNAL PROCESSING... LOOKING FORWARD, THE COMPUTATION REQUIREMENTS OF MOBILE DEVICES WILL INCREASE BY ONE TO TWO ORDERS OF MAGNITUDE, BUT THEIR POWER REQUIREMENTS WILL
More informationARCHITECTURE AND ANALYSIS FOR NEXT GENERATION MOBILE SIGNAL PROCESSING. Mark Woh
ARCHITECTURE AND ANALYSIS FOR NEXT GENERATION MOBILE SIGNAL PROCESSING by Mark Woh A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Electrical
More informationWorkload Characterization: Can it save Computer Architecture and Performance Evaluation?
Workload Characterization: Can it save Computer Architecture and Performance Evaluation? Lizy Kurian John The Laboratory for Computer Architecture (LCA) ECE Department, UT Austin ljohn@ece.utexas.edu (512)-232-1455
More informationA scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment
LETTER IEICE Electronics Express, Vol.11, No.2, 1 9 A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment Ting Chen a), Hengzhu Liu, and Botao Zhang College of
More informationCustomizing Wide-SIMD Architectures for H.264
Customizing Wide- Architectures for H.264 S. Seo, M. Woh, S. Mahlke, T. Mudge Department of Electrical and Computer Engineering University of Michigan, Ann Arbor, MI 48109 Email: swseo,mwoh,mahlke,tnm@umich.edu
More informationA System Solution for High-Performance, Low Power SDR
A System Solution for High-Performance, Low Power SDR Yuan Lin 1, Hyunseok Lee 1, Yoav Harel 1, Mark Woh 1, Scott Mahlke 1, Trevor Mudge 1 and Krisztián Flautner 2 1 Advanced Computer Architecture Laboratory
More informationAccelerating 3D Geometry Transformation with Intel MMX TM Technology
Accelerating 3D Geometry Transformation with Intel MMX TM Technology ECE 734 Project Report by Pei Qi Yang Wang - 1 - Content 1. Abstract 2. Introduction 2.1 3 -Dimensional Object Geometry Transformation
More informationMultimedia Decoder Using the Nios II Processor
Multimedia Decoder Using the Nios II Processor Third Prize Multimedia Decoder Using the Nios II Processor Institution: Participants: Instructor: Indian Institute of Science Mythri Alle, Naresh K. V., Svatantra
More informationDynamic Translation for EPIC Architectures
Dynamic Translation for EPIC Architectures David R. Ditzel Chief Architect for Hybrid Computing, VP IAG Intel Corporation Presentation for 8 th Workshop on EPIC Architectures April 24, 2010 1 Dynamic Translation
More informationVLSI Signal Processing
VLSI Signal Processing Programmable DSP Architectures Chih-Wei Liu VLSI Signal Processing Lab Department of Electronics Engineering National Chiao Tung University Outline DSP Arithmetic Stream Interface
More informationUsing MMX Instructions to Implement a Synthesis Sub-Band Filter for MPEG Audio Decoding
Using MMX Instructions to Implement a Synthesis Sub-Band Filter for MPEG Audio Information for Developers and ISVs From Intel Developer Services www.intel.com/ids Information in this document is provided
More informationADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-11: 80x86 Architecture
ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-11: 80x86 Architecture 1 The 80x86 architecture processors popular since its application in IBM PC (personal computer). 2 First Four generations
More information6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU
1-6x86 PROCESSOR Superscalar, Superpipelined, Sixth-generation, x86 Compatible CPU Product Overview Introduction 1. ARCHITECTURE OVERVIEW The Cyrix 6x86 CPU is a leader in the sixth generation of high
More informationDatapoint 2200 IA-32. main memory. components. implemented by Intel in the Nicholas FitzRoy-Dale
Datapoint 2200 IA-32 Nicholas FitzRoy-Dale At the forefront of the computer revolution - Intel Difficult to explain and impossible to love - Hennessy and Patterson! Released 1970! 2K shift register main
More informationPower Estimation of UVA CS754 CMP Architecture
Introduction Power Estimation of UVA CS754 CMP Architecture Mateja Putic mateja@virginia.edu Early power analysis has become an essential part of determining the feasibility of microprocessor design. As
More informationAdvanced Computer Architecture
Advanced Computer Architecture Chapter 1 Introduction into the Sequential and Pipeline Instruction Execution Martin Milata What is a Processors Architecture Instruction Set Architecture (ISA) Describes
More informationMPEG-4: Simple Profile (SP)
MPEG-4: Simple Profile (SP) I-VOP (Intra-coded rectangular VOP, progressive video format) P-VOP (Inter-coded rectangular VOP, progressive video format) Short Header mode (compatibility with H.263 codec)
More informationMapping the AVS Video Decoder on a Heterogeneous Dual-Core SIMD Processor. NikolaosBellas, IoannisKatsavounidis, Maria Koziri, Dimitris Zacharis
Mapping the AVS Video Decoder on a Heterogeneous Dual-Core SIMD Processor NikolaosBellas, IoannisKatsavounidis, Maria Koziri, Dimitris Zacharis University of Thessaly Greece 1 Outline Introduction to AVS
More informationEECE.3170: Microprocessor Systems Design I Summer 2017 Homework 4 Solution
1. (40 points) Write the following subroutine in x86 assembly: Recall that: int f(int v1, int v2, int v3) { int x = v1 + v2; urn (x + v3) * (x v3); Subroutine arguments are passed on the stack, and can
More informationProcessor Structure and Function
WEEK 4 + Chapter 14 Processor Structure and Function + Processor Organization Processor Requirements: Fetch instruction The processor reads an instruction from memory (register, cache, main memory) Interpret
More informationUsing MMX Instructions to Compute the AbsoluteDifference in Motion Estimation
Using MMX Instructions to Compute the AbsoluteDifference in Motion Estimation Information for Developers and ISVs From Intel Developer Services www.intel.com/ids Information in this document is provided
More informationScalable Multi-DM642-based MPEG-2 to H.264 Transcoder. Arvind Raman, Sriram Sethuraman Ittiam Systems (Pvt.) Ltd. Bangalore, India
Scalable Multi-DM642-based MPEG-2 to H.264 Transcoder Arvind Raman, Sriram Sethuraman Ittiam Systems (Pvt.) Ltd. Bangalore, India Outline of Presentation MPEG-2 to H.264 Transcoding Need for a multiprocessor
More informationVector IRAM: A Microprocessor Architecture for Media Processing
IRAM: A Microprocessor Architecture for Media Processing Christoforos E. Kozyrakis kozyraki@cs.berkeley.edu CS252 Graduate Computer Architecture February 10, 2000 Outline Motivation for IRAM technology
More informationMAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES. Mauricio Alvarez-Mesa Techische Universität Berlin - Spin Digital MULTIPROG 2015
MAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES Mauricio Alvarez-Mesa Techische Universität Berlin - Spin Digital MULTIPROG 2015 Video Codecs 70% of internet traffic will be video in 2018 [CISCO] Video
More informationFlexible wireless communication architectures
Flexible wireless communication architectures Sridhar Rajagopal Department of Electrical and Computer Engineering Rice University, Houston TX Faculty Candidate Seminar Southern Methodist University April
More informationIA-32 Architecture COE 205. Computer Organization and Assembly Language. Computer Engineering Department
IA-32 Architecture COE 205 Computer Organization and Assembly Language Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Basic Computer Organization Intel
More informationVersal: AI Engine & Programming Environment
Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY
More informationHistory of the Intel 80x86
Intel s IA-32 Architecture Cptr280 Dr Curtis Nelson History of the Intel 80x86 1971 - Intel invents the microprocessor, the 4004 1975-8080 introduced 8-bit microprocessor 1978-8086 introduced 16 bit microprocessor
More informationWilliam Stallings Computer Organization and Architecture 10 th Edition Pearson Education, Inc., Hoboken, NJ. All rights reserved.
+ William Stallings Computer Organization and Architecture 10 th Edition 2016 Pearson Education, Inc., Hoboken, NJ. All rights reserved. 2 + Chapter 14 Processor Structure and Function + Processor Organization
More informationDigital Video Processing
Video signal is basically any sequence of time varying images. In a digital video, the picture information is digitized both spatially and temporally and the resultant pixel intensities are quantized.
More informationAssembly Language Programming Introduction
Assembly Language Programming Introduction October 10, 2017 Motto: R7 is used by the processor as its program counter (PC). It is recommended that R7 not be used as a stack pointer. Source: PDP-11 04/34/45/55
More informationCS241 Computer Organization Spring 2015 IA
CS241 Computer Organization Spring 2015 IA-32 2-10 2015 Outline! Review HW#3 and Quiz#1! More on Assembly (IA32) move instruction (mov) memory address computation arithmetic & logic instructions (add,
More informationVertex Shader Design I
The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only
More informationLecture 12. Motivation. Designing for Low Power: Approaches. Architectures for Low Power: Transmeta s Crusoe Processor
Lecture 12 Architectures for Low Power: Transmeta s Crusoe Processor Motivation Exponential performance increase at a low cost However, for some application areas low power consumption is more important
More informationHardware and Software Architecture. Chapter 2
Hardware and Software Architecture Chapter 2 1 Basic Components The x86 processor communicates with main memory and I/O devices via buses Data bus for transferring data Address bus for the address of a
More informationCSC 2400: Computer Systems. Towards the Hardware: Machine-Level Representation of Programs
CSC 2400: Computer Systems Towards the Hardware: Machine-Level Representation of Programs Towards the Hardware High-level language (Java) High-level language (C) assembly language machine language (IA-32)
More informationAssembly Language. Lecture 2 - x86 Processor Architecture. Ahmed Sallam
Assembly Language Lecture 2 - x86 Processor Architecture Ahmed Sallam Introduction to the course Outcomes of Lecture 1 Always check the course website Don t forget the deadline rule!! Motivations for studying
More informationReal instruction set architectures. Part 2: a representative sample
Real instruction set architectures Part 2: a representative sample Some historical architectures VAX: Digital s line of midsize computers, dominant in academia in the 70s and 80s Characteristics: Variable-length
More informationUMBC. A register, an immediate or a memory address holding the values on. Stores a symbolic name for the memory location that it represents.
Intel Assembly Format of an assembly instruction: LABEL OPCODE OPERANDS COMMENT DATA1 db 00001000b ;Define DATA1 as decimal 8 START: mov eax, ebx ;Copy ebx to eax LABEL: Stores a symbolic name for the
More informationGeneral Purpose Signal Processors
General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:
More informationISSCC 2006 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1
ISSCC 26 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1 22.1 A 125µW, Fully Scalable MPEG-2 and H.264/AVC Video Decoder for Mobile Applications Tsu-Ming Liu 1, Ting-An Lin 2, Sheng-Zen Wang 2, Wen-Ping Lee
More informationMemory Models. Registers
Memory Models Most machines have a single linear address space at the ISA level, extending from address 0 up to some maximum, often 2 32 1 bytes or 2 64 1 bytes. Some machines have separate address spaces
More informationProcess Variation in Near-Threshold Wide SIMD Architectures
Process Variation in Near-Threshold Wide SIMD Architectures Sangwon Seo 1, Ronald G. Dreslinski 1, Mark Woh 1, Yongjun Park 1, Chaitali Charkrabari 2, Scott Mahlke 1, David Blaauw 1, Trevor Mudge 1 1 University
More informationAge nda. Intel PXA27x Processor Family: An Applications Processor for Phone and PDA applications
Intel PXA27x Processor Family: An Applications Processor for Phone and PDA applications N.C. Paver PhD Architect Intel Corporation Hot Chips 16 August 2004 Age nda Overview of the Intel PXA27X processor
More informationHigh Efficiency Video Coding. Li Li 2016/10/18
High Efficiency Video Coding Li Li 2016/10/18 Email: lili90th@gmail.com Outline Video coding basics High Efficiency Video Coding Conclusion Digital Video A video is nothing but a number of frames Attributes
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 22 Title: and Extended
More informationCS 16: Assembly Language Programming for the IBM PC and Compatibles
CS 16: Assembly Language Programming for the IBM PC and Compatibles Discuss the general concepts Look at IA-32 processor architecture and memory management Dive into 64-bit processors Explore the components
More informationAdvanced Video Coding: The new H.264 video compression standard
Advanced Video Coding: The new H.264 video compression standard August 2003 1. Introduction Video compression ( video coding ), the process of compressing moving images to save storage space and transmission
More informationLoad Effective Address Part I Written By: Vandad Nahavandi Pour Web-site:
Load Effective Address Part I Written By: Vandad Nahavandi Pour Email: AlexiLaiho.cob@GMail.com Web-site: http://www.asmtrauma.com 1 Introduction One of the instructions that is well known to Assembly
More informationAn Efficient Vector/Matrix Multiply Routine using MMX Technology
An Efficient Vector/Matrix Multiply Routine using MMX Technology Information for Developers and ISVs From Intel Developer Services www.intel.com/ids Information in this document is provided in connection
More informationH.264 Decoding. University of Central Florida
1 Optimization Example: H.264 inverse transform Interprediction Intraprediction In-Loop Deblocking Render Interprediction filter data from previously decoded frames Deblocking filter out block edges Today:
More informationIA-32 Architecture. Computer Organization and Assembly Languages Yung-Yu Chuang 2005/10/6. with slides by Kip Irvine and Keith Van Rhein
IA-32 Architecture Computer Organization and Assembly Languages Yung-Yu Chuang 2005/10/6 with slides by Kip Irvine and Keith Van Rhein Virtual machines Abstractions for computers High-Level Language Level
More informationA 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications
A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications Ju-Ho Sohn, Jeong-Ho Woo, Min-Wuk Lee, Hye-Jung Kim, Ramchan Woo, Hoi-Jun Yoo Semiconductor System
More informationAssembly Language. Lecture 2 x86 Processor Architecture
Assembly Language Lecture 2 x86 Processor Architecture Ahmed Sallam Slides based on original lecture slides by Dr. Mahmoud Elgayyar Introduction to the course Outcomes of Lecture 1 Always check the course
More informationCSC 8400: Computer Systems. Machine-Level Representation of Programs
CSC 8400: Computer Systems Machine-Level Representation of Programs Towards the Hardware High-level language (Java) High-level language (C) assembly language machine language (IA-32) 1 Compilation Stages
More informationIMPLEMENTATION OF H.264 DECODER ON SANDBLASTER DSP Vaidyanathan Ramadurai, Sanjay Jinturkar, Mayan Moudgill, John Glossner
IMPLEMENTATION OF H.264 DECODER ON SANDBLASTER DSP Vaidyanathan Ramadurai, Sanjay Jinturkar, Mayan Moudgill, John Glossner Sandbridge Technologies, 1 North Lexington Avenue, White Plains, NY 10601 sjinturkar@sandbridgetech.com
More informationVenezia: a Scalable Multicore Subsystem for Multimedia Applications
Venezia: a Scalable Multicore Subsystem for Multimedia Applications Takashi Miyamori Toshiba Corporation Outline Background Venezia Hardware Architecture Venezia Software Architecture Evaluation Chip and
More informationOn-Chip Vector Coprocessor Sharing for Multicores
On-Chip Vector Coprocessor Sharing for Multicores Spiridon F. Beldianu and Sotirios G. Ziavras Electrical and Computer Engineering Department, New Jersey Institute of Technology, Newark, NJ, USA sfb2@njit.edu,
More informationRISC I from Berkeley. 44k Transistors 1Mhz 77mm^2
The Case for RISC RISC I from Berkeley 44k Transistors 1Mhz 77mm^2 2 MIPS: A Classic RISC ISA Instructions 4 bytes (32 bits) 4-byte aligned Instructions operate on memory and registers Memory Data types
More informationA 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing
A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing Dean Truong, Wayne Cheng, Tinoosh Mohsenin, Zhiyi Yu, Toney Jacobson, Gouri Landge, Michael Meeuwsen, Christine
More informationThe Microprocessor and its Architecture
The Microprocessor and its Architecture Contents Internal architecture of the Microprocessor: The programmer s model, i.e. The registers model The processor model (organization) Real mode memory addressing
More informationHigh performance, power-efficient DSPs based on the TI C64x
High performance, power-efficient DSPs based on the TI C64x Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University {sridhar,cavallar,rixner}@rice.edu RICE UNIVERSITY Recent (2003) Research
More informationAVS VIDEO DECODING ACCELERATION ON ARM CORTEX-A WITH NEON
AVS VIDEO DECODING ACCELERATION ON ARM CORTEX-A WITH NEON Jie Wan 1, Ronggang Wang 1, Hao Lv 1, Lei Zhang 1, Wenmin Wang 1, Chenchen Gu 3, Quanzhan Zheng 3 and Wen Gao 2 1 School of Computer & Information
More informationWe can study computer architectures by starting with the basic building blocks. Adders, decoders, multiplexors, flip-flops, registers,...
COMPUTER ARCHITECTURE II: MICROPROCESSOR PROGRAMMING We can study computer architectures by starting with the basic building blocks Transistors and logic gates To build more complex circuits Adders, decoders,
More informationStatic, multiple-issue (superscaler) pipelines
Static, multiple-issue (superscaler) pipelines Start more than one instruction in the same cycle Instruction Register file EX + MEM + WB PC Instruction Register file EX + MEM + WB 79 A static two-issue
More informationFunctional modeling style for efficient SW code generation of video codec applications
Functional modeling style for efficient SW code generation of video codec applications Sang-Il Han 1)2) Soo-Ik Chae 1) Ahmed. A. Jerraya 2) SD Group 1) SLS Group 2) Seoul National Univ., Korea TIMA laboratory,
More informationUsing MMX Instructions to Perform Simple Vector Operations
Using MMX Instructions to Perform Simple Vector Operations Information for Developers and ISVs From Intel Developer Services www.intel.com/ids Information in this document is provided in connection with
More informationPicoServer : Using 3D Stacking Technology To Enable A Compact Energy Efficient Chip Multiprocessor
PicoServer : Using 3D Stacking Technology To Enable A Compact Energy Efficient Chip Multiprocessor Taeho Kgil, Shaun D Souza, Ali Saidi, Nathan Binkert, Ronald Dreslinski, Steve Reinhardt, Krisztian Flautner,
More informationEECE416 :Microcomputer Fundamentals and Design. X86 Assembly Programming Part 1. Dr. Charles Kim
EECE416 :Microcomputer Fundamentals and Design X86 Assembly Programming Part 1 Dr. Charles Kim Department of Electrical and Computer Engineering Howard University www.mwftr.com 1 Multiple Address Access
More informationLow-Level Essentials for Understanding Security Problems Aurélien Francillon
Low-Level Essentials for Understanding Security Problems Aurélien Francillon francill@eurecom.fr Computer Architecture The modern computer architecture is based on Von Neumann Two main parts: CPU (Central
More informationMoodle WILLINGDON COLLEGE SANGLI (B. SC.-II) Digital Electronics
Moodle 4 WILLINGDON COLLEGE SANGLI (B. SC.-II) Digital Electronics Advanced Microprocessors and Introduction to Microcontroller Moodle developed By Dr. S. R. Kumbhar Department of Electronics Willingdon
More informationCPU Architecture Overview. Varun Sampath CIS 565 Spring 2012
CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More informationIntroduction to Video Compression
Insight, Analysis, and Advice on Signal Processing Technology Introduction to Video Compression Jeff Bier Berkeley Design Technology, Inc. info@bdti.com http://www.bdti.com Outline Motivation and scope
More informationStorage I/O Summary. Lecture 16: Multimedia and DSP Architectures
Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal
More informationOverview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size
Overview Videos are everywhere But can take up large amounts of resources Disk space Memory Network bandwidth Exploit redundancy to reduce file size Spatial Temporal General lossless compression Huffman
More informationComputer Processors. Part 2. Components of a Processor. Execution Unit The ALU. Execution Unit. The Brains of the Box. Processors. Execution Unit (EU)
Part 2 Computer Processors Processors The Brains of the Box Computer Processors Components of a Processor The Central Processing Unit (CPU) is the most complex part of a computer In fact, it is the computer
More informationRegisters. Ray Seyfarth. September 8, Bit Intel Assembly Language c 2011 Ray Seyfarth
Registers Ray Seyfarth September 8, 2011 Outline 1 Register basics 2 Moving a constant into a register 3 Moving a value from memory into a register 4 Moving values from a register into memory 5 Moving
More information16.317: Microprocessor Systems Design I Fall 2015
16.317: Microprocessor Systems Design I Fall 2015 Exam 2 Solution 1. (16 points, 4 points per part) Multiple choice For each of the multiple choice questions below, clearly indicate your response by circling
More informationadministrivia today start assembly probably won t finish all these slides Assignment 4 due tomorrow any questions?
administrivia today start assembly probably won t finish all these slides Assignment 4 due tomorrow any questions? exam on Wednesday today s material not on the exam 1 Assembly Assembly is programming
More informationCreating a Scalable Microprocessor:
Creating a Scalable Microprocessor: A 16-issue Multiple-Program-Counter Microprocessor With Point-to-Point Scalar Operand Network Michael Bedford Taylor J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B.
More informationRealizing Software Defined Radio A Study in Designing Mobile Supercomputers
Realizing Software Defined Radio A Study in Designing Mobile Supercomputers by Yuan Lin A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer
More informationAn HEVC Fractional Interpolation Hardware Using Memory Based Constant Multiplication
2018 IEEE International Conference on Consumer Electronics (ICCE) An HEVC Fractional Interpolation Hardware Using Memory Based Constant Multiplication Ahmet Can Mert, Ercan Kalali, Ilker Hamzaoglu Faculty
More informationUsing MMX Instructions to Implement 2D Sprite Overlay
Using MMX Instructions to Implement 2D Sprite Overlay Information for Developers and ISVs From Intel Developer Services www.intel.com/ids Information in this document is provided in connection with Intel
More informationAssembly Language for Intel-Based Computers, 4 th Edition. Chapter 2: IA-32 Processor Architecture Included elements of the IA-64 bit
Assembly Language for Intel-Based Computers, 4 th Edition Kip R. Irvine Chapter 2: IA-32 Processor Architecture Included elements of the IA-64 bit Slides prepared by Kip R. Irvine Revision date: 09/25/2002
More informationEITF20: Computer Architecture Part2.2.1: Pipeline-1
EITF20: Computer Architecture Part2.2.1: Pipeline-1 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Pipelining Harzards Structural hazards Data hazards Control hazards Implementation issues Multi-cycle
More informationEN164: Design of Computing Systems Lecture 24: Processor / ILP 5
EN164: Design of Computing Systems Lecture 24: Processor / ILP 5 Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University
More informationECE 417 Guest Lecture Video Compression in MPEG-1/2/4. Min-Hsuan Tsai Apr 02, 2013
ECE 417 Guest Lecture Video Compression in MPEG-1/2/4 Min-Hsuan Tsai Apr 2, 213 What is MPEG and its standards MPEG stands for Moving Picture Expert Group Develop standards for video/audio compression
More informationAn Asynchronous Array of Simple Processors for DSP Applications
An Asynchronous Array of Simple Processors for DSP Applications Zhiyi Yu, Michael Meeuwsen, Ryan Apperson, Omar Sattari, Michael Lai, Jeremy Webb, Eric Work, Tinoosh Mohsenin, Mandeep Singh, Bevan Baas
More informationADDRESSING MODES. Operands specify the data to be used by an instruction
ADDRESSING MODES Operands specify the data to be used by an instruction An addressing mode refers to the way in which the data is specified by an operand 1. An operand is said to be direct when it specifies
More information4G WIRELESS VIDEO COMMUNICATIONS
4G WIRELESS VIDEO COMMUNICATIONS Haohong Wang Marvell Semiconductors, USA Lisimachos P. Kondi University of Ioannina, Greece Ajay Luthra Motorola, USA Song Ci University of Nebraska-Lincoln, USA WILEY
More informationThe Instruction Set. Chapter 5
The Instruction Set Architecture Level(ISA) Chapter 5 1 ISA Level The ISA level l is the interface between the compilers and the hardware. (ISA level code is what a compiler outputs) 2 Memory Models An
More information21 Baseband Processing Architectures for SDR
Vijay/Wireless, Networking, Radar, Sensor Array Processing, and Nonlinear Signal Processing 46047_C021 Page Proof page 1 22.6.2009 4:08pm Compositor Name: BMani 21 Baseband Processing Architectures for
More information05 - Microarchitecture, RF and ALU
September 15, 2015 Microarchitecture Design Step 1: Partition each assembly instruction into microoperations, allocate each microoperation into corresponding hardware modules. Step 2: Collect all microoperations
More informationWilliam Stallings Computer Organization and Architecture 8 th Edition. Chapter 14 Instruction Level Parallelism and Superscalar Processors
William Stallings Computer Organization and Architecture 8 th Edition Chapter 14 Instruction Level Parallelism and Superscalar Processors What is Superscalar? Common instructions (arithmetic, load/store,
More informationAssembly Language for Intel-Based Computers, 4 th Edition. Chapter 2: IA-32 Processor Architecture. Chapter Overview.
Assembly Language for Intel-Based Computers, 4 th Edition Kip R. Irvine Chapter 2: IA-32 Processor Architecture Slides prepared by Kip R. Irvine Revision date: 09/25/2002 Chapter corrections (Web) Printing
More informationMicroprocessors vs. DSPs (ESC-223)
Insight, Analysis, and Advice on Signal Processing Technology Microprocessors vs. DSPs (ESC-223) Kenton Williston Berkeley Design Technology, Inc. Berkeley, California USA +1 (510) 665-1600 info@bdti.com
More information