ST20 iCore and architectures
D'Albis Tiziano 707766
Architectures for multimedia systems
Politecnico di Milano, A.A. 2006/2007
22/06/2007

Outline

ST20-iCore
- Introduction
- Architecture overview
- Addressing modes
- Pipeline
- Instruction folding
- Performance evaluations

Lx (ST200)
- Introduction
- Multi-cluster architecture
- Single-cluster architecture
- Code compression
- Performance evaluations

Introduction

iCore is a high-performance implementation of the ST20 architecture designed by STMicroelectronics. It is used in many systems on chip (SoC) for multimedia applications; typical target products are digital set-top boxes (STB) and GPS receivers.

Main goals of the iCore:
- Performance and efficiency
- Portability
- Power consumption

Architecture overview

RISC-like architecture. The main differences with respect to a traditional RISC architecture are:
- use of a variable-length instruction word to promote code compactness
- use of some complex instructions for hardware-implemented kernel functions
- use of a register stack instead of a large number of machine registers

Addressing modes

- Local workspace: the procedure stack; it changes on each procedure call/return
- Local Workspace Pointer: pointer to the start of a local workspace (LWP register)
- Local variable: local_addr = LWP + OFFSET

Two addressing modes:
- Local addressing mode: for local variables, fetched in one stage from the LWC
- Non-local addressing mode: for non-local variables or on an LWC miss, fetched in two stages from the data cache

Pipeline

[Pipeline diagram; blocks: IC, IFB, LWC, AGU, DC, ALU, SB; stages: IF1 IF2 ID1 ID2 OF1 OF2 EXE WB]

- IF1: instruction cache tag access
- IF2: instruction cache data access (IFB load)
- ID1: instruction decode, fetch of local variables from the LWC
- ID2: address generation for non-local variables (AGU)
- OF1: data cache tag access
- OF2: data cache data access, pop operands from the RF
- EXE: ALU execution
- WB: write back (push on the RF or write to the SB)
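The two addressing modes can be illustrated with a small Python sketch. It is only a model under stated assumptions: the `Core` class, its dictionaries standing in for the LWC and the data cache, and the use of abstract address units are all hypothetical; only the formula local_addr = LWP + OFFSET and the one-stage versus two-stage fetch come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Toy model of the local / non-local fetch paths (hypothetical names)."""
    lwp: int                                        # Local Workspace Pointer
    lwc: dict = field(default_factory=dict)         # stand-in for the LWC: address -> value
    data_cache: dict = field(default_factory=dict)  # stand-in for the data cache

    def local_addr(self, offset: int) -> int:
        # local_addr = LWP + OFFSET (abstract address units, by assumption)
        return self.lwp + offset

    def load_local(self, offset: int):
        """Return (value, number of fetch stages) for a local variable."""
        addr = self.local_addr(offset)
        if addr in self.lwc:
            return self.lwc[addr], 1       # hit in the LWC: one stage
        return self.data_cache[addr], 2    # LWC miss: two stages from the data cache

    def load_non_local(self, base: int, offset: int):
        """Non-local variables always take the two-stage data cache path."""
        return self.data_cache[base + offset], 2
```

In this model a local variable that is resident in the LWC costs one stage, while the same access after a miss falls back to the two-stage data cache path, which matches the split between ID1 and OF1/OF2 in the stage list.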

Instruction folding

Instruction folding is an instruction decoding technique that merges multiple instructions into one operation occupying a single execution slot in the pipeline.

[Figure: ldl, ldnl and add folded into a single operation; stage groups IF1 IF2 ID1, ID2 OF1 OF2 and EXE WB, shown against the full pipeline IF1 IF2 ID1 ID2 OF1 OF2 EXE WB]

Performance evaluations

IPC increments:
- Instruction folding: +10%
- LWC: +10%
- Branch prediction logic: +6%
- Return stack: +3%

Well-balanced pipeline even in the worst case.
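A minimal sketch of the folding idea, in Python. The pattern table and the `fold` function are hypothetical; the slides only show that the sequence ldl, ldnl, add can share one execution slot, and they do not describe the actual decoder rules.

```python
# Hypothetical folding table: each tuple is a sequence the decoder may merge.
# Only the ldl / ldnl / add sequence is taken from the slides.
FOLDABLE = [("ldl", "ldnl", "add")]


def fold(instrs):
    """Greedily merge foldable sequences.

    Returns a list of tuples, one tuple per execution slot.
    """
    slots, i = [], 0
    while i < len(instrs):
        for pattern in FOLDABLE:
            n = len(pattern)
            if tuple(instrs[i:i + n]) == pattern:
                slots.append(pattern)      # whole sequence shares one slot
                i += n
                break
        else:
            slots.append((instrs[i],))     # no pattern matched: one slot per instruction
            i += 1
    return slots
```

With this table, `fold(["ldl", "ldnl", "add"])` yields a single slot instead of three, which is the effect behind the IPC gain attributed to folding.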

Introduction

- Lx is an architectural framework (hardware and software toolchain) for VLIW cores designed by HP Laboratories and STMicroelectronics
- ST200 is the family of embedded cores implementing the Lx architecture
- Lx cores target integer computation-intensive media-processing applications
- The most important application domains are digital still imaging, video and audio processing, networking, and cryptography

Multi-cluster architecture

- Lx is a statically scheduled, multi-cluster VLIW architecture
- Each cluster is a 4-issue VLIW core
- All clusters are controlled by a single PC, and there is a unified 32 KB instruction cache
- The address space is logically shared; inter-cluster communication is achieved by explicit register-to-register moves
- The data cache is a 32 KB, 4-way associative, write-back array
- A MESI-like synchronization protocol provides cache coherency
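The multi-cluster organisation can be sketched as follows. This is a toy model, not the Lx instruction set: the operation names `add` and `xmov`, the tuple encoding and the `step` function are all hypothetical. It only illustrates three points from the slide: one long instruction drives all clusters in lockstep, each cluster issues at most four operations, and a value crosses clusters only through an explicit register-to-register move.

```python
ISSUE_WIDTH = 4  # each cluster is a 4-issue VLIW core


def step(regs, bundle):
    """Execute one long instruction on all clusters.

    regs:   list of per-cluster register lists
    bundle: list with one entry per cluster, each a list of operation tuples
    """
    writes = []  # (cluster, register, value); committed after all reads
    for c, ops in enumerate(bundle):
        if len(ops) > ISSUE_WIDTH:
            raise ValueError("too many operations for one cluster")
        for op in ops:
            if op[0] == "add":        # ("add", dst, a, b): cluster-local operation
                _, dst, a, b = op
                writes.append((c, dst, regs[c][a] + regs[c][b]))
            elif op[0] == "xmov":     # ("xmov", dst_cluster, dst, src): explicit move
                _, dst_cluster, dst, src = op
                writes.append((dst_cluster, dst, regs[c][src]))
            else:
                raise ValueError(f"unknown operation {op[0]!r}")
    for c, r, v in writes:
        regs[c][r] = v
```

Because an `add` can only read registers of its own cluster, a result produced on cluster 0 has to be copied with an explicit move before cluster 1 can use it; in a statically scheduled machine the compiler is responsible for placing that move.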

Single-cluster architecture

[Figure: single-cluster architecture]

Code compression

Main causes of increases in code size, with the corresponding remedies:
- Sparse ILP encoding (frequent nops under a one-to-one mapping of operations to issue slots) -> stop-bundle bit
- RISC encoding and exposed latencies (intrinsically sparse encoding) -> Huffman-like compression
- Compiler-driven code expansion -> compiler tuning
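The effect of a stop-bundle bit on sparse ILP encoding can be estimated with a short Python sketch. The 32-bit operation size and the assumption that the stop bit fits inside the operation word are illustrative choices, not figures from the slides; the function names are hypothetical.

```python
SLOTS = 4      # issue width of one cluster
OP_BITS = 32   # assumption: fixed-size operation encoding


def size_fixed(bundles):
    """One-to-one mapping: every bundle occupies SLOTS words, padded with nops."""
    return len(bundles) * SLOTS * OP_BITS


def encode_stop_bit(bundles):
    """Emit only the real operations; the flag marks the last one of each bundle."""
    stream = []
    for ops in bundles:
        if not 1 <= len(ops) <= SLOTS:
            raise ValueError("a bundle holds between 1 and SLOTS operations")
        for i, op in enumerate(ops):
            stream.append((op, i == len(ops) - 1))
    return stream


def size_stop_bit(bundles):
    """Code size when the stop bit lives inside each operation word (assumption)."""
    return len(encode_stop_bit(bundles)) * OP_BITS
```

For three bundles holding 2, 1 and 4 operations, the padded encoding needs 12 operation words while the stop-bit encoding needs 7, so the saving grows with the number of empty issue slots the scheduler leaves.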

Performance evaluations

- Frequency vs. power: frequency has a cubic effect on power consumption
- Issue width vs. cost: scaling the number of clusters affects area (not relevant) and cost
- Best performance within the target application domain
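The cubic relation between frequency and power stated in the Lx performance evaluation can be motivated by the usual dynamic-power model. This is a sketch under two assumptions that the slides do not state: dynamic (switching) power dominates, and the supply voltage has to scale roughly linearly with the clock frequency. The symbols \alpha (activity factor), C (switched capacitance), V (supply voltage) and f_0 (reference frequency) are introduced here for illustration.

```latex
\begin{align}
P_{\text{dyn}} &= \alpha \, C \, V^{2} f \\
V &\propto f \quad \text{(assumed voltage scaling)} \\
\Rightarrow \; P_{\text{dyn}} &\propto f^{3},
\qquad \frac{P(1.2\, f_0)}{P(f_0)} = 1.2^{3} \approx 1.73
\end{align}
```

Under these assumptions a 20% increase in clock frequency costs about 73% more power, which is one reason to gain performance through issue width rather than frequency.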