ST20 iCore and architectures

D'Albis Tiziano
Architectures for multimedia systems
Politecnico di Milano, A.A. 2006/2007
22/06/2007

Outline

ST20-iCore
- Introduction
- Architecture overview
- Addressing modes
- Pipeline
- Instruction folding
- Performance evaluations

Lx / ST200
- Introduction
- Multi-cluster architecture
- Single-cluster architecture
- Code compression
- Performance evaluations

Introduction

iCore is a high-performance implementation of the ST20 architecture, designed by STMicroelectronics. It is used in many systems-on-chip (SoC) for multimedia applications. Typical target products are digital set-top boxes (STB) and GPS receivers.

Main goals of the iCore:
- Performance and efficiency
- Portability
- Power consumption

Architecture overview

RISC-like architecture. The main differences with respect to a traditional RISC architecture are:
- a variable-length instruction word, to promote code compactness
- some complex instructions for hardware-implemented kernel functions
- a register stack instead of a large number of machine registers

Addressing modes

- Local workspace: the procedure stack; it changes on each procedure call/return
- Local Workspace Pointer: pointer to the start of the local workspace (LWP register)
- Local variable: local_addr = LWP + OFFSET

Two addressing modes:
- Local addressing mode: for local variables; the operand is fetched in one stage from the LWC
- Non-local addressing mode: for non-local variables, or on an LWC miss; the operand is fetched in two stages from the data cache

Pipeline

Stages: IF1, IF2, ID1, ID2, OF1, OF2, EXE, WB
[Figure: pipeline diagram; units shown: IFB, IC, LWC, AGU, DC, ALU, SB]

- IF1: instruction cache tag access
- IF2: instruction cache data access (IFB load)
- ID1: instruction decode, fetch of local variables from the LWC
- ID2: address generation for non-local variables (AGU)
- OF1: data cache tag access
- OF2: data cache data access, pop of operands from the RF
- EXE: ALU execution
- WB: write back (push on the RF or write to the SB)
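The relation between the two addressing modes and the pipeline stages can be sketched in a few lines of Python. This is only an illustrative model, not the iCore logic: the function names are hypothetical, the units of OFFSET are not given on the slide, and routing a local-variable LWC miss through ID2 for address generation is an assumption.

```python
def local_address(lwp, offset):
    """local_addr = LWP + OFFSET, as on the slide (units not specified there)."""
    return lwp + offset


def operand_fetch_stages(is_local, lwc_hit):
    """Return the pipeline stages involved in fetching one operand.

    Local variable that hits in the LWC: fetched in ID1.
    Otherwise: address generation in ID2 (assumed also for an LWC miss),
    then data cache tag access (OF1) and data access (OF2).
    """
    if is_local and lwc_hit:
        return ["ID1"]
    return ["ID2", "OF1", "OF2"]


if __name__ == "__main__":
    print(hex(local_address(0x1000, 8)))
    print(operand_fetch_stages(is_local=True, lwc_hit=True))
    print(operand_fetch_stages(is_local=False, lwc_hit=False))
```

The point of the model is that a local hit is resolved at decode time, while every other operand has to go through the data cache stages later in the pipeline.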

Instruction folding

An instruction decoding technique that merges multiple instructions into one operation occupying a single execution slot in the pipeline.

[Figure: pipeline diagram for the sequence ldl, ldnl, add across the stages IF1, IF2, ID1, ID2, OF1, OF2, EXE, WB]

Performance evaluations

IPC increments:
- Instruction folding: +10%
- LWC: +10%
- Branch prediction logic: +6%
- Return stack: +3%

Well-balanced pipeline even in the worst case.
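As a rough illustration of the instruction folding idea described above, the sketch below groups a stream of mnemonics into operations, merging any run that matches a foldable pattern. The only pattern included is the ldl, ldnl, add sequence shown on the slide; the greedy matching strategy and the names `FOLDABLE` and `fold` are assumptions made for this example, not the actual iCore decoder.

```python
# Foldable patterns: only the sequence shown on the slide is listed here.
FOLDABLE = {("ldl", "ldnl", "add")}


def fold(stream, patterns=FOLDABLE):
    """Group mnemonics into operations; each operation takes one execution slot."""
    ops = []
    i = 0
    while i < len(stream):
        # Try the longest patterns first (greedy matching, an assumption).
        for pattern in sorted(patterns, key=len, reverse=True):
            if tuple(stream[i:i + len(pattern)]) == pattern:
                ops.append(pattern)
                i += len(pattern)
                break
        else:
            # No pattern matched: the instruction occupies a slot on its own.
            ops.append((stream[i],))
            i += 1
    return ops


if __name__ == "__main__":
    stream = ["ldl", "ldnl", "add", "stl"]
    folded = fold(stream)
    print(folded)
    print(len(stream), "instructions ->", len(folded), "execution slots")
```

In this toy stream, four instructions occupy two execution slots after folding, which is the effect the IPC figure for folding refers to.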

Introduction

- Lx is an architectural framework (hardware and software toolchain) for VLIW cores, designed by HP Laboratories and STMicroelectronics
- ST200 is the family of embedded cores implementing the Lx architecture
- Lx cores are targeted at integer computation-intensive media-processing applications
- The most important domains in which they are used are digital still imaging, video and audio processing, networking and cryptography

Multi-cluster architecture

- Lx is a statically scheduled VLIW multi-cluster architecture
- Each cluster is a 4-issue VLIW core
- All the clusters are controlled by a single PC, and there is a unified 32 KB instruction cache
- The address space is logically shared; inter-cluster communication is achieved by explicit register-to-register moves
- The data cache is a 32 KB, 4-way associative, write-back array
- A MESI-like synchronization protocol provides cache coherency
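The explicit register-to-register move between clusters can be pictured with a toy model in which each cluster owns a private register file. The class and function names and the register count are hypothetical and chosen only for illustration; the slides do not describe the actual instruction or its encoding.

```python
class Cluster:
    """Toy cluster with a private register file (register count is an assumption)."""

    def __init__(self, n_regs=64):
        self.regs = [0] * n_regs


def add(cluster, dst, a, b):
    """An operation only reads and writes registers of its own cluster."""
    cluster.regs[dst] = cluster.regs[a] + cluster.regs[b]


def intercluster_move(src, src_reg, dst, dst_reg):
    """Explicit copy of one register value from cluster src to cluster dst."""
    dst.regs[dst_reg] = src.regs[src_reg]


if __name__ == "__main__":
    c0, c1 = Cluster(), Cluster()
    c0.regs[1] = 5                    # value produced on cluster 0
    c1.regs[2] = 7                    # value already on cluster 1
    intercluster_move(c0, 1, c1, 3)   # the schedule must contain this move
    add(c1, 4, 2, 3)                  # cluster 1 can now use the value
    print(c1.regs[4])
```

Because the architecture is statically scheduled, such a move would have to be placed in the schedule ahead of the consuming operation rather than resolved by hardware at run time.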

Single-cluster architecture

[Figure: single-cluster architecture]

Code compression

Main causes of increases in code size, and the countermeasure for each:
- Sparse ILP encoding (frequent nops for a one-to-one mapping) -> stop-bundle bit
- RISC encoding and exposed latencies (intrinsically sparse encoding) -> Huffman-like compression
- Compiler-driven code expansion -> compiler tuning
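The effect of the stop-bundle bit on nop padding can be sketched as follows. This is a simplified model, assuming a 4-issue cluster as stated in the slides and bundles of one to four operations; the function names and the (operation, stop) pair representation are illustrative and do not reflect the real ST200 instruction encoding.

```python
ISSUE_WIDTH = 4  # each cluster is a 4-issue VLIW core


def encode_padded(bundles):
    """One-to-one mapping: every bundle is padded with nops to ISSUE_WIDTH slots."""
    out = []
    for bundle in bundles:
        out.extend(bundle + ["nop"] * (ISSUE_WIDTH - len(bundle)))
    return out


def encode_stop_bit(bundles):
    """Emit (operation, stop) pairs; stop=True marks the last operation of a bundle."""
    out = []
    for bundle in bundles:
        assert 1 <= len(bundle) <= ISSUE_WIDTH  # assumption: no empty bundles
        for i, op in enumerate(bundle):
            out.append((op, i == len(bundle) - 1))
    return out


def decode_stop_bit(encoded):
    """Rebuild the bundles by cutting the stream after every stop bit."""
    bundles, current = [], []
    for op, stop in encoded:
        current.append(op)
        if stop:
            bundles.append(current)
            current = []
    return bundles


if __name__ == "__main__":
    bundles = [["add", "ld"], ["mul"], ["add", "sub", "ld", "st"]]
    print(len(encode_padded(bundles)), "slots with nop padding")
    print(len(encode_stop_bit(bundles)), "operations with stop bits")
```

With the stop bit, only real operations are stored, and the bundle boundaries are still recoverable from the stream.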

Performance evaluations

- Frequency vs. power: frequency has a cubic effect on power consumption
- Issue width vs. cost: scaling the number of clusters impacts the area (not relevant) and the cost
- Best performance inside the application domain

References

iCore:
[1] The iCore 520 MHz Synthesizable CPU Core - Richardson, Huang, Hossain (2002)
[2] ST20C2/C4 Core Instruction Set Reference Manual - STMicroelectronics (1996)

Lx / ST200:
[1] LX: A Technology Platform for Customizable VLIW Embedded Processing - P. Faraboschi, G. Brown, J.A. Fisher, G. Desoli, F. Homewood (2000)
[2] ST200: A VLIW Architecture for Media-Oriented Applications - P. Faraboschi, F. Homewood
[3] VLIW lessons - Daniele Bagni (2004)
