ST20 iCore and Lx architectures
D'Albis Tiziano 707766
Architectures for multimedia systems
Politecnico di Milano, A.A. 2006/2007
22/06/2007

Outline

ST20-iCore
- Introduction
- Architecture overview
- Addressing modes
- Pipeline
- Instruction folding
- Performance evaluations

Lx
- Introduction
- Multi-cluster architecture
- Single-cluster architecture
- Code compression
- Performance evaluations

Introduction

iCore is a high-performance implementation of the ST20 architecture designed by STMicroelectronics. It is used in many systems-on-chip (SoCs) for multimedia applications. Typical target products are digital set-top boxes (STBs) and GPS receivers.

Main goals of the iCore:
- Performance and efficiency
- Portability
- Power consumption

Architecture overview

The iCore has a RISC-like architecture. The main differences with respect to a traditional RISC architecture are:
- a variable-length instruction word, to promote code compactness
- some complex instructions for hardware-implemented kernel functions
- a register stack instead of a large number of machine registers

Addressing modes

- Local workspace: the procedure stack; it changes on each procedure call/return
- Local Workspace Pointer: pointer to the start of a local workspace (LWP register)
- Local variable: local_addr = LWP + OFFSET

Two addressing modes:
- Local addressing mode: for local variables; the operand is fetched in one stage from the local workspace cache (LWC)
- Non-local addressing mode: for non-local variables, or on an LWC miss; the operand is fetched in two stages from the data cache

Pipeline

[Figure: pipeline diagram showing the blocks IC, IFB, LWC, AGU, DC, ALU and SB across the stages IF1, IF2, ID1, ID2, OF1, OF2, EXE, WB]

- IF1: instruction cache tag access
- IF2: instruction cache data access (IFB load)
- ID1: instruction decode, fetch of local variables from the LWC
- ID2: address generation for non-local variables (AGU)
- OF1: data cache tag access
- OF2: data cache data access, pop operands from the RF
- EXE: ALU execution
- WB: write back (push on the RF or write to the SB)
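The addressing rule local_addr = LWP + OFFSET and the one-stage versus two-stage operand fetch can be illustrated with a minimal sketch. This is a toy model, not the iCore hardware: the word size, the scaling of OFFSET from words to bytes, the representation of the LWC as a set of cached addresses, and all function names are assumptions made for illustration.

```python
WORD_BYTES = 4  # assumption: 32-bit words, OFFSET counted in words


def local_address(lwp: int, offset: int) -> int:
    """local_addr = LWP + OFFSET, with OFFSET scaled to bytes (assumption)."""
    return lwp + offset * WORD_BYTES


def fetch_stages(addr: int, is_local: bool, lwc: set) -> int:
    """Number of pipeline stages the operand fetch takes in this toy model.

    `lwc` is a hypothetical stand-in for the LWC: the set of addresses it holds.
    """
    if is_local and addr in lwc:
        return 1  # local addressing mode: one-stage fetch from the LWC
    return 2      # non-local variable or LWC miss: two-stage data cache access


lwp = 0x1000
cached = local_address(lwp, 3)
lwc = {cached}

print(hex(cached))                                       # 0x100c
print(fetch_stages(cached, True, lwc))                   # 1 (LWC hit)
print(fetch_stages(local_address(lwp, 5), True, lwc))    # 2 (LWC miss)
print(fetch_stages(0x8000, False, lwc))                  # 2 (non-local)
```

In pipeline terms, the one-stage case corresponds to the fetch in ID1, while the two-stage case corresponds to the data cache tag and data accesses in OF1 and OF2.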

Instruction folding

Instruction folding is a decoding technique that merges multiple instructions into one operation occupying a single execution slot in the pipeline.

Example: ldl, ldnl and add each use a different part of the pipeline, so they can be folded into one operation:

            IF1  IF2  ID1  ID2  OF1  OF2  EXE  WB
ldl         IF1  IF2  ID1   -    -    -    -    -
ldnl         -    -    -   ID2  OF1  OF2   -    -
add          -    -    -    -    -    -   EXE   WB
folded op   IF1  IF2  ID1  ID2  OF1  OF2  EXE   WB

Performance evaluations

IPC increments:
- Instruction folding: +10%
- LWC: +10%
- Branch prediction logic: +6%
- Return stack: +3%

The pipeline is well balanced even in the worst case.
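The idea behind instruction folding can be sketched as a decoder pass that groups a short run of operand loads followed by an ALU instruction into one pipeline slot. This is only an illustration built on the ldl / ldnl / add example from the slide: the folding rule, the limit of two loads per slot, and the function name `fold` are hypothetical and are not taken from the iCore documentation.

```python
LOADS = {"ldl", "ldnl"}   # operand loads from the slide's example
ALU_OPS = {"add"}         # ALU instruction from the slide's example


def fold(instructions, max_loads=2):
    """Greedily group up to `max_loads` loads plus one ALU op into a single slot.

    Returns a list of tuples; each tuple is one execution slot.
    """
    slots = []
    i = 0
    while i < len(instructions):
        j = i
        # Collect a run of loads starting at position i.
        while j < len(instructions) and j - i < max_loads and instructions[j] in LOADS:
            j += 1
        if j > i and j < len(instructions) and instructions[j] in ALU_OPS:
            slots.append(tuple(instructions[i:j + 1]))  # folded operation
            i = j + 1
        else:
            slots.append((instructions[i],))            # issued on its own
            i += 1
    return slots


print(fold(["ldl", "ldnl", "add"]))   # [('ldl', 'ldnl', 'add')]
print(fold(["ldl", "add", "add"]))    # [('ldl', 'add'), ('add',)]
```

In the first call, three instructions occupy one slot instead of three, which is the effect the folded row of the pipeline table describes.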

Introduction

- Lx is an architectural framework (hardware and software toolchain) for VLIW cores, designed by HP Laboratories and STMicroelectronics
- ST200 is the family of embedded cores implementing the Lx architecture
- Lx cores target integer, computation-intensive media-processing applications
- The most important application domains are digital still imaging, video and audio processing, networking and cryptography

Multi-cluster architecture

- Lx is a statically scheduled, multi-cluster VLIW architecture
- Each cluster is a 4-issue VLIW core
- All clusters are controlled by a single PC and share a unified 32 KB instruction cache
- The address space is logically shared; inter-cluster communication is achieved by explicit register-to-register moves
- The data cache is a 32 KB, 4-way associative, write-back array
- A MESI-like synchronization protocol provides cache coherency
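As a worked example of what "32 KB, 4-way associative" implies, the sketch below derives the number of sets and splits an address into tag, index and offset. The cache size and associativity come from the slide; the 32-byte line size is an assumption, since the slides do not state it, and the function name is hypothetical.

```python
CACHE_BYTES = 32 * 1024   # from the slide
WAYS = 4                  # from the slide
LINE_BYTES = 32           # assumption: line size is not given in the slides

SETS = CACHE_BYTES // (WAYS * LINE_BYTES)
OFFSET_BITS = LINE_BYTES.bit_length() - 1   # log2 of a power of two
INDEX_BITS = SETS.bit_length() - 1


def split_address(addr: int):
    """Split a byte address into (tag, set index, byte offset within the line)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


print(SETS, OFFSET_BITS, INDEX_BITS)                    # 256 5 8
print([hex(x) for x in split_address(0x12345678)])      # ['0x91a2', '0xb3', '0x18']
```

With a different line size only `LINE_BYTES` changes; the number of sets scales inversely with it.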

Single-cluster architecture

[Figure: single-cluster architecture block diagram]

Code compression

Main causes of increased code size, with the corresponding remedy:
- Sparse ILP encoding (frequent nops due to the one-to-one mapping): stop bundle bit
- RISC encoding and exposed latencies (intrinsically sparse encoding): Huffman-like compression
- Compiler-driven code expansion: compiler tuning
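The effect of a stop bundle bit on sparse ILP encoding can be shown with a small sketch. Instead of padding every bundle with nops up to the issue width, each operation carries one bit that marks the end of its bundle. The operation names, the (operation, stop) pair representation and the function names are hypothetical; only the 4-issue width comes from the slides, and bundles are assumed to be non-empty.

```python
ISSUE_WIDTH = 4  # each cluster is 4-issue, per the slides


def padded_size(bundles):
    """Instruction words needed when every bundle is nop-padded to full width."""
    return len(bundles) * ISSUE_WIDTH


def encode_with_stop_bit(bundles):
    """Flatten bundles into (operation, stop) pairs; stop=True ends a bundle."""
    stream = []
    for bundle in bundles:
        for k, op in enumerate(bundle):
            stream.append((op, k == len(bundle) - 1))
    return stream


def decode(stream):
    """Rebuild the bundles by cutting the stream after each stop bit."""
    bundles, current = [], []
    for op, stop in stream:
        current.append(op)
        if stop:
            bundles.append(current)
            current = []
    return bundles


program = [["ld", "add"], ["mul"], ["ld", "ld", "add", "st"], ["br"]]
stream = encode_with_stop_bit(program)

print(padded_size(program), len(stream))   # 16 8
print(decode(stream) == program)           # True
```

In this example the stop-bit stream needs 8 operation words where nop padding would need 16, and decoding recovers the original bundles exactly.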

Performance evaluations

- Frequency vs. power: frequency has a cubic effect on power consumption
- Issue width vs. cost: scaling the number of clusters impacts the area (not relevant) and the cost
- Best performance is obtained inside the application domain

References

iCore:
[1] Richardson, Huang, Hossain. The icore 520 MHz Synthesizable CPU Core (2002)
[2] STMicroelectronics. ST20C2/C4 Core Instruction Set Reference Manual (1996)

Lx:
[1] P. Faraboschi, G. Brown, J.A. Fisher, G. Desoli, F. Homewood. LX: A Technology Platform for Customizable VLIW Embedded Processing (2000)
[2] P. Faraboschi, F. Homewood. ST200: A VLIW Architecture for Media-Oriented Applications
[3] Daniele Bagni. VLIW lessons (2004)