A True Single Cycle RISC Processor without Pipelining


Robert S. Plachno, VP of Audio

Abstract: This paper details the design of an embedded RISC controller used for mixed-signal audio integrated circuits. This processor replaced an existing 8-bit CISC embedded processor and obtained a performance improvement of about 6x. This performance improvement was entirely due to architectural improvements, using the same input clock rate and external ROM IP block.

Index Terms: Computer architecture, Memory management, Pipeline processing, Reduced instruction set computing.

I. INTRODUCTION

The architecture of a RISC processor should support single cycle operation. Single cycle operation is defined here as continuous instruction fetches from the instruction memory (ROM in this case) at the maximum access rate of the memory. Most RISC processor designs obtain this performance by pipelining. With a pipelined architecture each instruction is fetched assuming the next instruction is at the next physical instruction address (PC+1). If a jump instruction occurs, the pipeline is flushed or a delay must occur while the correct instruction address is calculated. This paper describes a RISC architecture in which single cycle operation is obtained without using a pipelined design. This RISC processor was designed as an embedded controller. Other architectural advantages of this design are discussed, as well as the implementation and design techniques.

II. THE LIMITATIONS OF THE PREVIOUS DESIGN

A. CISC versus RISC

The previous designs used an 8-bit CISC as an embedded controller. This CISC was inefficient and struggled to support continuous customer requests for additional features. Register instructions required 5 internal clock cycles, and instructions using external memory usually required 6 internal cycles. The internal register-to-register instructions had a low utilization in the program; instructions using external memory were more common.
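The cycle counts above bound what a single cycle design can gain. The following is a rough back-of-the-envelope sketch in Python, not a measurement: the 30/70 instruction mix, the one-for-one instruction mapping (`expansion`), and the assumption that one RISC cycle equals one CISC internal clock cycle are all illustrative assumptions that do not come from this paper.

```python
def average_cpi(mix):
    """Weighted cycles per instruction for {class_name: (fraction, cycles)}."""
    total = sum(fraction for fraction, _ in mix.values())
    assert abs(total - 1.0) < 1e-9, "instruction fractions must sum to 1"
    return sum(fraction * cycles for fraction, cycles in mix.values())


def speedup(cisc_cpi, risc_cpi=1.0, expansion=1.0):
    """Speedup of the RISC over the CISC at the same clock rate.

    expansion is the assumed average number of RISC instructions emitted
    per CISC instruction (1.0 means a one-for-one mapping).
    """
    return cisc_cpi / (risc_cpi * expansion)


# Cycle counts (5 and 6) come from the text; the 30/70 split is an assumption.
cisc_mix = {"register": (0.30, 5), "external_memory": (0.70, 6)}
cisc_cpi = average_cpi(cisc_mix)
estimated_speedup = speedup(cisc_cpi)
```

Under these assumptions any instruction mix gives an estimate between 5x and 6x, which is in line with the "about 6x" figure quoted in the abstract. An `expansion` above 1.0 lowers the estimate accordingly.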
The instruction fetch occurred over an 8-bit bus using multiple cycles, the number of which depended on the instruction type. The CISC design used a ROM organized as 24K by 8 for the instruction memory. Most instruction fetches required 2 to 3 ROM accesses. Adopting a RISC architecture was forced by the new customer feature requirements. This integrated circuit did not use a PLL, so the input clock rate could not simply be sped up. Any performance improvement had to come purely from an architectural change that obtained single cycle operation.

This technical paper describes an embedded RISC controller used in mixed-signal audio products from 1993 through 1999.

B. Operating Voltage Range

This integrated circuit was for PC audio products that could be designed into desktop PCs or notebooks. In this time frame desktops used 5V operation while notebooks required 3.3V operation. The analog circuits were designed to run over this wide power supply range and at multiple fabrication houses. It was surprising to find that the circuit with the least operating voltage margin was the CISC processor. The CISC was a fully custom design that had issues with low voltage operation. The new RISC design lowered the minimum operating voltage dramatically, to the point where the processor was no longer the limiting factor and the chip gained over half a volt of margin.

C. Royalty Cost

The CISC was a purchased design which required royalty payments. Since the PC audio products shipped in volumes of over 2M per month, it was desirable to eliminate this cost burden.

III. MEMORY FOR PROCESSOR DESIGNS

My original experience was in memory design, including pseudo-static designs where the memory pre-charge is hidden from the user. Later I worked on several processor designs, including a 64-bit processor that had a large design group. It became apparent that most processor engineers did not understand how memories worked, and I had to teach them how to interface efficiently to their memory blocks.
Memories require a pre-charge. In this time frame it was standard practice to divide the memory cycle in two: the first half-cycle is for pre-charge and the second half-cycle is the actual memory access. Addresses must change only during the pre-charge time, and the address set-up time is actually measured to the center clock transition (at the end of the first half-cycle). The existing ROM used for the CISC was designed exactly in this manner.

As time progressed, logic engineers became even more ignorant of their memory blocks, and the circuit designers of the memories made their specified interfaces safer. With the change in philosophy to synchronous design, latches fell out of favor and were replaced by flip-flops. Most engineers now put flip-flops at the input of the memory to fix the addresses. This wastes almost a full half-cycle. A correct design would use a latch that is open during the pre-charge period. It is also extremely wasteful to insert flip-flops on the output data of the memory. A correct design would have a latch open during the second half-cycle of the memory, and memory designs should already include this latch internally.

Conceptually, this problem stems from the move to a synchronous design philosophy: the pipeline design engineer has flip-flops in too many places. How can you do an instruction decode and the next instruction address calculation (for jump instructions) all on the one clock edge between two instruction fetches? In reality you have about half a clock cycle to perform these calculations: the margin in the ROM access time for the last instruction fetch, plus the pre-charge time, minus the address set-up time for the next instruction fetch.

IV. BASIC OVERVIEW OF THE RISC DESIGN

The bit widths of each unit are as follows:

Instruction Unit: 24 bits
Execution Unit: 8 bits
Memory Unit: 16 bits

The instruction ROM used for the CISC was an 8Kx8 block repeated 3 times and organized as 24K by 8. The RISC uses the same original 8Kx8 ROM block but organizes it as 8Kx24, since the instruction is 24 bits wide. All instruction fetches are a single-cycle 24-bit access. Both the instruction ROM and any external RAM have 16-bit addressing, as indicated by the 16-bit memory unit width. This allows the RISC to have an 8-bit opcode and a 16-bit address for jump instructions, etc., all in one instruction fetch. This means there are no relative-addressing jump calculations. The address for a jump instruction is always immediate: it does not have to be calculated, only multiplexed with PC+1. ROM and RAM accesses have separate opcodes, so in effect there is 17-bit addressing for external memories. The register file feeding the execution unit has 8-bit addressing, but the MSB for the register file is always 0.
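The jump path described above can be sketched in a few lines of Python. This is an illustrative model only. The field positions (opcode in the top byte, address in the low 16 bits) and the opcode value `OPCODE_JUMP` are assumptions made for the example and are not taken from the decode table in Fig 1.

```python
OPCODE_JUMP = 0x80  # hypothetical opcode value for an unconditional jump


def split_jump_word(word):
    """Split a 24-bit instruction word into (opcode, 16-bit address)."""
    if not 0 <= word < (1 << 24):
        raise ValueError("instruction words are 24 bits wide")
    opcode = (word >> 16) & 0xFF   # assumed: opcode in the top 8 bits
    address = word & 0xFFFF        # assumed: immediate address in the low 16 bits
    return opcode, address


def next_rom_address(pc_plus_1, word):
    """Select the next fetch address with a 2:1 multiplexer.

    There is no adder on the jump path: the target is already present
    in the fetched word, so it is only selected against PC+1.
    """
    opcode, address = split_jump_word(word)
    if opcode == OPCODE_JUMP:
        return address
    return pc_plus_1 & 0xFFFF
```

For example, `next_rom_address(0x0011, 0x801234)` selects the immediate address `0x1234`, while any word with a different opcode falls through to `0x0011`.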
Because the RISC is an embedded controller, the other 128 register addresses (MSB=1) are reserved for external user-defined registers. Within the 24-bit instruction width there are 8 bits for the opcode and two 8-bit addresses for the two operands of execution unit instructions. Immediate commands have an 8-bit opcode, an 8-bit address for the operand destination, and the 8-bit immediate value.

There is an obvious trade-off in having only an 8-bit execution unit. For 16-bit audio calculations two instructions are performed. For example, 16-bit values are added by two instructions: ADD followed by ADC (add with carry). The CISC performed in the same manner. The assembler cross-assembled all of the previous CISC instructions directly to the RISC instruction set, using multiple RISC instructions if required. For example, a DJNZ (decrement and jump if not zero) command is assembled as two instructions: ADD Rnum, #%FF (add negative 1) and JP NZ, label (conditional jump if not zero). Fig 1 documents the RISC instruction set and decode table.

V. THE REGISTER FILE

The operands are read from and written to a 128-byte register file. All registers are general purpose. This register file is double pumped, meaning there are two accesses in the same time period as one ROM access (instruction fetch), and it is a true dual-port memory, meaning both source operands can be read at the same time. Since it is a dual-port memory there are two 8-bit data busses, A and B, that connect the register file to the execution unit. The A bus is used for reading operand A and for writing the result. The B bus is only for reading operand B. Fig 2 shows the memory cell for the register file.

VI. EXECUTION UNIT

The execution unit consists of two operand latches, a barrel shifter, and an ALU.

A. Operand Latches

Two operand busses (A and B) are required for single cycle operation. Both input operands are read from the register file simultaneously.
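To illustrate the simultaneous operand read, together with the ADD followed by ADC sequence from Section IV, here is a minimal behavioral sketch in Python. The class and function names are hypothetical, and writing the result back to the operand A register is an assumption based on the A bus carrying both operand A and the result.

```python
class RegisterFile:
    """128 general-purpose bytes; register addresses have MSB = 0."""

    def __init__(self):
        self.regs = [0] * 128

    def read_pair(self, addr_a, addr_b):
        """Dual-port read: both source operands in the same access."""
        return self.regs[addr_a & 0x7F], self.regs[addr_b & 0x7F]

    def write(self, addr, value):
        self.regs[addr & 0x7F] = value & 0xFF


def add8(rf, dst, src, carry_in=0):
    """8-bit add of register src into register dst; returns the carry out."""
    a, b = rf.read_pair(dst, src)   # operand A and operand B read together
    total = a + b + carry_in
    rf.write(dst, total)            # assumed: result returns to the operand A register
    return total >> 8


def add16(rf, dst_lo, dst_hi, src_lo, src_hi):
    """16-bit add performed as two 8-bit instructions."""
    carry = add8(rf, dst_lo, src_lo)         # ADD
    return add8(rf, dst_hi, src_hi, carry)   # ADC (add with carry)
```

With registers 0 and 1 holding the low and high bytes of 0x12FF and registers 2 and 3 holding 0x0001, `add16(rf, 0, 1, 2, 3)` leaves 0x1300 in registers 0 and 1: the carry out of the low-byte ADD is consumed by the high-byte ADC.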
There are two non-overlapping clocks in the RISC, called CK1 and CK2. The operand latches are loaded from the register file during CK1, and the output from the ALU is stored back into the register file during CK2. Note that the operand data is allowed to ripple through the latches while the data becomes valid. The latches are closed later, before the register file goes into pre-charge, to avoid corruption. Both operand latches are identical and can be independently reset or inverted (reset and invert together is a set). Clearing and inverting the operands are required for some of the operations in both the ALU and the shifter.

B. Barrel Shifter

This shifter can shift right or left, inserting zeroes or ones, or rotate by wrapping around LSB-to-MSB. The trick is all in the layout and in how the operand registers drive it. Logically the shifter is nothing more than an 8-to-1 multiplexer for every bit of the operand. Physically the A bus drives into the top and then the wires shift down one bit to the left for each multiplexer input. The B bus drives into the bottom and then the wires shift up one bit to the right for each multiplexer input. If you drive both operands with the same value then you barrel shift. All eight shift possibilities are available at the inputs of the multiplexers, and the shift amount is determined by selecting one of the eight. By clearing or setting one of the operand registers you insert either leading or trailing 0s or 1s as you shift. Fig 4 shows the shifter wiring.
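One way to reason about this arrangement is as a funnel shifter: the multiplexers select one of eight 8-bit windows from the 16-bit value formed by the two operand latches. The Python sketch below models that behavior. Treating A as the upper byte and B as the lower byte is an assumption made for the example; the actual wiring is the one shown in Fig 4, and the function names are hypothetical.

```python
def funnel_select(a, b, sel):
    """Pick one of eight 8-bit windows from the 16-bit value A:B.

    Models one 8-to-1 multiplexer per output bit; sel (0..7) is the
    common multiplexer select. A as the upper byte is an assumption.
    """
    if not 0 <= sel <= 7:
        raise ValueError("sel must be in 0..7")
    combined = ((a & 0xFF) << 8) | (b & 0xFF)
    return (combined >> sel) & 0xFF


def rotate_right(x, n):
    """Both latches hold the same value, so bits wrap around."""
    return funnel_select(x, x, n % 8)


def shift_right(x, n, fill_ones=False):
    """Latch A is cleared (or set) to insert leading 0s (or 1s)."""
    return funnel_select(0xFF if fill_ones else 0x00, x, n)


def shift_left(x, n, fill_ones=False):
    """Latch B is cleared (or set) to insert trailing 0s (or 1s)."""
    if n == 0:
        return x & 0xFF
    return funnel_select(x, 0xFF if fill_ones else 0x00, 8 - n)
```

In this model a rotate left by n is simply `rotate_right(x, 8 - n)`, so no separate left-rotate hardware is needed. The reset and invert controls on the operand latches supply the cleared or set value.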

C. ALU

The Arithmetic Logic Unit can be described as an LFU (Logical Function Unit) with a generate-propagate static CMOS Manchester carry chain design. An LFU has a programmable truth table, so it can perform any Boolean function of the two input operands. The four control lines, LFU0 to LFU3, specify each bit of the truth table function. Both the carry chain and the zero detect logic are buffered every four bits. To generate the result for the 8-bit ALU, the carry ripples through only four stages in total, each of which is an inverter. This carry chain implementation is fully static. It does not require any pre-charge clocks. Fig 3 shows the schematic for the ALU design.

VII. INSTRUCTION UNIT

The instruction unit consists of the Program Counter (PC), the ROM interface registers, and a hardware stack for subroutines and interrupts.

A. Program Counter Register

This is a 16-bit register. A separate adder, a simplified version of the ALU design, does the count function. This register always holds the current ROM instruction address plus one. This is the value which is pushed onto the stack for subroutine CALLs and for interrupts, so that you RETURN to the next valid instruction. If the present instruction is not a jump, call, branch, etc., then the default is to use the PC register value for the next ROM address.

B. The ROM Interface Registers

This logic has several functions. The next ROM address is multiplexed from the Program Counter, the stack (for Returns), or the immediate value in the instruction word (for Jumps). Reset forces the address to 0, and interrupt addresses can be forced to 1, 2, or 3 (which always contain unconditional jumps). This circuit also does the address manipulation for reading data from the ROM. The data stored in the ROM uses byte addressing, which has to be unpacked from an 8Kx24 or a 16Kx24 configuration.

C. The Hardware Stack

The stack design is a pure synchronous implementation.
It consists of four 16-bit registers which are always loaded every cycle. Each register has an input multiplexer: on a CALL it loads the value of the register above it (push), on a RETURN it loads the value of the register below it (pop), and otherwise its own output is multiplexed back so that it retains its present value.

VIII. MEMORY UNIT

The RISC does not have complicated memory management. However, the memory unit performs the register file address generation and the A-B-C data bus multiplexing, and holds the flip-flops for the instruction word that are used for data unpacking.

IX. OTHER FEATURES

This RISC was designed as an embedded controller for audio applications and has some unique features.

A. Indirect Addressing

Since the execution unit does not have a multiplier, audio compression and decompression are done by table look-up. For this function indirect addressing is very important. Both the register file addressing and the external memory addressing can be done through another set of registers. Indirect register file addressing costs another instruction to set a unique page register. The external memory can be addressed using a register pair from the register file. The logic for the register pair addressing can also be used for indirect addressing on subroutine calls or jump instructions.

B. External Register Addressing

As mentioned before, there are 8 address bits for the register and the register file uses only the lower 128 addresses. The upper 128 addresses are user defined to be specific registers through the integrated circuit design. These external registers interface to the RISC module through the C bus. This means that external registers can be specified as a source or destination in an ADD or other execution unit instruction. Other architectures require loading the values into an internal register first.

C. User Defined Flags

The RISC has the four standard condition flags for Carry, Zero, Sign, and Overflow. However, there are eight flags in total that can be utilized.
The other four flag bits are user defined. For example, one can be a signal such as a FIFO full flag.

D. ROM Data Packing

The instruction width is 24 bits. However, to perform table look-up compression, data must be read from the ROM using byte addressing. The three-byte-wide words are packed with the two LSB data bytes sequentially up to the top of the memory and then back down using the MSB bytes.

E. Hardware Stack

This is a feature which can be a disadvantage. Using a hardware stack simplifies the design and speeds up the execution. The subroutine RETURN in the CISC took 19 cycles while the RETURN in the RISC takes one cycle. However, the design is limited to only four nested calls and interrupts, which is sufficient for the audio design. This can be unnerving to some programmers.
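The four-deep limit is easy to see in a behavioral model of the stack registers. The Python sketch below is illustrative only: the class name, the opcode strings, the choice of index 0 as the top of the stack, and the value loaded into the vacated register on a pop (0 here) are all assumptions that are not specified in this paper.

```python
class HardwareStack:
    """Four 16-bit registers, all reloaded on every clock."""

    DEPTH = 4

    def __init__(self):
        self.regs = [0] * self.DEPTH   # regs[0] is treated as the top of the stack

    def clock(self, op, pc_plus_1=0):
        """Advance one cycle; each register loads through its own multiplexer."""
        old = self.regs
        if op == "CALL":
            # Push: the top takes PC+1, every other register takes its neighbor.
            self.regs = [pc_plus_1 & 0xFFFF] + old[:-1]
        elif op == "RETURN":
            # Pop: every register takes the value of the one behind it.
            self.regs = old[1:] + [0]  # assumed fill value for the vacated register
        else:
            # Hold: each register's own output is fed back.
            self.regs = list(old)

    def top(self):
        """Return address that a RETURN would select as the next ROM address."""
        return self.regs[0]
```

Because every register updates in the same cycle, a RETURN needs no memory access, which is consistent with the one-cycle RETURN described above. In this model a fifth nested CALL silently discards the oldest return address, which is the hazard the programmer has to keep in mind.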

F. Large Register File

The RISC has 128 bytes of general-purpose registers. This was large enough that no external RAM was used for the audio application. The original CISC design did use an external RAM in addition to its internal registers.

X. DESIGN IMPLEMENTATION

A. Engineers

The RISC was designed by Roi Peers and Robert Plachno. Roi was the architect of the PC audio chips. He did the system level design and the software coding. For the RISC he defined the requirements and helped with the architecture. Robert Plachno had designed several processors prior to this RISC, including smaller embedded controllers and a larger 64-bit processor. Robert did the design and simulation of the RISC. Vincent Chueng replaced the CISC with the RISC in the audio chip and fixed the software timing issues.

B. Semi-Custom Design

The RISC was designed in CMOS technology and run at numerous fabrication houses with channel lengths from 0.6µ to 0.35µ. The data paths were designed at the transistor level. Four sections, comprising the data paths for the execution unit, instruction unit, and memory unit, plus the register file, were custom laid out and placed together in a rectangular area. The remaining control logic was routed as standard cells. Figure 5 shows the PC audio chip with the RISC in the top left.

C. CAE Tools

The original design was entered using the ORCAD schematic tool. The simulations were done using Robert Plachno's EESIM. This simulator allows a mixed-mode SPICE and logic netlist. Modules can be simulated at the transistor, gate, or behavioral level. The simulator also reports the 10 worst (or a specified number of) set-up times, hold times, etc., and calculates the power dissipation and test vector coverage. The initial design was simulated on a PC and then moved to a UNIX system. Eventually the schematics were recaptured into Cadence and simulated using Verilog.

D. Initial Debug

The design was initially done as a test chip on a multi-up mask set. This was debugged using test vectors transferred to an IMS tester. Then the CISC was replaced by the RISC in a full PC audio design. The assembler (written by Plachno) automatically cross-assembles the CISC instruction set to the RISC instruction set. However, certain parts of the program were found to be self-timed (a software-emulated serial port), and other problems occurred because the processor performed about 6x faster. I would describe replacing the CISC with the RISC in the design as having moderately few issues.

XI. CONCLUSION

A design for a single cycle RISC processor that does not use pipelining has been discussed. Single cycle operation is obtained by folding the processor execution into the memory cycle. The RISC uses a ripple-through latch style of design as opposed to a synchronous flip-flop style of design. The design improvements include a 6x performance improvement; a wide power supply range for both desktop and notebook applications; no royalty fees; and minimal software impact. This is an architectural change only. No process improvements or clock rate increase were required.

Figure 1. RISC Instruction Set Definition

Figure 2. Register File Dual Port Memory Cell

Figure 3. ALU Design

Figure 4. Barrel Shifter

Figure 5. PC Audio Chip with RISC