The Tiny CPU Generator (TCG)

Similar documents
are Softw Instruction Set Architecture Microarchitecture are rdw

Digital System Design Using Verilog. - Processing Unit Design

Computer Science and Engineering 331. Midterm Examination #1. Fall Name: Solutions S.S.#:

COS 140: Foundations of Computer Science

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Chapter 13 Reduced Instruction Set Computers

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

Chapter 2A Instructions: Language of the Computer

Lecture 3: The Processor (Chapter 4 of textbook) Chapter 4.1

Chapter 4. The Processor. Computer Architecture and IC Design Lab

Module 5: "MIPS R10000: A Case Study" Lecture 9: "MIPS R10000: A Case Study" MIPS R A case study in modern microarchitecture.

Instruction Set Architectures Part I: From C to MIPS. Readings:

ECS 154B Computer Architecture II Spring 2009

Chapter 2. Computer Abstractions and Technology. Lesson 4: MIPS (cont )

ECE331: Hardware Organization and Design

Pipelining! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar DEIB! 30 November, 2017!

Chapter 2. Instructions: Language of the Computer. Adapted by Paulo Lopes

Instructions: MIPS ISA. Chapter 2 Instructions: Language of the Computer 1

Job Posting (Aug. 19) ECE 425. ARM7 Block Diagram. ARM Programming. Assembly Language Programming. ARM Architecture 9/7/2017. Microprocessor Systems

Parallel logic circuits

CS 3330 Exam 3 Fall 2017 Computing ID:

CSE 141 Computer Architecture Summer Session Lecture 3 ALU Part 2 Single Cycle CPU Part 1. Pramod V. Argade

CS31001 COMPUTER ORGANIZATION AND ARCHITECTURE. Debdeep Mukhopadhyay, CSE, IIT Kharagpur. Instructions and Addressing

COSC 6385 Computer Architecture - Pipelining

Hi Hsiao-Lung Chan, Ph.D. Dept Electrical Engineering Chang Gung University, Taiwan

Chapter 3 : Control Unit

Computer Architecture 2/26/01 Lecture #

COMPSCI 313 S Computer Organization. 7 MIPS Instruction Set

CS150 Fall 2012 Solutions to Homework 6

The register set differs from one computer architecture to another. It is usually a combination of general-purpose and special purpose registers

Computer Organization: A Programmer's Perspective

CS 110 Computer Architecture. Pipelining. Guest Lecture: Shu Yin. School of Information Science and Technology SIST

Question Bank Microprocessor and Microcontroller

COMPUTER ORGANIZATION AND DESIGN

Pipelining. Each step does a small fraction of the job All steps ideally operate concurrently

inst.eecs.berkeley.edu/~cs61c CS61C : Machine Structures Lecture 18 CPU Design: The Single-Cycle I ! Nasty new windows vulnerability!

LECTURE 3: THE PROCESSOR

Computer Organization CS 206 T Lec# 2: Instruction Sets

CS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 2, 2016

Chapter 3. Z80 Instructions & Assembly Language. Von Neumann Architecture. Memory. instructions. program. data

Pipeline Overview. Dr. Jiang Li. Adapted from the slides provided by the authors. Jiang Li, Ph.D. Department of Computer Science

Computer Architecture

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

The Processor (1) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

ECE 313 Computer Organization FINAL EXAM December 11, Multicycle Processor Design 30 Points

6.823 Computer System Architecture Datapath for DLX Problem Set #2

Instruction Set Architecture (Contd)

Hardwired Control (4) Micro-programmed Control Ch 17. Micro-programmed Control (3) Machine Instructions vs. Micro-instructions

Instruction Sets: Characteristics and Functions Addressing Modes

Micro-programmed Control Ch 15

Machine Instructions vs. Micro-instructions. Micro-programmed Control Ch 15. Machine Instructions vs. Micro-instructions (2) Hardwired Control (4)

Micro-programmed Control Ch 15

Basic Processing Unit: Some Fundamental Concepts, Execution of a. Complete Instruction, Multiple Bus Organization, Hard-wired Control,

CPE300: Digital System Architecture and Design

Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Instruction Pipelining

Math 230 Assembly Programming (AKA Computer Organization) Spring 2008

Ch 5: Designing a Single Cycle Datapath

Module 5 - CPU Design

Course Administration

MIPS R-format Instructions. Representing Instructions. Hexadecimal. R-format Example. MIPS I-format Example. MIPS I-format Instructions

Instruction Pipelining

Processor (I) - datapath & control. Hwansoo Han

What is Pipelining? Time per instruction on unpipelined machine Number of pipe stages

CS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 3, 2015

CSE140: Components and Design Techniques for Digital Systems

The CPU and Memory. How does a computer work? How does a computer interact with data? How are instructions performed? Recall schematic diagram:

SISTEMI EMBEDDED. Computer Organization Pipelining. Federico Baronti Last version:

We briefly explain an instruction cycle now, before proceeding with the details of addressing modes.

Lecture 7 Pipelining. Peng Liu.

Static, multiple-issue (superscaler) pipelines

Input. Output. Datapath: System for performing operations on data, plus memory access. Control: Control the datapath in response to instructions.

CAD for VLSI 2 Pro ject - Superscalar Processor Implementation

CENG3420 Lecture 03 Review

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

Computer Organization. Structure of a Computer. Registers. Register Transfer. Register Files. Memories

UNIT-II. Part-2: CENTRAL PROCESSING UNIT

CSCI 402: Computer Architectures. Instructions: Language of the Computer (3) Fengguang Song Department of Computer & Information Science IUPUI.

Processor Architecture

Computer Architecture Prof. Mainak Chaudhuri Department of Computer Science & Engineering Indian Institute of Technology, Kanpur

EE 3170 Microcontroller Applications

Register Machines. Connecting evaluators to low level machine code

Computer Architecture V Fall Practice Exam Questions

CS 265. Computer Architecture. Wei Lu, Ph.D., P.Eng.

UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering. EEC180B DIGITAL SYSTEMS II Fall 1999

Processor Architecture V! Wrap-Up!

Control Hazards - branching causes problems since the pipeline can be filled with the wrong instructions.

Lecture1: introduction. Outline: History overview Central processing unite Register set Special purpose address registers Datapath Control unit

CSCE 5610: Computer Architecture

Micro-programmed Control Ch 17

CSE 378 Final Exam 3/14/11 Sample Solution

Design Problem 5 Solution

There are different characteristics for exceptions. They are as follows:

What is Pipelining? RISC remainder (our assumptions)

1 /10 2 /16 3 /18 4 /15 5 /20 6 /9 7 /12

Data Hazards Compiler Scheduling Pipeline scheduling or instruction scheduling: Compiler generates code to eliminate hazard

Pipeline Hazards. Midterm #2 on 11/29 5th and final problem set on 11/22 9th and final lab on 12/1.

The Single Cycle Processor

CSIS1120A. 10. Instruction Set & Addressing Mode. CSIS1120A 10. Instruction Set & Addressing Mode 1

OUTLINE. STM32F0 Architecture Overview STM32F0 Core Motivation for RISC and Pipelining Cortex-M0 Programming Model Toolchain and Project Structure

Transcription:

The Tiny CPU Generator (TCG) Yosi Ben-Asher Computer Science Dept. Haifa University, Haifa, Israel yosi@cs.haifa.ac.il The instructions set The machine: The machine is specified by several pipelines where each pipe is activated by a different instruction forming a VIW type of instructions. Each instruction is n bit long where n is determine by the coding of k pipeline stages. For each pipeline, there is a separate memory bank that hold the instructions destined for this pipe. Thus, the instruction of a VIW α β γ are stored in three separate memories and are fetched in parallel in one clock cycle. There is a single data memory bank which is a multi-port memory bank. Maximal number of pipelines (pipes) and parallel instructions is given by some constant. n can be freely selected though it should not increase 56 bits. Note that the more bits we use to code instruction the Decode operation is simplified and the clock cycle improves. The computational instructions are described by a directed graph of operations and arguments (as described in figure. The problem of selecting suitable coding of such directed graphs is left open. We use -bits general registers that are used for computations and for passing parameters to functions. Each pipeline can access all 6 registers. A special register PC points to the next instruction that should be fetched. Each register is backed-up by a small memory bank that acts as a stack for this register. Thus, when a function is called each register is automatically saved on its stack and automatically restored when the function returns. We assume that the recursion depth is limited to some constant. There is no SP and stack area to pass parameters and hold local variables. Parameters are passed via registers, though it is possible to use the main memory to pass arrays, however this requires to use explicit load/store operations to do that. You should also synthesize these memory banks as a separate memory banks.

Each instruction must update one register (even if its a store operation) that is written back to the register file at the last stage of the pipe. A mechanism to stall instructions in the pipe if a value they need on a current stage will be ready at a later stage is usually needed. However, we assume that the compiler computes the amount of stalls and has inserted NOPs between instructions when needed. There is no need for interrupt handlers since we assume that the programs is already loaded to the pipes instruction memory banks. A special reset signal is set to one on the first clock cycle so that initialization can occur. There are four types of instructions as depicted in figure : NOP- is an all zero instruction. STOP- a sequence of 0 NOPs halts the computation. Compute instructions- We assume that we have a pipeline with k stages where each stage can be either:,,, /, <,...- an arithmetic operation followed by possibly shift-left, shift right, negation, zero operations Sl/Sr/N, Z. E - Any possible arithmetic operation. C - Any comparison operation. B - Any boolean operation. M - oad or store operation. If there are more than one M stages than we assume that we have several parallel ports allowing the pipe to perform multiple accesses to the memory at the same time. For example, a pipe M M M M allows four parallel memory references resulting from four instructions crossing the pipeline. The compute instruction contains the coding of the circuit representation of the operations and their arguments (as depicted in figure ). Note that each argument of a pipe unit can be either a register or an internal stage. The number of bits in an instruction n depends on the amount of bits needed for this coding. We assume that the maximal number of pipeline units k per pipe is a known constant. Move instructions- Constants and memory addresses are passed to a register via move instructions. The op field of the move instructions determines an operation on the target register, including: setting R = const, addition R = const,... Branch instructions- Multiple instructions in the same VIW are not allowed, and instructions are assumed to reside only in the instruction memory of the first pipe. In case of a instruction, the rest of the VIW instruction should contain either a NOP or an extension of the (explained later). Branch instructions are divided two the following types:

Function call/ret- A function call is a instruction that saves the return address, and the caller registers in the individual register-stacks. There is a special register (not part of the 6 general set) that is used to save the return address. Function call advances the PC to the beginning of the function being called (the address part of the instruction). The return address is saved in the special register mentioned above (recall that this special register is also saved on its designated stack). The return-instruction restores the registers of the caller (except the register used to return a function value) and set the PC to the correct value. Recall that saving and restoring the register should be done in one parallel step in the IF-stage. Note that parameters are basically passed via registers only, though it is possible to use the main-memory and a SP register for that purpose too. This consumes one coding option from the op field of the instructions. oop instructions- The CPU supports only hardware loops, namely loops that are executed by loading a counter with the requested number of iterations and then executing it. It is possible to reset or modify the loop counter during execution but not to terminate it at the middle of an iteration. Though this complicates programming it saves the need to evict the pipeline when a instruction is executed. Note that when loop-end is encountered, determining the next value of the PC can be done immediately since it depends only on the current value of the loop s counter. Thus, the next instruction that is fetched is determined at the instruction-fetch (IF) stage which is the first stage of the pipe and there is no need to flash out instructions that have been wrongly fetched. We have two instructions:loops counter register, loop end address to start the loop and loope counter register, begining address to determine the loop-end. oops instruction must check that the register counter is not zero and if so jumps after the loop s end address (otherwize the loop s counter is automaticaly decremented). Note that it is possible to use nested loops. Conditional Jumps- are executed in the first stage of the pipe and therefore can affect the next instruction address (PC) without any need to flash instructions from the pipes. If the jump is evaluated to true the jump is a fall-through otherwise we set the PC to P C = P C offset. Jumps contain a logical expression (R i rop R j ) lop (R l rop R k ) i < j < l < k that is used to evaluate the jump s condition where rop = {<, ==, >, } and lop = {AND, OR}. The reg exp coding contains: An eight bits segment where the position of a bit set to one indicates the next register i, j, k, l participating in the reg exp. A two bit segment indicates which half of the 6-registers i, j, k, l are referring too. A six bit segment for coding the three operators rop, lop of the reg exp. The offset can be either a register or a constant number (depending on the first bit). Since the VIW instruction holds only one jump instruction at the first pipe, the rest of the VIW space can be used to increase the reg exp of the first pipe to have an AND or OR of the other pipes reg exps. (the exact formating of this option is determined by you).

0 n compute R * * R= t (R * t), t= *(t= R*R); 0 n R R types of pipeline units E= { none,,, *} C= { none, /, <, =} M= {none,, S,} B = { none, and, or, not} compute R * R= (*t) (R * (t= *R)); R 0 move 0 0 4 op in/out =, reg call ret loops loope 4 4 reg reg 0 4 reg_exp jump and R<R ^ R>R4 0 4 reg_exp jump or R<R ^ R>R4 7 constant address n 7 n constant address 7 n constant address 0 offset 0 offset n n Figure : Instruction s format. Schematics of the design Figure depicts the general structure of the tiny cpu. This CPU has two pipes E M E and M E M. Each latch has multiple occurrences working as a shift registers so that older values can be used at a later stage. Memories can be made to work in one clock cycle, or better in two clock cycles including: placing the address and obtaining the value. The main memory is a multiport memory supporting parallel access of the M units of the pipes. Figure depicts the usage of a reconfigurable crossbar to implement the the different pipeline shapes and variable instructions of the tiny-cpu: It is recommended to use tri-state busses to implement the crossbar, however using muxes is also possible (but slower). The values in the latches connecting the pipeline stages can be used as inputs by several units simultaneously. Older values of the latches can be used for future stages. We assume that the connect/disconnect operation on the crossbar s busses occures during the positive edge of the clock while the registers/latches are updated at the negative edge of the clock. Thus at the same clock cyclevalues are loaded from the registers and latches and the output of the diffrent stages is also strored back in the registes and latches. 4

Ins bank PC Ins bank IF E reg-stack REG MEM M M E E M Figure : General schematic of the cpu. Registers are updated every clock cycle, thus if a register should maintain its current value the switches should connect its output to it sinput. Figure depicts a corssbar that implements two pipelines E M E and M E M with three registers R, R, R. Figure illustrates how the crossbar implements the parallel execution of two instructions R = R (R R) and R = ( R = R R). The different stages of the parallel execution of the two instructions are illustrated as follows:. the R R of R = R (R R) and the R of R = ( R = R R) are performed by setting all the swiches marked (outputs are stored in latches / respectively).. the (RR) of R = R (RR) and the R R of R = ( R = R R) are performed by setting all the swiches marked (outputs are stored in latches /4 respectively).. the R (R R) of R = R (R R) and the R = R R of R = ( R = R R) are performed by setting all the swiches marked (outputs are stored in registers R/R respectively). An alternative to the use of the crossbar is that the state between pipeline stages includes.. Each pipe has a seperate set of leteches and no information is propagated between pipeline stages. Registers are updated only at the writeback stage which is common to all pipes.. Input registers are obtained sperately at the right stage of the pipe and use the current value of the register present in the register file. 4. R = RRR*R7; [R4 = R*R R5 = RR]; 5

R ->R S R atch 4 atch atch atch reg R reg R ->R R IF reg R E M E M E M Figure : Using crossbar techniques to implement the Tiny CPU. Since R4 = R R is executed at the same pipeline as R = R R R R7 then it will observe the right updated value of Rs. However, R5 = R R is executed in another pipe wherein the value of R is not updated. 5. R = RR; /* RR is ready after the firs stage */ [R4 = R*RR4*R6 R5 = R4*R4*RR]; Example Following is the assembly code for the part of transitive closure calculation given in figure 4. et s assume that the variable n is preloaded in the Data Memory Bank address 99. The input connectivity matrix is preloaded starting at address 00, row by row (i.e. M[p][r] data is at the address 00 p n r). The following is pseudo-assembly using commands available in the Tiny CPU. Commands arguments are described in our project definitions. This code is not optimal, it just depicts the assembly you would preload into the Instruction Memory banks. Comments are given given after each instruction while parallel instructions are separated by (assuming that we have the suitable pipelines in the TCG. MOVE R6 = 99 R5 = 00 /* Moving the address of n and the start addr of M into registers*/ COMPUTE R0 = (*R6) R = (*R6) R = (*R6) /* oading the value of n into three counters registers */ COMPUTE R6 = (*R6) /* oading the value of n into the same register */ 6

ABE0: OOPS R0, ABE /* Start of the i-loop, loop counter is R0 */ ABE: OOPS R, ABE /* Start of the j-loop, loop counter is R */ ABE4: OOPS R, ABE5 /* Start of the j-loop, loop counter is R */ COMPUTE R = *(R5 (R6-R0)*R6 (R6-R)) R4 = *(R5 (R6-R)*R6 (R6-R)) /* Parallel computation of M[i][k] and M[k][j] values */ MOVE R5 = 0 R6 = 0 /* Moving 0-constants into regs (preparation for Cond Jump) */ CONDJUMP (R > R5) AND (R4 > R6), offset= /* Conditional jump. If not true - jump offset= lines forward */ MOVE R5 = COMPUTE R6 = ((*((R6-R0)*R6 (R6-R)) = R5)) /* Store the value into the address of M[i][j], and to register R6 */ * Decrement loop counters in parallel */ ABE 5: OOPE R, ABE4 ABE : OOPE R, ABE ABE : OOPE R0, ABE0 /* oop ends */ for(i=0;i<n;i) for(j=0;j<n;j) for(k=0;k<n;k) if(m[i][k] > 0 && M[k][j] > 0 ) M[i][j]= Figure 4: Example: part of a transitive closure computation. 4 Tasks. Create a code example in assembly for some simple algorithm that demonstrates: Demonstrates the main features of loops and function calls. Exploits the use of multiple and complex pipelines. Each team should have a different example, so pls. get my confirmation before you select an example.. Write a C/C program that simulates the TCG and in particular its crossbar mechanism. This program must run on the example implemented in the previous item. This task should be completed until 5. or there will be a panelty of 4 points in the final grade. 7