Statically Calculating Secondary Thread Performance in ASTI Systems
1 Statically Calculating Secondary Thread Performance in ASTI Systems Siddhartha Shivshankar, Sunil Vangara and Alex Guimarães Dean Center for Embedded Systems Research Department of Electrical and Computer Engineering North Carolina State University 1
2 Overview ASTI: Asynchronous Software Thread Integration Register File Partitioning Experiments Conclusions 2
3 Basic Idea of ASTI Goal: recover fine-grain idle time for use by other threads. Examine the program to find a function f with significant internal idle time. Idle time is imposed by instruction-level timing requirements (e.g. for input and output instructions). If an idle-time piece n is coarse-grain (T_Idle(f,n) >> 2*T_ContextSwitch), then we can recover it efficiently with context switching. If it is fine-grain (T_Idle(f,n) not >> 2*T_ContextSwitch), then apply ASTI (Asynchronous Software Thread Integration). Details of ASTI in LCTES 2004, CASES
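The coarse/fine decision above can be sketched as a small C helper. The factor of 4 standing in for "much greater than" is an assumption (the slide only writes T_Idle >> 2*T_ContextSwitch), and all cycle counts are illustrative:

```c
#include <assert.h>

typedef enum { RECOVER_CONTEXT_SWITCH, RECOVER_ASTI } recovery_t;

/* Classify one idle-time fragment of t_idle cycles, given a one-way
 * context-switch cost of t_cs cycles.  A switch into and back out of the
 * secondary thread costs 2*t_cs; the 4x margin approximating "much
 * greater than" is an assumption, not the paper's threshold. */
static recovery_t classify_idle(unsigned t_idle, unsigned t_cs)
{
    if (t_idle >= 4u * (2u * t_cs))  /* coarse grain: switch overhead is noise */
        return RECOVER_CONTEXT_SWITCH;
    return RECOVER_ASTI;             /* fine grain: integrate instead */
}
```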
4 ASTI Applied to Communication Protocols. Primary thread: the executive calls ReceiveMessage (prepare message buffer; check for errors, save bit, update CRC), which makes subroutine calls to ReceiveBit (read bit from bus 3 times and vote; sample bus for resynchronization), with idle time before each return. With a plain coroutine secondary thread, idle time too short for a cocall is wasted. The integrated secondary thread needs only the first and last coroutine calls, so it recovers even idle time too short for a cocall. 4
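The triple-sample-and-vote inside ReceiveBit amounts to a two-out-of-three majority. A minimal sketch, with plain ints standing in for the real bus-read instruction:

```c
#include <assert.h>

/* Two-out-of-three majority vote over three bus samples (0 or 1), as the
 * ReceiveBit routine reads the bit from the bus 3 times and votes. */
static int majority3(int a, int b, int c)
{
    return (a & b) | (b & c) | (a & c);  /* 1 iff at least two samples are 1 */
}
```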
5 Protocol Controller Options. Each option pairs a system MCU with analog & digital I/O, a physical-layer transceiver and the communication network: (1) discrete protocol controller with MCU (plus optional I/O expander); (2) MCU with on-board protocol controller; (3) generic MCU with ASTI software protocol controller (plus I/O expander). 5
6 But what about Caches? Deep instruction pipelines? Branch prediction? Superscalar instruction execution? Speculative execution? The reorder buffer? Page faults? Forwarding paths? Load queues? Data prefetching? Predicated execution? Branch delay slots? Instruction prefetching? Store forwarding? R-ops? Dynamic optimization? The phase of the moon? Wind direction? Et cetera, et cetera. 6
7 Register File Partitioning Single register file must support primary and secondary threads Three ways to use a register For primary thread exclusively For secondary thread exclusively Shared between the two, swapped on coroutine calls Register file may not be homogeneous Pointer/address registers Immediate-operand capable registers... so need to pick best partition for each type. How? 7
8 Primary and Secondary Thread Performance Impact of register file partitioning. More registers for the primary thread: less spilling and filling -> primary code takes fewer cycles -> more idle time -> more cycles for the secondary thread; but fewer registers for the secondary thread -> more spilling and filling -> the secondary thread requires more cycles, and response time rises. More registers for the secondary thread: similar case, reversed. More registers swapped: both threads require fewer cycles to execute, but the coroutine call takes longer -> more cycles wasted switching between threads -> the coroutine call no longer fits into the shorter idle-time fragments, reducing cycles available for the secondary thread. How do we find the best register file partitioning? Too complex to compute everything analytically. Instead, compile and analyze iteratively to perform design space exploration. 8
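The iterative exploration can be sketched as an exhaustive loop over candidate partitions with a toy cost model standing in for a real compile-and-measure run. Every name and coefficient here is an illustrative assumption, not the paper's actual model:

```c
#include <assert.h>

#define NREGS 10  /* registers of one class; count is illustrative */

typedef struct { int prim, sec, swap; } partition_t;

/* Toy stand-in for one compile-and-analyze run: fewer registers left to
 * the secondary thread means more spill cycles, and each swapped register
 * adds coroutine-call overhead.  Coefficients are made up. */
static int secondary_cycles(partition_t p)
{
    int spill  = (NREGS - p.sec - p.swap) * 40; /* spill/fill in secondary */
    int cocall = p.swap * 12;                   /* per-register swap cost  */
    return 1000 + spill + cocall;
}

/* A partition is feasible only if each thread keeps some registers. */
static int feasible(partition_t p)
{
    return p.prim >= 2 && p.sec >= 2;
}

/* Enumerate all (primary, secondary, swapped) splits of one register
 * class and keep the feasible one with the fewest secondary cycles. */
static partition_t best_partition(void)
{
    partition_t best = {0, 0, 0};
    int best_cost = -1;
    for (int prim = 0; prim <= NREGS; prim++)
        for (int swap = 0; prim + swap <= NREGS; swap++) {
            partition_t p = { prim, NREGS - prim - swap, swap };
            if (!feasible(p))
                continue;
            int c = secondary_cycles(p);
            if (best_cost < 0 || c < best_cost) { best_cost = c; best = p; }
        }
    return best;
}
```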
9 Thread Integrating Compiler Back-End: Thrint. Pipeline: foo.s -> control-flow analysis -> data-flow analysis -> static timing analysis (with GProf) -> integration analysis -> integration -> foo.int.s and foo.id (visualized with XVCG and GnuPlot). We have enhanced Thrint to integrate threads using ASTI methods (it previously handled only STI) and to measure best- and worst-case performance for the secondary thread. 9
10 Iterative Partition Analysis Toolchain. Register file partitioning decisions feed both threads. Primary thread (s_m.c, s_b.c) -> gcc -> Thrint: ICTA -> T_SegmentIdle. Secondary thread (r_m.c, r_b.c) -> gcc -> Thrint -> T_Sec (original performance of the secondary thread) and T_Sec-Seg-Part (performance of the segmented, partitioned secondary thread). Performance comparison: slowdown vs. dedicated MCU. 10
11 Experiments. Atmel AVR: 8-bit load/store architecture for microcontrollers; register file (32) = pointer + immediate (6), immediate (10), other (16). Protocol controllers in C: CAN at 62.5 kbps, MIL-STD-1553 at 1 Mbps. Secondary threads in C: network-RS232 bridge, PID controller. Compiled with AVR-GCC, -O3. Testbed diagram: a bus bridge MCU running the ASTI software (primary thread: J1850; message queues; secondary thread: interface) connected via digital I/O and UART to the system MCU. 11
12 Performance Evaluation Measure slowdown of integrated secondary thread (worst-case execution path) with partitioned register file, compared with original full-register file performance Need to evaluate and schedule for worst-case to ensure system always meets its deadlines How much performance do we give up by partitioning the register file? Not all partitions are schedulable Not enough time for coroutine call Not enough time for primary thread to meet its I/O instruction deadlines 12
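The slowdown metric above is just the ratio of partitioned to original worst-case cycle counts, minus one. A minimal sketch, with illustrative cycle counts (1015 vs. 1000 cycles gives the 1.5% figure that appears in the best-case results):

```c
#include <assert.h>

/* Slowdown of the integrated, partitioned secondary thread relative to
 * its original full-register-file worst case: t_part / t_orig - 1. */
static double slowdown(unsigned long t_part, unsigned long t_orig)
{
    return (double)t_part / (double)t_orig - 1.0;
}
```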
13 Results I: Average Performance. [Chart: average slowdown (0% to 60%) of the Bridge (Host Interface) and PID Controller secondary threads for the 1553Send, 1553Receive, CANSend and CANReceive primaries.] Average performance for all feasible partitioning approaches. 13
14 Results II: Best Performance. [Chart: best-case slowdown (-0.5% to 2.0%) of the Bridge (Host Interface) and PID Controller for 1553Send, 1553Receive, CANSend and CANReceive.] Find the best (least slowdown) of all feasible partitionings. The AVR register file is adequate to handle register pressure for both threads, or the idle time is adequate for coroutine calls. 14
15 Detailed Analysis Example. Primary: 1553 send; secondary: PID controller; immediate registers: the secondary is sensitive to the number of immediate registers. Primary: CAN send; secondary: RS232-CAN bridge; other registers: the cocall must be brief for schedulability. Best is 1.5% slowdown: (10,6) to (14,2) with no swapped registers. 15
16 Conclusions and Future Work. Performance varies significantly for the AVR architecture: the average case is bad, but the best case is close to the non-partitioned register file. Future work: derive and evaluate heuristics to search efficiently through the partitioning design space; replace the coroutine call with a dispatcher to support multiple secondary threads. 16
17 Questions? Have you applied this to SPEC? No, that's not representative of embedded software-implemented communication protocols. Don't caches break the timing predictability you need? The processors we use run at under 50 MHz, so we don't have a memory wall. Why not use a multithreaded processor? They're too expensive, too rare, and businesses prefer familiar processors. Why not just design an ASIC to do it? Too expensive to get the first one. Why not program an FPGA to do it? Too expensive to get the rest of them. 17
18 Appendices 18
19 Why Network Communication Protocol Controllers? Multiple threads must be able to make progress, even with a fully-loaded bus, and idle time is very fine grain (under one bit time). Each application domain customizes its protocols: wireless sensor networks tweak the medium access control, etc. for minimal energy use; automotive optimizes for guaranteed (hard real-time) delivery. Chicken-and-egg problem: a protocol controller chip won't appear until an adequate market is anticipated, and chip costs remain high until volumes amortize design costs, so there is a delay until the protocol controller appears as a peripheral on cheap MCUs. MCUs are a good fit for many embedded protocols, if concurrency is cheap: 10 to 200 cycles of processing needed per bit at 1 kbps to 1 Mbps bus speed. Temporally predictable MCUs are cheap and flexible: 1 MHz for $…, … MHz for $5-$10 (but you pay in increased energy use and other issues). 19
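The cycles-per-bit figure above is just the clock frequency divided by the bus bit rate. A quick sketch; the 8 MHz AVR clock is an illustrative assumption, while 62.5 kbps is the CAN rate used in the experiments:

```c
#include <assert.h>

/* Cycles of processing available per bit = f_clk / bit rate.
 * E.g. an 8 MHz MCU (assumed) on a 62.5 kbps CAN bus has 128 cycles
 * per bit, comfortably inside the 10-200 cycles the slides cite. */
static unsigned long cycles_per_bit(unsigned long f_clk_hz, unsigned long bps)
{
    return f_clk_hz / bps;
}
```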
20 Assumptions - Small Embedded Systems. Processors: not practical to design a custom processor; not practical to use a fast processor (e.g. raise clock speed by 10x or more); can handle some code explosion (e.g. up to 3x); using a generic microcontroller (e.g. 4, 8 or 16 bit) without memory protection, virtual memory or caches. [Chart: MCU market share by word size: 8-bit 57%, 16-bit 12%, 4-bit 12%, DSP 11%, 32-bit 8%.] Workload: at most a few threads need to make asynchronous progress, others can wait; one hard-real-time thread with tight deadlines; other threads may have deadlines which are significantly longer; interrupts are delayed or handled with polling servers (CASES 2003); subroutine calls are cloned or inlined (CASES 2004). 20
21 Control-Flow View of ASTI. Idle time appears repeatedly along the primary function's control flow. Break the secondary thread into segments lasting approximately the total idle time; integrate the intervening primary code into each segment; insert coroutine calls at the start of idle time and at the end of each segment. 21
22 Big Picture How do we efficiently allocate N concurrent (potentially real-time) tasks onto fewer than N processors? Compilation and scheduling for concurrent/parallel/distributed systems Real-time systems Hardware/software cosynthesis Bottlenecks Scheduling each context switch Performing each context switch We focus on 1 processor, and that processor is generic (low-cost) with no special features for accelerating context switch bottlenecks Note: threads must be able to make independent (asynchronous) progress 22
23 Steps in STI: Source Code Preparation Structure program (C) to accumulate work to perform in integrated functions Write functions (C) to be integrated Compile to assembly code, partitioning register file for functions to be integrated (-ffixed) 23
24 Thread Representation: Control Dependence Graph. The CDG's hierarchical structure (procedure, code, conditional and loop nodes) simplifies integration: vertical = conditional nesting, horizontal = execution order, with summary information at each level. Our Thrint back-end compiler operates on the CDGs of host and guest threads; it annotates the host with execution time predictions and moves guest code into the host, enforcing control/data/time dependencies (find a gap, or else descend into a subgraph). We have code transformations to handle conditionals & loops. 24
25 Thrint Overview. Front end: parse asm; form CFG/CDG; read integration directives; static timing analysis; node labelling. STI path: pad timing jitter (pad excess timing jitter); plan integration; for each guest, for each host: do host loop transformations, pad excess timing jitter, clone and insert guest node(s), and if a fused loop, add fused-loop control test code; delete original guests. ASTI path: pad timing jitter in the message-level function and in the bit-level function; plan integration in the secondary thread; pad jitter in predicate nodes and blocking I/O loops; integrate cocalls within the secondary thread; integrate intervening guest code at appropriate locations; delete original guests. Shared analyses and back end: idle time analysis, temporal determinism analysis, data-flow analysis, register virtualization, integration, register reallocation, static timing analysis, timing verification, code regeneration. 25
26 Steps in STI: Analysis and Integration Planning. Parse assembly code to form the CFG and then the CDG. Perform tree-based static timing analysis. Pad away timing variations from conditionals with nops or nop loops (example). Perform basic data-flow analysis to identify loop-control variables and possibly iteration counts. Compare the duration of primary functions with the maximum allowed latency for ISRs and other short-laxity tasks; create polling servers to handle these as needed. Compare the duration of secondary functions with the amount of idle time in primary functions, considering the minimum period for the primary function. Break long secondary functions into segments which fit into the primary function's idle time minus polling servers minus two context-switch times; also end segments when reaching a loop with an unknown iteration count. Define target times for regions in primary code which are time-critical. 26
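The segment-sizing rule above can be sketched directly: the usable budget is the idle time minus polling-server time minus two context switches. All cycle counts are illustrative assumptions:

```c
#include <assert.h>

/* Cycle budget for one secondary-thread segment, per the planning step:
 * idle time minus polling-server time minus two context-switch times.
 * Returns 0 when the idle fragment is unusable. */
static unsigned segment_budget(unsigned t_idle, unsigned t_poll, unsigned t_cs)
{
    unsigned overhead = t_poll + 2u * t_cs;
    return (t_idle > overhead) ? (t_idle - overhead) : 0u;
}
```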
27 Steps in STI: Integration. Note: conditionals have been padded away previously. Single primary events: move primary code to execute at the proper times within secondary code; replicate primary code into conditionals; split and peel loops and insert primary code; guard primary code within a loop to trigger on a given iteration. Looping primary events: peel off primary-function loop iterations which don't overlap with secondary loops and integrate them as single primary events; fuse loop iterations which do overlap (fuse the loop control tests); unroll the loop to match idle time in the primary loop to work in the secondary loop; create clean-up loops to perform the remaining iterations. Redo static timing analysis and verify correct timing. Recreate the assembly file. Compile, link, download and run! 27
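The "guard primary code within a loop" transformation above can be sketched in C: the primary event is cloned into the loop body but fires only on one iteration. The helper names, the counters and the firing iteration are all illustrative, not the paper's code:

```c
#include <assert.h>

/* Call counters let us check the transformation's behavior below. */
static int sec_calls, pri_calls;
static void secondary_work(int i) { (void)i; sec_calls++; }
static void primary_event(void)   { pri_calls++; }

/* Fused loop: secondary work runs every iteration; the guarded primary
 * event is triggered exactly once, on iteration fire_iter. */
static void fused_loop(int n, int fire_iter)
{
    for (int i = 0; i < n; i++) {
        secondary_work(i);
        if (i == fire_iter)       /* guard: primary event on one iteration */
            primary_event();
    }
}
```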
28 Protocol Software Structure. protocol_executive() dispatches among send, idle and receive; send_message() calls send_bit(), and receive_message() calls receive_bit(). Most idle time is located in these bit-level functions. 28
29 What about Interrupts? What about Frequent Primaries and Long Secondaries? Interrupts: STI disables interrupts while integrated threads run (STIGLitz: interrupts disabled for one field of video ( ms)). Frequent primaries and long secondaries: the primary thread needs to run again before the integrated version would finish. Solutions: use polling servers to service each non-deferrable thread (e.g. UART); break the secondary into segments and integrate the primary in multiple times. Timeline: the laxity for the secondary thread (maximum latency allowed) must cover the worst-case execution time of the integrated thread; each period contributes the minimum primary-thread period minus the maximum primary-thread work toward the secondary thread's WCET. 29
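A simplified reading of the timeline above is a per-period schedulability check: the integrated segment's WCET must fit in the budget one primary period leaves over, i.e. the minimum primary-thread period minus the maximum primary-thread work. This sketch and its numbers are illustrative assumptions:

```c
#include <assert.h>

/* Does one integrated segment fit in the idle budget of one primary
 * period?  budget = min primary period - max primary work (clamped at 0). */
static int segment_fits(unsigned wcet_integrated,
                        unsigned min_primary_period,
                        unsigned max_primary_work)
{
    unsigned budget = (min_primary_period > max_primary_work)
                          ? (min_primary_period - max_primary_work)
                          : 0u;
    return wcet_integrated <= budget;
}
```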
30 Detail: Register File Partitioning vs. Performance. Problem: STI requires that integrated threads share the register file. Trade-off: code compiled to fit into fewer registers switches contexts faster (the dispatcher switches contexts roughly every 900 cycles; two context switches for one register take 12 cycles), but runs slower (more variables must remain in memory). Goal: squeeze pre-integrated threads into as few registers as practical. Method: determine the sensitivity of the host threads' execution time to the number of registers available. Divide the AVR registers into three classes: pointer registers (r26-r31), immediate-operand capable registers (r16-r25), other registers (r0-r15). Analyze the DrawSprite, DrawLine and DrawCircle functions. Limit the registers available to the register allocator through gcc's -ffixed option. Measure execution time using an on-chip timer/counter. 30
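The per-register switching cost above yields a simple linear cost model for a pair of context switches: 12 cycles per swapped register (push and pop in each direction, per the slide's figure). Any fixed dispatcher overhead is omitted here as an unknown:

```c
#include <assert.h>

/* Cycles spent on one pair of context switches as a function of the
 * number of swapped registers, using the slide's 12 cycles per register
 * for two switches.  Fixed dispatcher overhead is deliberately omitted. */
static unsigned cocall_pair_cycles(unsigned swapped_regs)
{
    return swapped_regs * 12u;
}
```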
31 Measurements Results. DrawLine and DrawCircle are not very sensitive; DrawSprite is very sensitive, with a strange speed-up when excluding one pointer register. Design decisions: for DrawLine and DrawCircle, exclude eight "other" registers and two pointer registers (use 22 registers; each context switch: 132 cycles); for DrawSprite, exclude only one "other" register and two pointer registers (use 29 registers; each context switch: 174 cycles). [Charts: normalized run time vs. total registers excluded for DrawCircle, DrawLine and DrawSprite, broken out by register class (immediate, pointer, other).] 31
32 To Do Remove intervening code from primary code - animate 32
More informationHomework 5. Start date: March 24 Due date: 11:59PM on April 10, Monday night. CSCI 402: Computer Architectures
Homework 5 Start date: March 24 Due date: 11:59PM on April 10, Monday night 4.1.1, 4.1.2 4.3 4.8.1, 4.8.2 4.9.1-4.9.4 4.13.1 4.16.1, 4.16.2 1 CSCI 402: Computer Architectures The Processor (4) Fengguang
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationComputer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics
Computer and Hardware Architecture I Benny Thörnberg Associate Professor in Electronics Hardware architecture Computer architecture The functionality of a modern computer is so complex that no human can
More informationMemory Management. Dr. Yingwu Zhu
Memory Management Dr. Yingwu Zhu Big picture Main memory is a resource A process/thread is being executing, the instructions & data must be in memory Assumption: Main memory is infinite Allocation of memory
More informationBeyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy
EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery
More informationCS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07
CS311 Lecture: Pipelining, Superscalar, and VLIW Architectures revised 10/18/07 Objectives ---------- 1. To introduce the basic concept of CPU speedup 2. To explain how data and branch hazards arise as
More informationInstruction Level Parallelism (ILP)
1 / 26 Instruction Level Parallelism (ILP) ILP: The simultaneous execution of multiple instructions from a program. While pipelining is a form of ILP, the general application of ILP goes much further into
More informationRapidly Developing Embedded Systems Using Configurable Processors
Class 413 Rapidly Developing Embedded Systems Using Configurable Processors Steven Knapp (sknapp@triscend.com) (Booth 160) Triscend Corporation www.triscend.com Copyright 1998-99, Triscend Corporation.
More informationPipelining and Vector Processing
Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationHardware-Based Speculation
Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register
More informationLecture 13 - VLIW Machines and Statically Scheduled ILP
CS 152 Computer Architecture and Engineering Lecture 13 - VLIW Machines and Statically Scheduled ILP John Wawrzynek Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~johnw
More informationSimultaneous Multithreading: a Platform for Next Generation Processors
Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt
More informationCS 614 COMPUTER ARCHITECTURE II FALL 2005
CS 614 COMPUTER ARCHITECTURE II FALL 2005 DUE : November 9, 2005 HOMEWORK III READ : - Portions of Chapters 5, 6, 7, 8, 9 and 14 of the Sima book and - Portions of Chapters 3, 4, Appendix A and Appendix
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationLecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )
Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target
More informationCOMPUTER ORGANIZATION AND DESI
COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler
More informationLecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)
Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized
More informationLecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.
Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are
More informationLecture 14: Multithreading
CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw
More informationHardware/Software Co-design
Hardware/Software Co-design Zebo Peng, Department of Computer and Information Science (IDA) Linköping University Course page: http://www.ida.liu.se/~petel/codesign/ 1 of 52 Lecture 1/2: Outline : an Introduction
More informationEXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu
Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points
More informationLecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.
Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 1 Consider the following LSQ and when operands are
More informationDepartment of Computer Science, Institute for System Architecture, Operating Systems Group. Real-Time Systems '08 / '09. Hardware.
Department of Computer Science, Institute for System Architecture, Operating Systems Group Real-Time Systems '08 / '09 Hardware Marcus Völp Outlook Hardware is Source of Unpredictability Caches Pipeline
More informationStatic vs. Dynamic Scheduling
Static vs. Dynamic Scheduling Dynamic Scheduling Fast Requires complex hardware More power consumption May result in a slower clock Static Scheduling Done in S/W (compiler) Maybe not as fast Simpler processor
More informationExploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.
More informationControl Hazards. Branch Prediction
Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional
More informationUnder the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world.
Under the Compiler's Hood: Supercharge Your PLAYSTATION 3 (PS3 ) Code. Understanding your compiler is the key to success in the gaming world. Supercharge your PS3 game code Part 1: Compiler internals.
More informationI/O Handling. ECE 650 Systems Programming & Engineering Duke University, Spring Based on Operating Systems Concepts, Silberschatz Chapter 13
I/O Handling ECE 650 Systems Programming & Engineering Duke University, Spring 2018 Based on Operating Systems Concepts, Silberschatz Chapter 13 Input/Output (I/O) Typical application flow consists of
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of
More informationAdvanced processor designs
Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The
More informationExploitation of instruction level parallelism
Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering
More informationHandout 2 ILP: Part B
Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP
More informationMemory: Overview. CS439: Principles of Computer Systems February 26, 2018
Memory: Overview CS439: Principles of Computer Systems February 26, 2018 Where We Are In the Course Just finished: Processes & Threads CPU Scheduling Synchronization Next: Memory Management Virtual Memory
More informationAnalyzing Real-Time Systems
Analyzing Real-Time Systems Reference: Burns and Wellings, Real-Time Systems and Programming Languages 17-654/17-754: Analysis of Software Artifacts Jonathan Aldrich Real-Time Systems Definition Any system
More informationIntroduction to Embedded Systems
Stefan Kowalewski, 4. November 25 Introduction to Embedded Systems Part 2: Microcontrollers. Basics 2. Structure/elements 3. Digital I/O 4. Interrupts 5. Timers/Counters Introduction to Embedded Systems
More informationCSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable)
CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) Past & Present Have looked at two constraints: Mutual exclusion constraint between two events is a requirement that
More information