Stack Machines. Towards Scalable Stack Based Parallelism. 1 of 53. Tutorial Organizer: Dr Chris Crispin-Bailey


1 of 53 Stack Machines: Towards Scalable Stack-Based Parallelism. Tutorial Organizer: Dr Chris Crispin-Bailey, Department of Computer Science, University of York

2 of 53 Today's Speakers: Dr Mark Shannon, Dr Huibin Shi

3 of 53 Stack Machines: Towards Scalable Stack-Based Parallelism. Department of Computer Science, University of York

4 of 53 Tutorial Plan 1. Exploding the Myths (Chris Bailey) 2. Compilers & Code Optimization (Mark Shannon) << Coffee Break >> 3. Measuring Parallelism (Huibin Shi) 4. Parallel Architectures (Chris Bailey)

5 of 53 Core Motivation for this Tutorial: Can stack machines exploit parallelism? Can stack machines offer anything useful for future architectures? First, let's try to define the problem of performance in suitable terms.

6 of 53 Performance: A Complex Equation

7 of 53 Performance: A Complex Equation

8 of 53 Performance: A Complex Equation. Sounds great? But how much of this is well understood in the context of stack machines? Well, we think we know about this bit, so there is a lot to be done!

9 of 53 But First... A Little History. Where did stack machines come from? What happened next?

10 of 53 Stack Machines in a Nutshell. 1920: POLISH NOTATION was invented by Jan Łukasiewicz. 1957: C. Hamblin invents Reverse Polish Notation. The first commercial stack machines? The KDF9 (1964) and the Burroughs machines (B5000 etc., 1961 onward).
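Reverse Polish Notation is what makes stack execution so direct: operands are pushed, and each operator consumes the top of stack. A minimal sketch in Python (the token format and function name are illustrative, not from the tutorial):

```python
# Evaluating a Reverse Polish (postfix) expression with an explicit stack.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def eval_rpn(tokens):
    stack = []
    for tok in tokens:
        if tok in OPS:
            b = stack.pop()          # right operand is on top of stack
            a = stack.pop()
            stack.append(OPS[tok](a, b))
        else:
            stack.append(int(tok))   # operands are simply pushed
    return stack.pop()

print(eval_rpn("3 4 + 5 *".split()))  # (3 + 4) * 5 = 35
```

The infix expression (3 + 4) * 5 becomes 3 4 + 5 * and needs no parentheses or precedence rules, which is exactly why the notation maps so cleanly onto stack hardware.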

11 of 53 KDF9, English Electric Company. First used in 1964, and still in use in 1980!

12 of 53 1970s: integrated circuits overtook the stack machine's advantage of simplicity. 1980s: the RISC revolution cemented register dominance. 1990s: the INMOS Transputer stack microprocessor. 2000s: Java and Java processors, limited impact. 2010s: the CISC approach in trouble; can stack machines come back?

13 of 53 What If Development of Stack Machines Had Continued? 1920 1960 1970 1990 2020?

14 of 53 What If Development of Stack Machines Had Continued? 1920 1960 1970 1990 2020? What should have been done: address the stack machine's limitations.

15 of 53 What If Development of Stack Machines Had Continued? 1920 1960 1970 1990 2020? What should have been done: address the stack machine's limitations; develop advanced code generation comparable to register machines.

16 of 53 What If Development of Stack Machines Had Continued? 1920 1960 1970 1990 2020? What should have been done: address the stack machine's limitations; develop advanced code generation comparable to register machines; invent methods to exploit general parallelism.

17 of 53 Back to the Tutorial Plan 1. Exploding the Myths (Chris Bailey) 2. Code Optimization (Mark Shannon) << Coffee Break >> 3. Measuring Parallelism (Huibin Shi) 4. Parallel Architectures (Chris Bailey)

18 of 53 Part-1 Stack Machines Exploding the Myths

19 of 53 Stack Machines: Exploding the Myths i. The case against stack machines

20 of 53 Stack Street / Register Ave. The 1960s: a turning point for stacks. IBM 360 systems designers chose register-based systems, citing key points that influenced their decision. After 50 years we still hear the same evidence, increasingly mis-quoted. Let's start by examining the folklore.

21 of 53 Stack Machines: Myth and Misconception. Stack machines, some common responses: "Didn't they stop using those in the 60s?" "Nobody is looking at this area, so it must be a waste of time." "Everyone knows stack machines don't do parallelism." "They were shown to be inferior a long time ago." However, where is the research to back this up? In truth there is very little evidence to write off stack machines; after 50 years we still rely upon assumption.

22 of 53 Stack Machines: Myth and Misconception. What P&H say about stack machines: 1. Performance derives from fast registers, and not the way they are used. 2. Stack organization requires many swap and copy operations. 3. The bottom of the stack (when placed in main memory) is slow. (Patterson and Hennessy, Computer Architecture: A Quantitative Approach.) This is a highly influential textbook which students rely upon heavily, but is it right?

23 of 53 Stack Machines: Myth and Misconception. Papers quoted: Amdahl et al. 1964; Bell et al. 1970; Hauck and Dent 1968; Myers 1978. (Patterson and Hennessy, Computer Architecture: A Quantitative Approach.) A little out of date, perhaps?

24 of 53 Stack Machines: Myth and Misconception. Major changes since 1970: Intel invented the microprocessor! We stopped using core store for memory. Nearly all modern programming languages were invented. Register allocation, virtual register sets, superscalar issue, cache, burst DRAM, multilevel memory, 1-billion-transistor chips. Multicore, scaling, the power-density wall, Moore's law, etc.

25 of 53 Stack Machines: Myth and Misconception. So at the very least we should take an objective look at this key research before allowing our minds to be made up for us. Were these claims true in 1970? Are they still true now?

26 of 53 Stack Machines: Myth and Misconception. 1. Performance derives from fast registers, and not the way they are used. What Amdahl et al. actually said, as quoted by P&H: "the performance advantage of stack processors comes from fast registers, not how they are used" (Amdahl 1964). This doesn't seem to support any idea of stack machines being inferior; indeed, Amdahl highlights the advantage of stack machines offering fast registers, an issue that is still critical in today's CPU designs.

27 of 53 Stack Machines: Myth and Misconception. 2. Stack organization requires many swap and copy operations. In the paper (Amdahl 1964, as quoted by P&H) it says that about 50% of operands appear at the top of stack when required. This is true if one relies upon naïve compiler output with no relevant optimization; a register-based machine would have the same problem without register allocation and graph coloring. But if stack code is inefficient, then the solution is to optimize it, not to abandon the architecture as a useful design. (Part 2 will expand on this; much recent work has been done.)
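To make the point concrete, here is a toy simulation (hypothetical mnemonics, not any real stack ISA) of computing (a + b) * (a + b): naïve code re-evaluates the common subexpression with extra memory loads, while a single DUP keeps it on the stack.

```python
# Toy stack-machine interpreter that also counts memory accesses.
def run(program, memory):
    stack, mem_accesses = [], 0
    for op, *arg in program:
        if op == "LOAD":                 # fetch a variable from memory
            stack.append(memory[arg[0]]); mem_accesses += 1
        elif op == "DUP":                # duplicate top of stack, no memory
            stack.append(stack[-1])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop(); stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop(); stack.append(a * b)
    return stack.pop(), mem_accesses

# Naive code generation evaluates (a + b) twice:
naive = [("LOAD", "a"), ("LOAD", "b"), ("ADD",),
         ("LOAD", "a"), ("LOAD", "b"), ("ADD",), ("MUL",)]
# Optimized code keeps the subexpression on the stack with DUP:
optimized = [("LOAD", "a"), ("LOAD", "b"), ("ADD",), ("DUP",), ("MUL",)]

mem = {"a": 2, "b": 3}
print(run(naive, mem))      # (25, 4): four memory reads
print(run(optimized, mem))  # (25, 2): half the memory traffic
```

Both programs compute the same result; the difference is purely in how many operands had to leave the stack, which is the inefficiency the optimization research addresses.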

28 of 53 Stack Machines: Myth and Misconception. 3. The bottom of the stack (when placed in main memory) is slow (Bell 1970). This is true, but so is the following: a register file is very slow if half of it is placed in main memory. But we do not have to put anything in main memory! On-chip stack buffers were perfectly possible in 1960, and on-chip cache is a more recent innovation. So the problem is a false proposition in terms of today's hardware systems, and was probably avoidable even 40 years ago.

29 of 53 Stack Machines: Myth and Misconception. ✗ 1. Performance derives from fast registers, and not the way they are used. ✗ 2. Stack organization requires many swap and copy operations. ✗ 3. The bottom of the stack (when placed in main memory) is slow. The case against stack machines is no longer convincing.

30 of 53 Stack Machines: Myth and Misconception. THE EVIDENCE IS IN THE RESEARCH. New code optimization techniques: Koopman, Bailey, Shannon, Maierhofer. Stack buffering and cache hierarchy: many examples, many alternative technologies. Instruction sets that optimise operand management: Bailey, orthogonal stack management.

31 of 53 Stack Machines: Myth and Misconception. Conclusion: perceptions of stack machine technology are severely outdated. We have been diverted from more interesting questions: Can they fully exploit parallelism? Do they make multicore easier? Can they address power-density or wire-scaling issues? Can they approach register file performance, gate for gate?

32 of 53 Stack Machines: Myth and Misconception. Conclusion: but to get closer to answering these problems, we need to show that 1. stack machines can effectively manage stack spill traffic; 2. stack machines can optimize code to eliminate poor efficiency; 3. stack machines can exploit useful degrees of parallelism. This will be shown to be feasible in the tutorial.

33 of 53 Stack Machines: Stack Buffering Methods

34 of 53 Stack Buffering Schemes. "The stack has a bottom, and it is inefficient when placed in memory." Making this statement today ignores fundamental computer architecture, as well as more advanced techniques. It is useful to understand the possible methods by which the problem can be addressed.

35 of 53 Stack Buffering: the problem as it was presented. [Diagram: CPU core with its stack spilling to main memory.] Bell perceived the problem in terms of 1960s computer architecture: we had a register bank on chip and a slow main memory, so where else could the stack go? There are lots of options; some were perfectly feasible even in the 1960s.

36 of 53 Stack Buffering: some of the options. 1. Simplest solution: don't spill the stack to memory at all; the stack has finite depth (INMOS Transputer, Intel 4004 program stack). This has no memory problem.

37 of 53 Stack Buffering: some of the options. 2. Just ignore the problem: always spill the stack to memory. Very inefficient, as Bell observed (1960s stack machines). This is the worst possible implementation, and not a very fair comparison!

38 of 53 Stack Buffering: some of the options. Location, location! Stack spill traffic has a high degree of locality: pushes and pops tend to cancel each other, so spill traffic repeatedly references the same memory addresses (i.e. it is often redundant).
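The cancellation effect can be seen in a tiny trace model (illustrative only, assuming a memory-resident stack where every push writes the next word and every pop reads it back):

```python
# Toy model: the whole stack lives in memory, so each push is a memory
# write at the stack pointer and each pop is a memory read just below it.
def memory_traffic(ops, sp=0):
    trace = []
    for op in ops:
        if op == "push":
            trace.append(("write", sp))
            sp += 1
        else:  # "pop"
            sp -= 1
            trace.append(("read", sp))
    return trace

# An interleaved push/pop sequence hits address 1 four times in a row:
print(memory_traffic(["push", "push", "pop", "push", "pop", "pop"]))
```

A buffer holding just the top few words would absorb all of that repeated traffic to the same address, which is exactly the argument for the dedicated buffers discussed on the later slides.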

39 of 53 Stack Buffering: some of the options. 3. Cache: will eliminate redundant accesses; caching will absorb most spills, though it may reduce cache bandwidth. There are many variations: write-through, write-back, line-based read/write policies. We have had these components for decades!

40 of 53 Stack Buffering: some of the options. The way in which we implement cache for stack spilling is open to many permutations. [Diagram: single shared cache — CPU core, external cache, main memory.]

41 of 53 Stack Buffering: some of the options. [Diagrams: single shared cache versus multi-level cache (adding an on-chip cache between the CPU core and the external cache).]

42 of 53 Stack Buffering: some of the options. [Diagrams: single shared cache, multi-level cache, and split cache schemes (a dedicated on-chip stack buffer alongside the cache).]

43 of 53 Stack Buffering: some of the options. But even standard cache is a poor solution; we can use simpler buffer schemes to gain benefit with far fewer gates/transistors. [Diagram: CPU core with a dedicated stack buffer alongside the on-chip cache.]

44 of 53 Stack Buffering: Dedicated Buffers. Many buffer schemes exist, much more compact than equivalent cache: less silicon area, lower power consumption.

45 of 53 Stack Buffering: Dedicated Buffers. Many buffer schemes exist, much more compact than equivalent cache: less silicon area, lower power consumption. Typically a very sharp fall-off curve. [Graph: spill traffic (%) against buffer capacity (0 to 8 elements), falling steeply.]

46 of 53 Stack Buffering: Dedicated Buffers. [Diagram: spill/fill path between the on-chip buffer and memory.] Far from every stack operation creating a spill, these events are rare even with a small hysteresis buffer. This can be implemented with a simple pointer, something Bell could have utilized even with the limited 1960s hardware.

47 of 53 Stack Buffering: Dedicated Buffers. Buffer schemes can be very simple: a single pointer, a double pointer, valid and dirty bits. Hugely simplified by comparison to an addressable memory area or a cache. Even small buffers (4 or 8 elements) give large reductions in stack spill traffic. [Graph: spill traffic (%) against buffer capacity (0 to 8 elements).]
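The sharp fall-off curve is easy to reproduce with a sketch simulation (all parameters assumed for illustration: a spill-one-word-on-overflow / fill-one-word-on-underflow buffer policy, driven by a random balanced push/pop stream):

```python
import random

# Count memory traffic for a top-of-stack buffer of a given capacity:
# a word goes to memory only when the buffer overflows on a push, and
# comes back only when the buffer underflows on a pop (hysteresis).
def spill_traffic(ops, capacity):
    in_buffer, in_memory, traffic = 0, 0, 0
    for op in ops:
        if op == "push":
            if in_buffer == capacity:        # overflow: spill one word
                in_memory += 1; traffic += 1; in_buffer -= 1
            in_buffer += 1
        elif in_buffer > 0:                  # pop served from the buffer
            in_buffer -= 1
        else:                                # underflow: fill one word
            in_memory -= 1; traffic += 1
    return traffic

# A balanced random walk of pushes and pops (never popping an empty stack).
random.seed(0)
ops, depth = [], 0
for _ in range(10_000):
    op = "push" if depth == 0 or random.random() < 0.5 else "pop"
    depth += 1 if op == "push" else -1
    ops.append(op)

for cap in (0, 1, 2, 4, 8):
    print(cap, spill_traffic(ops, cap))      # traffic falls off sharply
```

With the buffer disabled (capacity 0) every single operation touches memory; even a handful of buffered elements absorbs the great majority of the traffic, which is the shape of the curve on the slide.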

48 of 53 Stack Buffering: Dedicated Buffers. Why so effective? Locality of reference: stack spills and fills are tightly coupled, so buffers soak up repeated reads and writes to the same location. This is a valuable point for superscalar stack machines (see later). The downside is that buffers can increase task-switch latency. Alternatives include traditional cache or memory queues; these approaches have reduced context-switching latency, but they also introduce non-determinism and warm-up effects. [Graph: spill traffic (%) against buffer capacity.]

49 of 53 Conclusions? "Stack machines are inefficient when the stack bottom is in memory." The claim is correct, but the scenario is contrived. The answer is: do not put the stack in memory! Use a cache, multiple caches, and/or a buffer; this is standard in most CPUs today. Any future comparison of stack vs. register MUST reflect this simple solution to be taken seriously.

50 of 53 Pause For Reflection. 1. Performance derives from fast registers, and not the way they are used. 2. Stack organization requires many swap and copy operations. 3. The bottom of the stack (when placed in main memory) is slow.

51 of 53 Pause For Reflection. 1. Performance derives from fast registers, and not the way they are used. 2. Stack organization requires many swap and copy operations. We will address problem 2 next, and come back to problem 1 later on.

52 of 53 End of Part 1 Questions?

53 of 53 Tutorial Plan 1. Exploding the Myths (Chris Bailey) 2. Code Optimization (Mark Shannon) << Coffee Break >> 3. Measuring Parallelism (Huibin Shi) 4. Parallel Architectures (Chris Bailey)