Department of Computer Science, Institute for System Architecture, Operating Systems Group
Real-Time Systems '08 / '09
Hardware
Marcus Völp
December 4, 2008

Outlook

Hardware as a source of unpredictability:
- caches
- pipeline
- interrupt latency

Special-purpose hardware for embedded real-time systems:
- low-latency interrupt mode
- peripheral event controller
- capture / compare units
- scratchpad memories
- real-time clocks

Using unpredictable hardware in a more predictable way:
- cache partitioning

Real-time communication / buses are covered in a separate lecture.

Sources of Unpredictability: Memory Subsystem

- TLB
- caches
- store buffers

Figure: data path of the store "store R1, 0x01003025". The virtual address
splits into page number 0x01003 and offset 0x025; the TLB (or, on a miss, the
page-table walker) translates the page number into frame 0x302, giving
physical address 0x302025, which is then looked up in the cache, the L2 cache,
and finally main memory via the memory controller.
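
To make the figure concrete, here is a small C sketch (not from the slides)
that splits the store address into its lookup fields; the page and cache
geometry (4 KiB pages, 32-byte lines, 1024 sets) is an assumption, and the
frame number 0x302 is taken from the figure.

    /* Address decomposition behind "store R1, 0x01003025": the virtual page
     * number goes to the TLB, the resulting physical address is split into
     * tag / set / offset for the cache lookup. */
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12      /* 4 KiB pages (assumed)         */
    #define LINE_SHIFT  5      /* 32-byte cache lines (assumed) */
    #define SET_BITS   10      /* 1024 sets (assumed)           */

    int main(void)
    {
        uint32_t va       = 0x01003025;                    /* address from the figure           */
        uint32_t vpn      = va >> PAGE_SHIFT;              /* 0x01003, looked up in the TLB     */
        uint32_t page_ofs = va & ((1u << PAGE_SHIFT) - 1); /* 0x025                             */

        uint32_t pfn = 0x302;                              /* frame returned by TLB / PT walker */
        uint32_t pa  = (pfn << PAGE_SHIFT) | page_ofs;     /* 0x302025                          */

        uint32_t line_ofs = pa & ((1u << LINE_SHIFT) - 1);
        uint32_t set      = (pa >> LINE_SHIFT) & ((1u << SET_BITS) - 1);
        uint32_t tag      = pa >> (LINE_SHIFT + SET_BITS);

        printf("vpn=0x%05x ofs=0x%03x -> pa=0x%06x (tag=0x%x set=%u line_ofs=%u)\n",
               (unsigned)vpn, (unsigned)page_ofs, (unsigned)pa,
               (unsigned)tag, (unsigned)set, (unsigned)line_ofs);
        return 0;
    }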

Sources of Unpredictability: Processor Pipeline

Figure: the ARM11 microarchitecture pipeline, with two fetch stages, decode
and issue, then parallel execution paths (ALU, a three-stage
multiply-accumulate unit, and a load/store path with address generation and
two data-cache stages), each ending in write-back.

Sources of Unpredictability: Interrupt Latency

Path of an interrupt from the device to its handler:
- the device raises the incoming signal
- the interrupt controller forwards it to the CPU
- CPU: are interrupts enabled? await an instruction boundary
- kernel entry
- select the handler code (index into the interrupt vector table)
- handle_irq: ... return from interrupt
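
The "select the handler code" step is essentially an indexed call through the
interrupt vector table. A minimal C sketch with made-up names (irq_vector,
handle_timer_irq), assuming the interrupt controller has already reported
which IRQ fired:

    /* Handler selection as an index into an interrupt vector table.
     * All names are illustrative, not a real kernel's API. */
    #define NUM_IRQS 32

    typedef void (*irq_handler_t)(unsigned irq);

    static void handle_spurious_irq(unsigned irq) { (void)irq; /* ignore */ }
    static void handle_timer_irq(unsigned irq)    { (void)irq; /* ack timer, wake RT task */ }

    static irq_handler_t irq_vector[NUM_IRQS] = {
        [5] = handle_timer_irq,          /* C99 designated initializer */
    };

    /* Reached after kernel entry, once the CPU hit an instruction boundary. */
    void kernel_irq_entry(unsigned irq)
    {
        if (irq < NUM_IRQS && irq_vector[irq])
            irq_vector[irq](irq);
        else
            handle_spurious_irq(irq);
        /* return from interrupt (iret / rfe / eret, architecture specific) */
    }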

Interrupt Response Time: RTLinux

Interrupt response time: the time from the occurrence of the interrupt to the
first instruction of the RT task.

- no parallel load (idle): 13 µs
- high parallel load (benchmark, cache flooder): 68 µs

Measured on an Intel P4, 1.6 GHz.

Figure: two response-time histograms (log.fl.josephina.rtlinux and
log.t3.josephina.rtlinux: lat_proc_sh_dynamic), log-scale frequency over
0-50 µs.

Interrupt Response Time: L4RTL (+ address-space switch)

- no parallel load (idle): 43 µs
- high parallel load (benchmark, cache flooder): 85 µs

Measured on an Intel P4, 1.6 GHz.

Figure: two response-time histograms (log.fl.josephina.fiasco and
log.t3.josephina.fiasco: lat_proc_sh_dynamic), log-scale frequency over
0-50 µs.

Outlook (recap): next, special-purpose hardware for embedded real-time
systems: low-latency interrupt mode, peripheral event controller, capture /
compare units, scratchpad memories, real-time clocks.

Low-Latency Interrupts

Use cases: sampling data at a high rate (sensors, video, ...), fast response
times (brake control, ...).

ARM IRQ + FIQ:
- FIQ interrupts the IRQ handler
- 5 registers are immediately available to the FIQ handler, so there is no
  need to save register contents

ARM low-interrupt-latency configuration (minimize the worst-case interrupt latency):
- disable hit-under-miss in the cache
- abandon pending restartable memory operations; restart the memory operation
  on return from interrupt
- software must not use multi-word load / store instructions
- software should avoid accesses to slow memory (device memory, strong
  ordering of memory accesses)

Worst-case FIQ latency (PL190 VIC):
- interrupt synchronization: 3 cycles
- worst-case execution time of the current instruction: 7 cycles
- entry to the first handler instruction: 2 cycles
- total: 12 cycles

Low-Latency Interrupts: Peripheral Event Controller (PEC)

Obtaining precise timestamps with interrupt-based sampling is difficult:
- system load increases the jitter of the interrupt response time
- atomic sections in the OS delay interrupts
- jitter in signal delivery from the signal source to the CPU interrupt pin
  (buses, interrupt controller)

The PEC executes one memory-to-memory move, triggered by the arriving signal.

Programmer's interface:
- counter: decremented on each signal; a CPU interrupt is triggered when it reaches 0
- byte / word: selects the width of the transfer
- source / destination increment: whether to advance the source / destination address
- source / destination address

Per signal, the PEC behaves roughly like:

    signal() {
        if (counter--)
            *dest++ = *src;          /* one move per signal     */
        else
            trigger_interrupt();     /* buffer full: notify CPU */
    }

=> minimal response time (~3 cycles)
=> the PEC can execute at much lower clock rates
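
A sketch of how software might arm one PEC channel before a sampling burst.
The register layout (struct pec_channel, the PEC_CTRL_* bits) is hypothetical;
it only mirrors the programmer's interface listed above.

    #include <stdint.h>

    /* Hypothetical register layout of one PEC channel. */
    struct pec_channel {
        volatile uint32_t ctrl;      /* width, address-increment mode, enable       */
        volatile uint32_t counter;   /* moves left; CPU interrupt when it reaches 0 */
        volatile uint32_t src;       /* source address (e.g. an ADC data register)  */
        volatile uint32_t dst;       /* destination address (sample buffer)         */
    };

    #define PEC_CTRL_WORD     (1u << 0)   /* transfer words instead of bytes  */
    #define PEC_CTRL_DST_INC  (1u << 1)   /* advance destination after a move */
    #define PEC_CTRL_ENABLE   (1u << 31)

    void pec_setup(struct pec_channel *ch,
                   uint32_t src_addr, uint32_t dst_addr, uint32_t samples)
    {
        ch->src     = src_addr;      /* fixed source: the device register     */
        ch->dst     = dst_addr;      /* incremented after every move          */
        ch->counter = samples;       /* CPU interrupt after 'samples' signals */
        ch->ctrl    = PEC_CTRL_WORD | PEC_CTRL_DST_INC | PEC_CTRL_ENABLE;
    }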

Capture + Compare Unit

Problem:
- take a precise timestamp of an event (engine sensor triggered at cylinder position, time t)
- generate precise events (fire the engine at t + x)
- interrupt handlers have too high a jitter to meet these points in time

Capture: on the external event, the hardware latches the current clock value
into the capture register; the interrupt handler can read the exact timestamp later.

Compare: the compare register is continuously compared against the clock;
when they are equal, the unit generates the event itself.
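
A hedged sketch of how a driver could use such a unit; CAPTURE, COMPARE and
IGNITION_DELAY are made-up stand-ins for the device's real registers and for
the application's delay x.

    #include <stdint.h>

    extern volatile uint32_t CAPTURE;   /* hardware latches the timer here on the event */
    extern volatile uint32_t COMPARE;   /* hardware fires when the timer equals this    */

    #define IGNITION_DELAY 1234u        /* x, in timer ticks (made-up value) */

    static uint32_t cylinder_ts;        /* timestamp of the last sensor event */

    /* Interrupt handler: it may run late, but the timestamp is still exact
     * and the generated event does not depend on handler jitter at all. */
    void capture_irq_handler(void)
    {
        cylinder_ts = CAPTURE;                      /* t: precise event time    */
        COMPARE     = cylinder_ts + IGNITION_DELAY; /* fire the engine at t + x */
    }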

System on a Chip / Chip Multiprocessor

- the processor chip contains the entire system
- CPU cores / peripheral components / memory are available as masks for the wafer
- configure the system as required

Examples:
- mobile phones today: general-purpose core (e.g., ARM) + baseband DSP processor
- game console (Cell): general-purpose core (PPC) + special-purpose cores (e.g., graphics)
- sensor + 8-bit controller + CAN bus

Scratchpad Memory vs. Caches

Synonyms: scratchpad / on-chip / tightly coupled memory.

Cache:
- transparent addressing scheme
- the cache controller fills / replaces cache lines in parallel to CPU activity
- difficult to predict the worst-case execution time

Figure: memory hierarchy CPU - L1 - L2 - main memory, with typical access
latencies of 1 cycle (L1), ~10 cycles (L2) and ~50 cycles (main memory).

Scratchpad Memory vs. Caches: Cache Lockdown

Figure: a two-way set-associative cache (tag RAM and data RAM per way, one set highlighted).

Cache lockdown:
- lock a cache way
- allocate only to the unlocked ways
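
The lockdown procedure can be sketched as: restrict allocation to the victim
way, touch the time-critical data so it is fetched, then lock that way.
write_lockdown_register() below is a hypothetical stand-in for the
core-specific control-register access (on ARM cores this would go through CP15).

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical: each set bit marks a cache way as locked (not allocatable). */
    extern void write_lockdown_register(uint32_t locked_way_mask);

    #define CACHE_LINE 32

    void lock_into_way(const void *data, size_t size, unsigned way)
    {
        const volatile uint8_t *p = data;

        /* 1. Forbid allocation in every way except 'way'. */
        write_lockdown_register(~(1u << way));

        /* 2. Touch every cache line of the data so it is fetched into 'way'. */
        for (size_t i = 0; i < size; i += CACHE_LINE)
            (void)p[i];

        /* 3. Lock 'way' itself and unlock the rest: later allocations can no
         *    longer evict the preloaded data. */
        write_lockdown_register(1u << way);
    }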

Scratchpad Memory vs. Caches: Scratchpad Memory

- memory can be addressed directly (the device is mapped into the physical memory space)
- 2 cycles latency / currently ~64 KB (larger sizes, ~4 MB, to come)
- data must be relocated explicitly

Compiler-based approaches:
- static: determine code and data hot spots and allocate scratchpad memory for them
- dynamic: copy data to / from scratchpad memory, e.g.:

    /* original */
    int a[20];
    for (int j = 0; j < 40; j++)
        for (int i = 5; i < 10; i++)
            a[i] *= 42;

    /* transformed: b is allocated in scratchpad memory */
    int a[20];
    int b[5] = {a[5], a[6], a[7], a[8], a[9]};   /* copy hot data in */
    for (int j = 0; j < 40; j++)
        for (int i = 0; i < 5; i++)
            b[i] *= 42;
    /* afterwards, copy b back to a[5..9] */

Scratchpad Memory vs. Caches: ARM Tightly Coupled Memory (TCM)

Figure: a region of normal RAM overlaid by TCM.

- overlay normal RAM with TCM
- simplified cache logic for the TCM keeps track of whether the RAM or the TCM holds the current value
- on a miss, the data is copied from the underlying RAM into the TCM

Real-Time Clocks

A SoC contains multiple clocks, for example:
- core cycle counter (200 MHz): fine-grained and fast to access, but accuracy
  and clock skew are problematic and it is subject to power management
- audio codec clock
- PCI clock (33 MHz / 66 MHz)
- timer
- RTC (32.768 kHz): high precision, small clock skew

Clocks need to be synchronized within the core and system-wide (separate lecture).
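
A small sketch of why the two clock sources are used differently.
read_cycle_counter() and read_rtc() are placeholders for the platform-specific
accesses; the point is that cycle counts are only meaningful together with the
current core frequency, which power management may change between readings.

    #include <stdint.h>

    extern uint64_t read_cycle_counter(void);   /* fast, fine-grained, per core */
    extern uint32_t read_rtc(void);             /* slow to access, 32.768 kHz   */

    /* Valid only if the core ran at 'core_hz' for the whole measurement;
     * power management may invalidate this assumption. */
    uint64_t cycles_to_ns(uint64_t cycles, uint64_t core_hz)
    {
        return cycles * 1000000000ull / core_hz;
    }

    /* The RTC is slower to read, but its rate does not change. */
    uint64_t rtc_ticks_to_ns(uint32_t ticks)
    {
        return (uint64_t)ticks * 1000000000ull / 32768;
    }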

Outlook (recap): finally, using unpredictable hardware in a more predictable
way: cache partitioning. Real-time communication / buses are covered in a
separate lecture.

OS-Controlled Cache Predictability

Goals:
- partition the cache to minimize interference between RT applications
- OS-controlled and transparent to the application
- no need to rely on cooperating applications; survives malicious or erroneous applications

Cache partition: the address regions that map to a given set of cache lines.
Tool support: compiler / linker.

OS-Controlled Cache Predictability: Cache Coloring

16-bit example address (binary): 1001 0011 | 001 | 0 0100
                                     tag     idx   offset into the 32-byte cache line

The lowest 5 bits select the byte within the 32-byte cache line, the index
bits select the set, and the remaining upper bits form the tag (0x93 here).
The figure draws the cache as consecutive 32-byte lines at offsets +0, +32,
+64, ..., +224; with this split the example address maps to the line at +32.

OS-Controlled Cache Predictability: Cache Coloring and Address Translation

Address translation and caches (L2 cache: 4 MB, 4-way, 32-byte cache lines):

    virtual address:  0x40003 | 864   (virtual page number | page offset)
                         |
                        MMU
                         v
    physical address: 0x30a   | 864   (physical frame number | page offset)

For the cache lookup, the physical address is split into tag | idx | ofs.
Some of the set-index bits lie above the page offset, so they are both
subject to address translation and evaluated to determine the cache set.

=> assign different colors to different RT and non-RT applications
=> in the OS, allocate memory frames of the respective color
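
A sketch (not from the slides) of how the OS can compute a frame's color from
exactly these bits, using the slide's cache parameters; the 4 KiB page size is
an assumption, and the resulting number of colors simply follows from the
cache and page geometry.

    /* Cache-coloring sketch: the color of a physical frame is formed by the
     * set-index bits above the page offset, i.e. the bits the OS controls
     * through frame allocation. */
    #include <stdint.h>

    #define PAGE_SHIFT   12                          /* 4 KiB pages (assumed) */
    #define LINE_SHIFT    5                          /* 32-byte cache lines   */
    #define CACHE_SIZE   (4u << 20)                  /* 4 MB                  */
    #define WAYS          4
    #define SETS         (CACHE_SIZE / WAYS / (1u << LINE_SHIFT))
    #define COLORS       (SETS >> (PAGE_SHIFT - LINE_SHIFT))

    static unsigned frame_color(uint64_t phys_addr)
    {
        return (unsigned)((phys_addr >> PAGE_SHIFT) & (COLORS - 1));
    }

    /* The OS allocator hands an RT task only frames of "its" colors, so a
     * background task allocated from other colors can never evict the RT
     * task's cache lines. */
    int frame_belongs_to_partition(uint64_t phys_addr, unsigned partition_color)
    {
        return frame_color(phys_addr) == partition_color;
    }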

OS-Controlled Cache Predictability: Scenario

- a high-priority real-time task (color 00) runs in frequent, short intervals
- other tasks run in the background (e.g., a cache flooder)

Example: filter.

Worst case without cache partitioning:
- background tasks may evict all cache lines of the RT task
- background tasks may load conflicting cache lines, which need a write-back

Figure: timeline of the RT task's activations; without partitioning its
working set is effectively flushed from the cache between activations.

OS-Controlled Cache Predictability: Filter Results

Figure: filter benchmark on a Pentium Pro, 133 MHz; relative execution time
(100% to 400%) for the unpartitioned, single-writer-partitioned and
partitioned cache over a parameter ranging from 4 to 4096.

(single writer = only the RT application may write to pages of color x;
others may only read)

OS-Controlled Cache Predictability: Example 2

A low-priority task that is frequently interrupted, e.g., a matrix multiplication.

Figure: timeline in which the matrix multiplication loses its cache content
(is effectively flushed) at every timeslice boundary.

64 x 64 matrix multiplication: 64^3 x 5.51 cycles, about 10.9 ms at 133 MHz.

OS-Controlled Cache Predictability: Matrix Multiplication Results

Figure: 64 x 64 matrix multiplication on a Pentium, 133 MHz; relative
execution time over the timeslice length for the unpartitioned,
single-writer-partitioned and partitioned cache, with panels for timeslices
of 50-450 µs and 300-1024 µs; overheads range from 100% up to about 1700%.

OS-Controlled Cache Predictability: Matrix Multiplication Results (Write-Through)

Figure: the same 64 x 64 matrix multiplication on a Pentium, 133 MHz with a
write-through cache; relative execution time over the timeslice length for
the unpartitioned and partitioned cache, with panels for timeslices of
50-450 µs and 300-1024 µs; overheads range from 100% up to about 500%.

OS-Controlled Cache Predictability: Experiment 3

Combination: matrix multiplication + filter.

Figure: execution time of the 64 x 64 matrix multiplication in ms (left) and
relative cost of MM + filter from 100% to 280% (right) over the filter's
interrupt frequency (20 kHz, 10 kHz, 5 kHz, 3.3 kHz, 2 kHz), for the
unpartitioned and the partitioned cache.

OS-Controlled Cache Predictability: Caveat

Application transparency: some applications require knowledge of physical
addresses; e.g., drivers need physical addresses for direct memory access
(DMA) transfers.

Options:
- modify the driver
- use recent hardware with address translation for DMA (e.g., Intel VT-d)

Conclusions

- high-performance CPUs are rarely used in embedded systems
- hardware provides means to increase predictability
- software tweaks allow unpredictable hardware to be used in a more predictable way

References:
- Liedtke, Härtig, Hohmuth: OS-Controlled Cache Predictability for Real-Time Systems (RTAS '97)
- ARM RealView Emulation Board User Guide
- ARM1176JZF-S Technical Reference Manual