Database Systems and Modern CPU Architecture


Database Systems and Modern CPU Architecture Prof. Dr. Torsten Grust Winter Term 2006/07

Hard Disk → RAM

Administrativa: Lecture hours (@ MI HS 2): Monday, 09:15–10:00; Tuesday, 14:15–15:45. No lectures on Nov 20–21, 2006. Tutorial/Lab (Jens Teubner, @ MW 1450): Thursday, 10:15–11:45.

Administrativa: Course homepage: http://www-db.in.tum.de/cms/teaching/ws0607/mmdbms Contact: Torsten Grust grust@in.tum.de, Jens Teubner jens.teubner@in.tum.de. Rooms: 02.11.044, 02.11.042 (drop in if the doors are open).

Course Prerequisites: These courses will be helpful in following the course but are not strictly (or even formally) required: 1. IN0004: Einführung in die Technische Informatik (CPU architecture, assembly, memory hierarchy) 2. IN0008: Grundlagen: Datenbanken (query processing, buffer management, index structures)

Assembly Language: Here and there we will analyze snippets of (mostly MIPS-style) assembly language programs:
LD   R1,0(R2)    ; Regs[R1] ← M[Regs[R2]+0]
DSUB R4,R1,R5    ; Regs[R4] ← Regs[R1] − Regs[R5]
AND  R6,R1,R7    ; Regs[R6] ← Regs[R1] & Regs[R7]
ORI  R8,R1,255   ; Regs[R8] ← Regs[R1] | 255
We will also look at Intel IA-32 and Itanium (IA-64).

Reading Material: The CPU architecture and memory hierarchy aspects of this course are largely covered by: Computer Architecture: A Quantitative Approach, 3rd ed., John L. Hennessy, David A. Patterson, Morgan Kaufmann, 2003 (Chapters 1–5, Appendix A).

Reading Material: Aspects of database technology are mainly discussed in a number of research papers. References will be given here; download the papers from the course homepage. (They help to appreciate the details but are not necessary to pass the exam.)

Tutorials & Assignments: Tutorial sessions will try to be as hands-on as possible: - MonetDB - Mini programming exercises (language: C) - Trying out CPU performance and event counting, etc. We will hand out weekly assignments. There will be no grading, but Jens will develop and discuss solutions with you.

Examination (Klausur): Thursday, Feb 8, 2007, 10:15–11:45 @ MW 1450. No formal requirements to take the exam (although it is highly advisable to actively work on the assignments).

Hard Disk → RAM: Today, it is conceivable to build database systems that operate primarily in main memory. In such systems, (disk) I/O management no longer plays a central role. Instead, the performance of main-memory database systems (MMDBMS) is determined by other system components: the CPU and the memory hierarchy.

A Database in Primary Memory? Commodity hardware typically comes with primary memory sizes beyond 1 GB. Since the principle of locality applies to programs and data (roughly: 90% of all database operations touch 10% of the data), most database hot sets easily fit into RAM. Even further: the author of "A Database in the CPU Cache" might come to Garching and try to convince you that a DBMS needs only a tiny fraction of RAM.

The Principle of Locality: 1. Temporal locality: recently accessed items are likely to be accessed again in the near future. 2. Spatial locality: items whose addresses are near one another tend to be referenced close together in time. Based on the recent past, we can thus predict with reasonable accuracy which data will be touched (read/written) in the near future.

I/O Latency Dominates Everything [slide graphic: hard disk, 10 000 rotations/min]

Lack of I/O Latency … promises fabulous performance figures for MMDBMS. MMDBMS like MonetDB (CWI Amsterdam) indeed exhibit query performance improvements of two orders of magnitude over commercial disk-based DBMS. But: the DBMS internals need to be carefully engineered to realize this potential.

MonetDB: Binary Relations Only: Designed as a relational MMDBMS from the ground up, many design decisions in MonetDB seem peculiar. All tables have exactly two columns (binary relations). These columns are named head (h) and tail (t). Most operators (e.g., select()) implicitly act on the head (or tail) column of a table.


MonetDB: Design Decisions: Details of CPU and main-memory architecture drove the development of MonetDB: 1. The narrower the tuples, the more tuples fit into a tiny fraction of RAM (e.g., the CPU cache). 2. Primitive operators spend fewer CPU cycles per tuple and behave in a predictable fashion.

CPU and Memory Performance Diverge: Since 1986, CPU performance has improved by about 55% per year (a factor of 1.55/year). DRAM (dynamic RAM) access speed improves by only about 7% per year. Modern CPUs thus spend ever larger fractions of their time waiting for memory reads and writes to complete (memory latency).

The CPU–Memory Speed Gap [slide graphic: diverging CPU vs. DRAM performance curves]

Principle of Locality Comes to the Rescue: Design a hierarchical memory system, based on memories of different speeds and sizes.

Memory Access: The New Bottleneck. Memory access beyond the CPU cache easily costs 100s of CPU instructions; accessing disk-based memory accounts for about 1 million instructions. Current and future hardware trends make this worse. If the DBMS needs to perform costly memory accesses: 1. make sure to use all data moved into the cache/CPU, 2. try to access memory in a predictable fashion (prefetching).

Instruction-Level Parallelism: Modern CPUs, e.g., Intel's Itanium 2 or Pentium 4, feature execution pipelines which ideally can complete more than 1 instruction per cycle (IPC): 1. Itanium 2: max. 6 instructions execute in a 7-stage pipeline, so 6 × 7 = 42 instructions execute in parallel. 2. Pentium 4: max. 3 instructions execute in a 31-stage pipeline, so 3 × 31 = 93 instructions execute in parallel. Such parallelism cannot always be found in (database) code.

Tracing MySQL: For a simple SQL query like the following, MySQL calls a dedicated routine to perform the addition for each tuple individually: SELECT A + B FROM R. The query engine first uses helper routines like rec_get_nth_field() to copy data in and out of MySQL's internal record representation.

Slow Addition in MySQL: An inherent problem of the MySQL query engine is its one-tuple-at-a-time approach:
foreach r ∈ R {
    s := Item_func_plus::val(r.A, r.B);
}
Each invocation experiences its data dependencies in isolation: no potential for parallelism.

Tracing MySQL: The addition itself, performed by routine Item_func_plus::val(), is found to take 50 CPU cycles: - Calling and returning from Item_func_plus::val() accounts for 30 CPU cycles. - The addition consumes the remaining 20 CPU cycles.

Data Dependencies: The trace was performed on a MIPS R12000 CPU, which can perform 3 ALU (arithmetic) and 1 load/store operation per cycle; avg. instruction latency: 5 cycles.
LD  R1,<src1>   ; R1 ← <src1>
LD  R2,<src2>   ; R2 ← <src2>
ADD R3,R2,R1    ; R3 ← R1+R2   (data dependency on both loads)
SD  R3,<dst>    ; <dst> ← R3   (data dependency on the ADD)

Loop Unrolling: Unrolling the tuple-at-a-time loop and expanding the code for Item_func_plus::val() reveals that there is no data dependency between additions of different tuples:
s[n]   := r[n].A   + r[n].B;
s[n+1] := r[n+1].A + r[n+1].B;
s[n+2] := r[n+2].A + r[n+2].B;

Instruction Scheduling: Let the CPU or the compiler schedule dependent instructions such that instruction latency is hidden:
LD  R1,<src1>
LD  R2,<src2>
ADD R3,R2,R1
LD  R1,<src3>
LD  R2,<src4>
ADD R4,R2,R1
LD  R1,<src5>
LD  R2,<src6>
SD  R3,<dst1>
ADD R3,R2,R1
LD  R1,<src7>
LD  R2,<src8>
SD  R4,<dst2>
One addition completes every 3–4 CPU cycles.

Course Syllabus (1)
Chapter 0: Introduction and Motivation
Chapter 1: CPU Architecture and Instruction Sets
- CPU performance, instruction set principles, RISC
Chapter 2: Pipelining and Instruction-Level Parallelism (ILP)
- CPU pipelines, data and control hazards, parallelism, instruction scheduling, branch prediction, super-scalar CPUs

Course Syllabus (2)
Chapter 3: Database Systems: Where Does Time Go? (Part I)
- CPU usage, stalls, and misprediction in DBMSs
Chapter 4: How Database Systems Can Take Advantage of ILP
- Vectorized processing, SIMD instructions, predictable code, compression [MonetDB, X100]

Course Syllabus (3)
Chapter 5: The Memory Hierarchy (Close to the CPU)
- Caches, (reducing) miss rate and penalty, loop reorganization, virtual memory, TLBs
Chapter 6: Database Systems: Where Does Time Go? (Part II)
- Memory access behavior of database operators, impact of data layout

Course Syllabus (4)
Chapter 7: How Database Systems Can Exploit the Memory Hierarchy
- Data placement, column storage, database operation buffering, prefetching, compiler techniques [MonetDB, X100]