Query co-processing on Commodity Processors


1 Query co-processing on Commodity Processors
Anastassia Ailamaki, Carnegie Mellon University
Naga K. Govindaraju and Dinesh Manocha, University of North Carolina at Chapel Hill

Processor Performance / Time
[Chart: SPECint2000 performance by year of introduction for Intel, Alpha, Sparc, Mips, HP PA, PowerPC, and AMD processors, through the scalar, superscalar, out-of-order, and SMT generations]
Technology shift towards multi-cores [multiple processors on the same chip]

Focus of this tutorial
DB workload execution on a modern computer
[Chart: processor BUSY vs. IDLE share of execution time for an ideal workload, sequential scan, index scan, DSS, and OLTP]
DBMSs can run MUCH faster if they use the available hardware efficiently

2 Detailed Seminar Outline
INTRODUCTION AND OVERVIEW: How computer architecture trends affect database workload behavior; CPUs, NPUs, and GPUs: opportunity for architectural study
DBs on CONVENTIONAL PROCESSORS: Query processing: time breakdowns, bottlenecks, and current directions; Architecture-conscious data management: limitations and opportunities
QUERY co-processing: NETWORK PROCESSORS: TLP and network processors; Programming model; Methodology & results
QUERY co-processing: GRAPHICS PROCESSORS: Graphics processor overview; Mapping computation to GPUs; Database and data mining applications
CONCLUSIONS AND FUTURE DIRECTIONS

Outline
INTRODUCTION AND OVERVIEW: How computer architecture trends affect DB workload behavior; CPUs, NPUs, and GPUs: opportunity for architectural study
DBs on CONVENTIONAL PROCESSORS
QUERY co-processing: NETWORK PROCESSORS
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS

Processor speed
Scale # of transistors, innovative microarchitecture
Moore's Law holds (despite technological hurdles!): processor speed doubles every 18 months

3 Memory speed and capacity
Memory capacity increases exponentially; DRAM fabrication primarily targets density, so speed increases only linearly
[Charts: DRAM size by year of introduction, 64KB through 4GB, and DRAM speed trends (cycle time, slowest/fastest RAS, and CAS, in ns) for 64Kbit through 64Mbit chips]
Larger, but not as much faster, memories

The Memory Wall
[Chart: processor cycles per instruction vs. cycles per access to DRAM, CPU(s) vs. memory, from VAX/1980 to PPro/1996]
Trip to memory = thousands of instructions!

Memory hierarchies
Caches trade off capacity for speed; exploit instruction/data locality; demand fetch/wait for data
Typical hierarchy: CPU (1 clk), L1 64K (10 clk), L2 2M / L3 32M (100 clk), memory 4GB (1000 clk), storage 100G to 1TB
[ADH99]: Running the top 4 database systems yields at most 50% CPU utilization
But wait a minute: isn't I/O the bottleneck???
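
A minimal C++ sketch (not from the tutorial) makes the memory wall tangible: when every load depends on the previous one, the processor cannot overlap accesses and each trip to DRAM is paid in full.

    // Dependent (pointer-chasing) loads: each access must wait for the last one.
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const size_t n = 1 << 24;                    // ~16M entries, far larger than any cache
        std::vector<size_t> next(n);
        std::iota(next.begin(), next.end(), 0);
        std::shuffle(next.begin(), next.end(), std::mt19937_64(42));  // random walk through memory
        size_t i = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t s = 0; s < n; ++s) i = next[i];  // serialized cache/DRAM misses
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%.1f ns per dependent load (i=%zu)\n",
                    std::chrono::duration<double, std::nano>(t1 - t0).count() / n, i);
    }

A sequential sweep of the same array runs at cache speed; the dependent version runs at DRAM latency, i.e., hundreds of processor cycles per load.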

4 Modern storage managers
Several decades of work to hide I/O:
Asynchronous I/O + prefetch & postwrite: overlap I/O latency with useful computation
Parallel data access: partition data on modern disk arrays [PAT88]
Smart data placement / clustering: improve data locality, maximize parallelism, exploit hardware characteristics
DB storage managers efficiently hide I/O latencies, and larger main memories fit more data: 1MB in the 80s, 10GB today, TBs coming soon

Why should we (databasers) care?
[Chart: cycles per instruction for the theoretical minimum, desktop/engineering (SPECInt), DB decision support (TPC-H), and DB online transaction processing (TPC-C)]
Database workloads under-utilize hardware

Near-future Multiprocessors
Typical platforms: 1. chips with multiple cores; 2. servers with multiple chips; 3. memory shared across
[Diagram: multiprocessor server with many processors P over shared memories Mem]
Memory access traverses multiple hierarchies with large non-uniform latencies
Programmer/software must hide/tolerate latency
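
A minimal sketch of the overlap idea (std::async stands in for a real asynchronous I/O facility; read_page and process_page are illustrative stubs): while page p is being processed, the read of page p+1 is already in flight.

    #include <chrono>
    #include <future>
    #include <thread>
    #include <vector>

    std::vector<char> read_page(int page_id) {          // stub for blocking disk I/O
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
        return std::vector<char>(8192, static_cast<char>(page_id));
    }

    void process_page(const std::vector<char>& page) {  // stub for useful computation
        volatile long sum = 0;
        for (char c : page) sum += c;
    }

    void scan(int num_pages) {
        auto pending = std::async(std::launch::async, read_page, 0);  // prefetch page 0
        for (int p = 0; p < num_pages; ++p) {
            std::vector<char> page = pending.get();     // waits only if I/O is slower
            if (p + 1 < num_pages)                      // issue the next read before working
                pending = std::async(std::launch::async, read_page, p + 1);
            process_page(page);                         // overlaps the in-flight read
        }
    }

    int main() { scan(16); }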

5 Chip Multiprocessors are Here!
AMD Opteron, IBM Power 5, Intel Yonah
2 cores now; soon 4, 8, 16, or 32, with multiple threads per core. How do we best use them?

Towards coarse-grain parallelism
Chip multiprocessors (CMPs): two or more complete processors on a single chip; every functional unit of a processor is duplicated
Simultaneous multithreading (SMT): store multiple contexts in different register sets; functional units multiplexed between threads; instructions of different contexts executed simultaneously

A chip multiprocessor
[Diagram: two chips, each with CPU0 and CPU1 holding private L1I/L1D caches over a shared L2 cache, both chips sharing an L3 cache in front of main memory]

6 Chip Multi-Processors (CMP)
Example: IBM Power4, Power5: two cores, shared L2, highly variable memory latency
Speedup: OLTP 3x, DSS 2.3x on Piranha [BGM00]

Simultaneous Multithreading (SMT)
Implements threads in a superscalar processor; keeps hardware state for multiple threads
E.g.: Intel Pentium 4 (SMT), IBM Power5 (SMT & CMP: 2 cores * 2 threads)
Speedup: OLTP 3x, DSS 0.5x (simulated) [LBE98]

Multithreaded processors
Simultaneous multithreading provides several program counters (and usually several register sets) on chip, with fast context switching to another thread of control
[Diagram: threads 1-4, each with its own register set, program counter, and PSR]
CMP + SMT = immense parallelism opportunity

7 Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS: Query processing: time breakdowns and bottlenecks; Data placement; Hiding latencies; Query processing algorithms and instruction cache misses; Chip multiprocessor DB architectures
QUERY co-processing: NETWORK PROCESSORS
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS

DB Execution Time Breakdown [ADH99,BGB98,BGN00,KPH98]
[Chart: execution time split into computation, memory, branch misprediction, and resource stalls for sequential scan (TPC-D), index scan, and TPC-C; PII Xeon, NT 4.0, four DBMSs: A, B, C, D]
At least 50% of cycles are spent on stalls; memory is the major bottleneck; branch mispredictions increase cache misses!

DSS/OLTP basics: Memory [ADH99,ADH01]
PII Xeon running NT 4.0, measured with performance counters; four commercial database systems: A, B, C, D
[Charts: memory stall time broken into L1 data, L2 data, L1 instruction, and L2 instruction stalls for table scan, clustered index scan, non-clustered index scan, and join (no index), per DBMS]
Bottlenecks: data in L2, instructions in L1; random access (OLTP) is L1I-bound

8 Why Not Increase L1I Size? [HA04]
Problem: a larger cache is typically a slower cache
Not a big problem for L2, but L1I is in the critical execution path: a slower L1I means a slower clock
Trends: [Chart: cache size (10KB to 10MB) by year introduced; max on-chip L2/L3 cache grows steadily while the L1-I cache stays flat]
L1I size is stable; L2 size keeps increasing. What is the effect on performance?

Increasing L2 Cache Size [BGB98,KPH98]
DSS: performance improves as the L2 cache grows
Not as clear a win for OLTP on multiprocessors: reducing cache size causes more capacity/conflict misses, while increasing cache size causes more coherence misses
[Chart: % of L2 cache misses to dirty data in another processor's cache (0-25%) for 256KB, 512KB, and 1MB caches on 1, 2, and 4 processors]
Larger L2: a trade-off for OLTP

Summary: Time breakdowns
Database workloads spend more than 50% of their time in stalls, mostly due to memory delays; stalls cannot always be reduced by increasing cache size
Crucial bottlenecks: data accesses to the L2 cache (esp. for DSS) and instruction accesses to the L1 cache (esp. for OLTP)
Direction 1: eliminate unnecessary misses. Direction 2: hide the latency of cold misses

9 Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS: Query processing: time breakdowns and bottlenecks; Data placement; Hiding latencies; Query processing algorithms and instruction cache misses; Chip multiprocessor DB architectures
QUERY co-processing: NETWORK PROCESSORS
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS

Classic Data Layout on Disk Pages
NSM (N-ary Storage Model, or slotted pages)
[Diagram: relation R with attributes EID, Name, Age (Jane, John, Jim, Susan, Leon, Dan) stored as a slotted page: PAGE HEADER followed by whole records (HDR 1237 Jane 30, HDR 4322 John 45, HDR 1563 Jim 20, HDR 7658 Susan 52)]
Records are stored sequentially and the attributes of a record are stored together: good for full-record access, poor for partial-record access

NSM in Memory Hierarchy
select name from R where age > 50
[Diagram: the NSM page travels from disk to main memory unchanged; the query then drags entire records through the CPU cache block by block, even though it needs only name and age]
Optimized for full-record access; slows down partial-record access at all levels
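
A minimal C++ sketch of an NSM-style page (field names and sizes are illustrative, not those of any specific DBMS): because all attributes of a tuple sit together, a scan that needs only one attribute still walks over every record's cache lines.

    #include <cstdint>
    #include <vector>

    struct Record {                // attributes of one tuple stored together (NSM)
        uint32_t eid;
        char     name[16];
        uint32_t age;
    };

    struct NsmPage {
        static constexpr size_t kPageSize = 8192;
        uint16_t num_records = 0;
        Record   records[(kPageSize - sizeof(uint16_t)) / sizeof(Record)];
    };

    // Partial-record access: touches every record's cache lines anyway.
    std::vector<const char*> names_older_than(const NsmPage& p, uint32_t lo) {
        std::vector<const char*> out;
        for (uint16_t i = 0; i < p.num_records; ++i)
            if (p.records[i].age > lo) out.push_back(p.records[i].name);
        return out;
    }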

10 Partition Attributes Across (PAX) [ADH01]
[Diagram: the NSM page reorganized into a PAX page: PAGE HEADER followed by one minipage per attribute (1237 4322 1563 7658 | Jane John Jim Susan | 30 45 20 52)]
Idea: partition data within the page for spatial locality

PAX in Memory Hierarchy [ADH01]
select name from R where age > 50
[Diagram: only the age minipage travels through the CPU cache, one block at a time; the transfer is cheap]
Optimizes CPU cache-to-memory communication; retains NSM's I/O behavior (page contents do not change)

PAX Performance Results (Shore)
[Charts: stall cycles per record (L1/L2 data stalls) and execution time breakdown (hardware resource, branch mispredictions, memory, computation) for NSM vs. PAX page layouts]
PII Xeon, Windows NT4, 16KB L1-I&D, 512KB L2, 512MB RAM
Query: select avg(a_i) from R where a_j >= Lo and a_j <= Hi
Validation with microbenchmarks: 70% less data stall time (only compulsory misses left); better use of the processor's superscalar capability
TPC-H performance: 15%-2x speedup in queries
Dynamic PAX: Data Morphing [HP03]
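
A minimal sketch of the PAX idea (layout simplified to fixed-size columns): each attribute lives in its own minipage, so the predicate column is scanned as a dense array and almost no fetched cache line is wasted.

    #include <cstdint>
    #include <vector>

    struct PaxPage {
        static constexpr size_t kSlots = 256;    // illustrative capacity
        uint16_t num_records = 0;
        uint32_t eid[kSlots];                    // one minipage per attribute
        char     name[kSlots][16];
        uint32_t age[kSlots];
    };

    std::vector<const char*> pax_names_older_than(const PaxPage& p, uint32_t lo) {
        std::vector<const char*> out;
        for (uint16_t i = 0; i < p.num_records; ++i)  // dense scan of the age minipage
            if (p.age[i] > lo) out.push_back(p.name[i]);
        return out;
    }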

11 B-trees: Fewer Pointers, Greater Fanout
Cache-Sensitive B+ Trees (CSB+ Trees) [RR00]
Lay out child nodes contiguously; eliminate all but one child pointer; integer keys double the fanout of nonleaf nodes
[Diagram: a B+ Tree node holding keys K1..K4 with one pointer per child vs. a CSB+ Tree node holding K1..K8 with a single pointer to a contiguous group of children]
35% faster tree lookups; update performance is 30% worse (splits)

Data Placement: Summary
Smart data placement increases spatial locality; research targets table (relation) data; goal: reduce the number of non-cold cache misses
Techniques focus on grouping attributes into cache lines for quick access: PAX, Data Morphing, Fates (an automatically-tuned DB storage manager), CSB+ Trees
Also, Fractured Mirrors [RDS02]: cache-and-disk optimization with replication

Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS: Query processing: time breakdowns and bottlenecks; Data placement; Hiding latencies; Query processing algorithms and instruction cache misses; Chip multiprocessor DB architectures
QUERY co-processing: NETWORK PROCESSORS
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS
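
A minimal sketch following the CSB+ Tree idea from [RR00] (node sizes and types simplified): children are stored contiguously, so one base pointer plus an index replaces the per-child pointers, and the freed space holds more keys per cache line.

    #include <cstdint>

    struct CsbNode {
        static constexpr int kMaxKeys = 14;  // illustrative: roughly one or two cache lines
        uint16_t num_keys;
        int32_t  keys[kMaxKeys];
        CsbNode* first_child;                // the ONLY child pointer; children are contiguous
    };

    // Descend one level: child i of n is simply first_child + i.
    const CsbNode* descend(const CsbNode* n, int32_t key) {
        int i = 0;
        while (i < n->num_keys && key >= n->keys[i]) ++i;
        return n->first_child + i;           // pointer arithmetic into the contiguous group
    }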

12 What about cold misses?
Answer: hide latencies using prefetching
Prefetching is enabled by non-blocking cache technology and prefetch assembly instructions (SGI R10000, Alpha 21264, Intel Pentium4), e.g.: pref 0(r2) pref 4(r7) pref 0(r3) pref 8(r9)
Prefetching hides cold cache miss latency; efficiently used in pointer-chasing lookups!

Prefetching B+ Trees (pB+ Trees) [CGM01]
Idea: larger nodes; node size = multiple cache lines (e.g. 8 lines); later corroborated by [HP03a]
Prefetch all lines of a node before searching it: the cost of a node access only increases slightly, while the tree becomes much shallower; no other changes required
>2x better search AND update performance; approach complementary to CSB+ Trees!

Prefetching B+ Trees [CGM01]
Goal: faster range scan
Leaf parent nodes contain the addresses of all leaves; link leaf parent nodes together; use this structure for prefetching leaf nodes
pB+ Trees: 8x speedup over B+ Trees
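
A minimal sketch of the pB+ Tree trick (GCC/Clang __builtin_prefetch; node layout illustrative): issue prefetches for every cache line of a wide node before searching it, so the line fetches overlap instead of arriving one demand miss at a time.

    #include <cstddef>
    #include <cstdint>

    constexpr size_t kLineSize = 64;

    struct WideNode {                        // one node spanning several cache lines
        int32_t num_keys;
        int32_t keys[63];
        WideNode* children[64];
    };

    constexpr int kLines = (sizeof(WideNode) + kLineSize - 1) / kLineSize;

    const WideNode* search_step(const WideNode* n, int32_t key) {
        const char* base = reinterpret_cast<const char*>(n);
        for (int i = 0; i < kLines; ++i)     // fetch all of the node's lines in parallel
            __builtin_prefetch(base + i * kLineSize, 0, 3);
        int i = 0;                           // the search below then mostly hits in cache
        while (i < n->num_keys && key >= n->keys[i]) ++i;
        return n->children[i];
    }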

13 Fractal Prefetching B+ Trees (fpB+ Trees) [CGM02]
What if the B+ Tree does not fit in memory?
Idea: combine memory & disk trees: embed cache-optimized trees in disk tree nodes
fpB+ Trees optimize both cache AND disk; key compression to increase fanout [BMR01]
Compared to disk-based B+ Trees: 80% faster in-memory searches with similar disk performance

Bulk lookups [ZR03a]
Optimize data cache performance, like computation regrouping [PMH02]
Idea: increase temporal locality by delaying (buffering) node probes until a group is formed
Example: NLJ probe stream (r1, 10), (r2, 80), (r3, 15): (r1, 10) is buffered before accessing child B; (r2, 80) is buffered before accessing child C; when the buffer fills, the entries are divided among the children and each node is visited once for the whole group
3x speedup with enough concurrency

Hiding latencies: Summary
Optimize B+ Tree pointer-chasing cache behavior: reduce node size to a few cache lines; reduce pointers for larger fanout (CSB+); next-pointers at the lowest non-leaf level for easy prefetching (pB+); simultaneously optimize cache and disk (fpB+); bulk searches: buffer index accesses
Additional work: CR-tree, a cache-conscious R-tree that compresses MBR keys [KCK01]; cache-oblivious B-Trees [BDF00] with an optimal bound on the number of memory transfers, regardless of # of memory levels, block size, or level speed; a survey of techniques for B-Tree cache performance [GL01]
Lots more to be done in this area: consider existing, heretofore-folkloric knowledge such as key normalization/compression, alignment, and separating keys from pointers
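
A minimal sketch of buffered bulk lookups (simplified to a recursive pass, not the [ZR03a] implementation): probes are queued per child instead of chasing pointers immediately, so each node is visited once per batch while its cache lines are hot.

    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Node {
        std::vector<int32_t> keys;           // illustrative node layout
        std::vector<Node*>   children;       // empty => leaf
    };

    using Probe = std::pair<int, int32_t>;   // (rid, key)

    void bulk_probe(const Node* n, std::vector<Probe> batch, std::vector<int>& matches) {
        if (n->children.empty()) {
            for (const Probe& p : batch)     // leaf is cache-resident: answer the whole group
                for (int32_t k : n->keys)
                    if (k == p.second) matches.push_back(p.first);
            return;
        }
        std::vector<std::vector<Probe>> buf(n->children.size());
        for (const Probe& p : batch) {       // divide buffered entries among the children
            size_t i = 0;
            while (i < n->keys.size() && p.second >= n->keys[i]) ++i;
            buf[i].push_back(p);
        }
        for (size_t i = 0; i < n->children.size(); ++i)
            if (!buf[i].empty()) bulk_probe(n->children[i], std::move(buf[i]), matches);
    }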

14 Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS: Query processing: time breakdowns and bottlenecks; Data placement; Hiding latencies; Query processing algorithms and instruction cache misses; Chip multiprocessor DB architectures
QUERY co-processing: NETWORK PROCESSORS
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS

Query Processing Algorithms
Idea: adapt query processing algorithms to caches
Related work includes improving data cache performance (sorting, join) and improving instruction cache performance (DSS applications)

Query processing: directions
AlphaSort: use quicksort and key-prefix pointers
Monet: a main-memory DBMS using aggressive DSM; optimize partitioning with hierarchical radix-clustering; optimize post-projection with radix-declustering; many other optimizations
Traditional hash joins: aggressive prefetching efficiently hides data cache misses, with robust performance under future long latencies; Inspector Joins
DSS I-misses: group computation (new operator)
B-tree concurrency control: reduce readers' latching
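
A minimal sketch of one radix-partitioning pass (the hierarchical, multi-pass scheme in Monet is more involved): splitting tuples on a few key bits at a time keeps the number of active partitions small enough that their target cache lines stay resident.

    #include <cstdint>
    #include <vector>

    std::vector<std::vector<uint32_t>> radix_pass(const std::vector<uint32_t>& keys,
                                                  int shift, int bits) {
        std::vector<std::vector<uint32_t>> parts(1u << bits);
        const uint32_t mask = (1u << bits) - 1;
        for (uint32_t k : keys)
            parts[(k >> shift) & mask].push_back(k);  // few partitions => cache-friendly writes
        return parts;
    }

Repeating the pass on each partition with the next group of bits (hierarchical radix-clustering) yields cache-sized partitions that can then be joined or aggregated without misses.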

15 Instruction-Related Stalls
25-40% of execution time [KPH98, HA04]
Importance of the instruction cache: the L1 I-cache sits in the critical execution path, feeding the execution pipeline; it is impossible to overlap I-cache delays

Call graph prefetching for DB apps [APD03]
Goal: improve DSS I-cache performance
Idea: predict the next function call using a small cache
Example: create_rec always calls find_, lock_, update_, and unlock_ page in the same order
Experiments: Shore on the SimpleScalar simulator running the Wisconsin Benchmark
Beneficial for predictable DSS streams

DB operators using SIMD [ZR02]
SIMD: Single Instruction Multiple Data; in modern CPUs it targets multimedia apps
Example: Pentium 4, where a 128-bit SIMD register holds four 32-bit values: (X3 X2 X1 X0) OP (Y3 Y2 Y1 Y0) = (X3 OP Y3, X2 OP Y2, X1 OP Y1, X0 OP Y0)
Assume data stored columnwise as a contiguous array of fixed-length numeric values (e.g., PAX)
Scan example: the original scan code is "if x[n] > 10 then result[pos++] = x[n]"; the SIMD first phase compares x[n+3], x[n+2], x[n+1], x[n] to the constant in parallel, producing a bitmap vector with 4 comparison results
Superlinear speedup relative to the degree of parallelism, but the code needs rewriting to use SIMD
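
A minimal sketch with SSE2 intrinsics (x86; constant 10 taken from the slide's example): the first phase compares four 32-bit values at once and extracts a 4-bit result bitmap, replacing four branchy scalar comparisons.

    #include <cstdint>
    #include <emmintrin.h>   // SSE2

    // Bit i of the result is set iff x[i] > 10, for one vector of 4 values.
    inline int compare4_gt10(const int32_t* x) {
        __m128i vals = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x));
        __m128i gt   = _mm_cmpgt_epi32(vals, _mm_set1_epi32(10));  // per-lane all-ones / zero
        return _mm_movemask_ps(_mm_castsi128_ps(gt));              // compress to 4 bits
    }

    // Full scan (n assumed to be a multiple of 4 for brevity).
    int scan_gt10(const int32_t* x, int n, int32_t* result) {
        int pos = 0;
        for (int i = 0; i + 4 <= n; i += 4) {
            int bits = compare4_gt10(x + i);
            for (int b = 0; b < 4; ++b)     // second phase: gather the qualifying values
                if (bits & (1 << b)) result[pos++] = x[i + b];
        }
        return pos;
    }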

16 STEPS: Cache-Resident OLTP [HA04]
Optimize OLTP instruction-cache performance; exploit high transaction concurrency
Synchronized Transactions through Explicit Processor Scheduling: multiplex concurrent transactions to exploit common code paths
[Diagram: before, threads A and B each run the whole code path and thrash the I-cache; after, the CPU context-switches between A and B whenever the executed code segment fills the I-cache capacity window, so B reuses the instructions A just loaded]
Full-system TPC-C implementation: 65% fewer L1-I misses, 40% speedup

Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS: Query processing: time breakdowns and bottlenecks; Data placement; Hiding latencies; Query processing algorithms and instruction cache misses; Chip multiprocessor DB architectures
QUERY co-processing: NETWORK PROCESSORS
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS

Evolution of hardware design
Past: HW = CPU + Memory + Disk; in the 80s, memory was 1 cycle away from the CPU
Today, CPUs run faster than they can access data
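
A minimal sketch of the scheduling idea (illustrative only, not the STEPS implementation): run every transaction in a batch through the same small code stage before advancing, so that stage's instructions stay I-cache resident across all transactions.

    #include <functional>
    #include <vector>

    struct Txn { int state = 0; };
    using Stage = std::function<void(Txn&)>;     // one cache-sized code segment

    void steps_like_schedule(std::vector<Txn>& txns, const std::vector<Stage>& stages) {
        for (const Stage& stage : stages)        // outer loop: code segment
            for (Txn& t : txns)                  // inner loop: reuse the hot instructions
                stage(t);                        // "context switch" = moving to the next txn
    }

The first transaction through a stage pays the I-cache misses; every follower executes the same instructions essentially for free.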

17 CMP, SMT, and memory
Goals: operation-level parallelism and CPU-memory affinity; DBMS core design contradicts the above goals

Staged Database Systems [HA03]
A staged software design allows cohort scheduling of queries to amortize loading time, and suspending at module boundaries to maintain context
Break the DBMS into stages; stages act as independent servers; queries exist in the form of packets
Proposed query scheduling algorithms address the locality/wait-time tradeoff [HA02]

Staged Database Systems [HA03,HSA05]
[Diagram: a query flows IN through connect, parser, optimizer, and operator stages (ISCAN, FSCAN, SORT, JOIN, AGGR), then OUT via send-results; each stage works against its own L1/L2 slice of the memory hierarchy]
Optimize instruction/data cache locality; naturally enable multi-query processing; highly scalable, fault-tolerant, trustworthy
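
A minimal sketch of the staged structure (illustrative; no relation to the actual StagedDB code base): each stage owns a queue of query "packets" and drains a cohort at a time, so the stage's instructions and data stay cache-resident while it runs.

    #include <deque>
    #include <functional>

    struct Packet { int query_id; };

    struct StageServer {
        std::deque<Packet> inbox;
        std::function<void(Packet&)> work;       // the stage's operator code
        StageServer* next = nullptr;             // downstream stage

        void run_cohort(size_t max_batch) {      // drain a batch, then yield the core
            for (size_t i = 0; i < max_batch && !inbox.empty(); ++i) {
                Packet p = inbox.front();
                inbox.pop_front();
                work(p);                         // hot code path reused across the cohort
                if (next) next->inbox.push_back(p);  // forward the packet downstream
            }
        }
    };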

18 NUCA: memory hierarchy abstraction
[Diagram: eight processors P, each with a private cache, in front of a large shared on-chip cache]

Data movement on the CMP hierarchy
Stencil computation vs. OLTP (data from Beckmann & Wood, Micro 04)
Traditional DBMS: shared information sits in the middle of the hierarchy; StagedDB: exposed data movement

StagedCMP: StagedDB on Multicore
µEngines run independently on cores; a dispatcher routes incoming tasks to cores
[Diagram: dispatcher feeding µEngines on CPU 1 and CPU 2]
Tradeoff: work sharing at each µEngine vs. load on each µEngine
Potential: better work sharing and load balance on CMP

19 Summary: Bridging the Gap
Cache-aware data placement eliminates unnecessary trips to memory and minimizes conflict/capacity misses; Fates: decouple memory layout from storage layout
What about compulsory (cold) misses? They can't be avoided, but their latency can be hidden with prefetching: techniques for B-trees, hash joins
Addressing instruction stalls: DSS: call graph prefetching, SIMD, group operator; OLTP: STEPS, a promising direction for any platform
Staged Database Systems: a scalable future

Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS
QUERY co-processing: NETWORK PROCESSORS: TLP and network processors; Programming model; Methodology & results
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS

Modern Architecture & DBMS
Instruction-Level Parallelism (ILP): out-of-order (OoO) execution window; cache hierarchies exploit spatial/temporal locality
DBMS memory system characteristics: limited locality (e.g., sequential scan); random access patterns (e.g., hash join); pointer-chasing (e.g., index scan, hash join)
DBMS needs Memory-Level Parallelism (MLP)

20 Opportunities on NPUs
DB operators on thread-parallel architectures: expose parallel misses to memory; leverage intra-operator parallelism
Evaluation using network processors: designed for packet processing, with abundant thread-level parallelism (64+ contexts)
Speedups of 2.0x-2.5x on common operators
Early insight on multi-core, multi-threaded architectures and DBMS execution

Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS
QUERY co-processing: NETWORK PROCESSORS: TLP and network processors; Programming model; Methodology & results
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS

TLP in database operators
Sequential or index scan: fetch attributes in parallel, with threads A and B working on different tuples
Hash join: probe tuples in parallel, one outstanding probe per thread
[Diagram: two threads concurrently probing a table (Name, ID, Grade: Smith 0 B; Chen 1 A)]
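
A minimal sketch of intra-operator TLP (std::thread stands in for the NPU's hardware contexts): many threads probing a hash table keep several cache misses outstanding at once, exposing memory-level parallelism across independent tuples.

    #include <atomic>
    #include <cstdint>
    #include <thread>
    #include <vector>

    bool probe(const std::vector<std::vector<int32_t>>& table, int32_t key) {
        const auto& bucket = table[static_cast<uint32_t>(key) % table.size()];
        for (int32_t k : bucket) if (k == key) return true;  // likely a miss per bucket
        return false;
    }

    int64_t parallel_probe(const std::vector<std::vector<int32_t>>& table,
                           const std::vector<int32_t>& outer, unsigned nthreads) {
        std::atomic<int64_t> matches{0};
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < nthreads; ++t)
            pool.emplace_back([&, t] {
                int64_t local = 0;
                for (size_t i = t; i < outer.size(); i += nthreads)  // independent probes
                    local += probe(table, outer[i]);
                matches += local;
            });
        for (auto& th : pool) th.join();
        return matches.load();
    }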

21 Network Processors
[Diagram: host CPU and DRAM attached over PCI to the NP, which couples 8 processing cores with scratch memory, a PCI controller, and a DRAM controller]
Intel IXP2400: 8 cores, each with 8 thread contexts
Dedicated DDR SDRAM (up to 1GB); < 20W power dissipation

Multi-threaded Core
Simple processing core: 5-stage, single-issue, 600MHz, 2.5KB local cache
One active context per core plus waiting contexts; the core switches contexts at the programmer's discretion

Programming Model
ISA supports thread switching: wait for a specific hardware signal; 4-cycle context switch (non-preemptive)
Software-managed memory access: separate instructions for DRAM, local, and scratch memories; the programmer controls data access
No OS or virtual memory: sensible for simple, long-running code

22 Multithreading in Action
Thread states: active, waiting, ready
[Diagram: thread contexts T0-T7 on one core over time; while T0 waits on its DRAM request, T1, T2, ... execute in turn, so the aggregate keeps the core busy across the DRAM latency]

Methodology
IXP2400 on a prototype PCI card with 256MB PC2100 SDRAM, separated from the host CPU
Host: Pentium 4 Xeon 2.8GHz, 8KB L1D, 512KB L2, 4 pending misses, 3GB PC2100 SDRAM
Workloads: TPC-H orders and lineitem tables (250MB); sequential scan, hash join

Sequential Scan
Uses a slotted page layout (8KB): record headers and attributes located through a slot array
Network processor implementation: each page is scanned by the threads on one core, overlapping individual record accesses within the core

23 Sequential Scan Performance
[Chart: scan throughput vs. # threads/core for 1, 2, 4, and 8 cores]
Performance limited by the DRAM controller

Hash Join Setup
Model the probe phase: hash on the key attributes
Assign pages of the outer relation to one core; each thread context issues one probe; overlap dependent accesses within the core

Hash Join Performance
[Chart: probe throughput vs. # threads/core for 1, 2, 4, and 8 cores]
The network processor finds MLP across tuples

24 Conclusions
DBMSs on modern processors need to exploit memory parallelism, which requires thread-level parallelism
Multi-threaded, multi-core architectures hide memory latency
Evaluation using network processors: simple hardware, lots of threads, highly programmable; beat a Pentium 4 by 2x-2.5x on DB operators

Implications
In our work, we found substantial intra-operator parallelism and a promising architecture: the network processor
Other trends of interest: high-bandwidth peripherals (PCI Express); large DRAM banks on NP-based cards; direct connections from the NP card to SATA disks
These suggest off-loading work to the NP as a query co-processor

Query Coprocessing with NPs
Front-end (CPU): query compilation, results reporting
Back-end (network processors + disk arrays): query execution, raw data access
Use the right 'tool' for the job!

25 Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS
QUERY co-processing: NETWORK PROCESSORS
QUERY co-processing: GRAPHICS PROCESSORS: Graphics processor overview; Mapping computation to GPUs; Database and data mining applications
CONCLUSIONS AND FUTURE DIRECTIONS

Query Processing Architectures
[Diagram: conventional: CPU (3 GHz) with system memory (2 GB); using GPUs: the CPU additionally drives, over an AGP or PCI-E bus (4 GB/s), one or two GPUs (500 MHz) with their own video memory (512 MB each)]

The GPU pipeline
The CPU sends a stream of vertices; the vertex processing engines emit transformed vertices; the setup engine converts them into setup commands and state and draws the stream of triangles; the pixel processing engines decide visibility via the alpha, stencil, and depth tests and return the stream of visible pixels, plus a count of visible pixels, to the CPU
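
A minimal sketch of how that visible-pixel count becomes a query primitive (OpenGL 1.5+ occlusion queries assumed available in the headers/loader; draw_screen_quad_at_depth is an assumed helper, and the attribute values are assumed pre-loaded into the depth buffer): the depth test evaluates the predicate and only a count comes back to the CPU.

    #include <GL/gl.h>

    void draw_screen_quad_at_depth(float depth);   // assumed helper, one fragment per tuple

    GLuint count_tuples_less_than(float constant) {
        GLuint query, passed = 0;
        glGenQueries(1, &query);
        glEnable(GL_DEPTH_TEST);
        glDepthFunc(GL_GREATER);                   // fragment passes iff constant > stored attr
        glDepthMask(GL_FALSE);                     // count only; leave the depth buffer intact
        glBeginQuery(GL_SAMPLES_PASSED, query);
        draw_screen_quad_at_depth(constant);
        glEndQuery(GL_SAMPLES_PASSED);
        glGetQueryObjectuiv(query, GL_QUERY_RESULT, &passed);  // # tuples with attr < constant
        glDeleteQueries(1, &query);
        return passed;
    }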

26 Data Parallelism (G70)
[Diagram: the same GPU pipeline, annotated: the vertex processing engines, setup engine, and pixel processing engines all operate on 32-bit floating-point data]

CPU vs. GPU
Pentium EE (dual core): 230M transistors, 2 x 1MB cache, price: $1,040
GeForce 7800 GTX: 430 MHz, 302M transistors, 512MB onboard memory, price: $450

27 CPU vs. GPU (Henry Moreton: NVIDIA, Aug. 2005)
[Table: Pentium EE (PEE) vs. GeForce 7800 GTX and the GPU/CPU ratio, over graphics GFLOPs, shader GFLOPs, die area (mm2, absolute and normalized), transistors (M), power (W), GFLOPS/mm2, GFLOPS/transistor, and GFLOPS/W]

Exploiting Technology Moving Faster than Moore's Law
[Chart: GPU growth rate vs. CPU growth rate over time]

Graphics Processors
Optimized for graphics applications; dedicated memory interface; offloads the CPU; provides peak I/O performance

28 GPGPU: GPU-based Algorithms
Database computations; data streaming; scientific computations; collision and proximity computations; motion planning and navigation; physically-based modeling
More at

Graphics Pipeline (courtesy: David Kirk, Chief Scientist, NVIDIA)
vertex: programmable vertex processing (fp32)
setup/rasterizer: polygon setup, culling, rasterization
pixel: programmable per-pixel math (fp32)
texture: per-pixel texture, fp16 blending
image: Z-buffer, fp16 blending, anti-aliasing (MRT)
memory

NON-Graphics Pipeline (the same stages viewed as a compute engine)
data: programmable MIMD processing (fp32)
lists: SIMD rasterization
data: programmable SIMD processing (fp32)
data fetch: fp16 blending
predicated write: fp16 blend, multiple output
memory
