Query co-processing on Commodity Processors


1 Query co-processing on Commodity Processors
Anastassia Ailamaki, Carnegie Mellon University
Naga K. Govindaraju and Dinesh Manocha, University of North Carolina at Chapel Hill

Processor Performance / Time
[Chart: SPECint2000 performance by year of introduction for Intel, Alpha, Sparc, Mips, HP PA, PowerPC, and AMD processors, through the scalar, superscalar, out-of-order, and SMT generations]
Technology shift towards multi-cores [multiple processors on the same chip]

Focus of this tutorial
DB workload execution on a modern computer
[Chart: processor BUSY vs. IDLE share of execution time for an ideal workload, sequential scan, index scan, DSS, and OLTP]
DBMSs can run MUCH faster if they use the available hardware efficiently

2 Detailed Seminar Outline
INTRODUCTION AND OVERVIEW: How computer architecture trends affect database workload behavior; CPUs, NPUs, and GPUs: opportunity for architectural study
DBs on CONVENTIONAL PROCESSORS: Query processing: time breakdowns, bottlenecks, and current directions; Architecture-conscious data management: limitations and opportunities
QUERY co-processing: NETWORK PROCESSORS: TLP and network processors; Programming model; Methodology & results
QUERY co-processing: GRAPHICS PROCESSORS: Graphics processor overview; Mapping computation to GPUs; Database and data mining applications
CONCLUSIONS AND FUTURE DIRECTIONS

Outline
INTRODUCTION AND OVERVIEW: How computer architecture trends affect DB workload behavior; CPUs, NPUs, and GPUs: opportunity for architectural study
DBs on CONVENTIONAL PROCESSORS
QUERY co-processing: NETWORK PROCESSORS
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS

Processor speed
Scale # of transistors, innovative microarchitecture
Moore's Law holds (despite technological hurdles!): processor speed doubles every 18 months

3 Memory speed and capacity
Memory capacity increases exponentially; DRAM fabrication primarily targets density, so speed increases only linearly
[Charts: DRAM size by year of introduction, 64KB through 4GB, and DRAM speed trends (cycle time, slowest/fastest RAS, and CAS, in ns) for 64Kbit through 64Mbit chips]
Larger, but not as much faster, memories

The Memory Wall
[Chart: processor cycles per instruction vs. cycles per access to DRAM, CPU(s) vs. memory, from VAX/1980 to PPro/1996]
Trip to memory = thousands of instructions!

Memory hierarchies
Caches trade off capacity for speed; exploit instruction/data locality; demand fetch/wait for data
Typical hierarchy: CPU (1 clk), L1 64K (10 clk), L2 2M / L3 32M (100 clk), memory 4GB (1000 clk), storage 100G to 1TB
[ADH99]: Running the top 4 database systems yields at most 50% CPU utilization
But wait a minute: isn't I/O the bottleneck???
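
A minimal C++ sketch (not from the tutorial) makes the memory wall tangible: when every load depends on the previous one, the processor cannot overlap accesses and each trip to DRAM is paid in full.

    // Dependent (pointer-chasing) loads: each access must wait for the last one.
    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <vector>

    int main() {
        const size_t n = 1 << 24;                    // ~16M entries, far larger than any cache
        std::vector<size_t> next(n);
        std::iota(next.begin(), next.end(), 0);
        std::shuffle(next.begin(), next.end(), std::mt19937_64(42));  // random walk through memory
        size_t i = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t s = 0; s < n; ++s) i = next[i];  // serialized cache/DRAM misses
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%.1f ns per dependent load (i=%zu)\n",
                    std::chrono::duration<double, std::nano>(t1 - t0).count() / n, i);
    }

A sequential sweep of the same array runs at cache speed; the dependent version runs at DRAM latency, i.e., hundreds of processor cycles per load.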

4 Modern storage managers
Several decades of work to hide I/O:
Asynchronous I/O + prefetch & postwrite: overlap I/O latency with useful computation
Parallel data access: partition data on modern disk arrays [PAT88]
Smart data placement / clustering: improve data locality, maximize parallelism, exploit hardware characteristics
DB storage managers efficiently hide I/O latencies, and larger main memories fit more data: 1MB in the 80s, 10GB today, TBs coming soon

Why should we (databasers) care?
[Chart: cycles per instruction for the theoretical minimum, desktop/engineering (SPECInt), DB decision support (TPC-H), and DB online transaction processing (TPC-C)]
Database workloads under-utilize hardware

Near-future Multiprocessors
Typical platforms: 1. chips with multiple cores; 2. servers with multiple chips; 3. memory shared across
[Diagram: multiprocessor server with many processors P over shared memories Mem]
Memory access traverses multiple hierarchies with large non-uniform latencies
Programmer/software must hide/tolerate latency
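
A minimal sketch of the overlap idea (std::async stands in for a real asynchronous I/O facility; read_page and process_page are illustrative stubs): while page p is being processed, the read of page p+1 is already in flight.

    #include <chrono>
    #include <future>
    #include <thread>
    #include <vector>

    std::vector<char> read_page(int page_id) {          // stub for blocking disk I/O
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
        return std::vector<char>(8192, static_cast<char>(page_id));
    }

    void process_page(const std::vector<char>& page) {  // stub for useful computation
        volatile long sum = 0;
        for (char c : page) sum += c;
    }

    void scan(int num_pages) {
        auto pending = std::async(std::launch::async, read_page, 0);  // prefetch page 0
        for (int p = 0; p < num_pages; ++p) {
            std::vector<char> page = pending.get();     // waits only if I/O is slower
            if (p + 1 < num_pages)                      // issue the next read before working
                pending = std::async(std::launch::async, read_page, p + 1);
            process_page(page);                         // overlaps the in-flight read
        }
    }

    int main() { scan(16); }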

5 Chip Multiprocessors are Here!
AMD Opteron, IBM Power 5, Intel Yonah
2 cores now; soon 4, 8, 16, or 32, with multiple threads per core. How do we best use them?

Towards coarse-grain parallelism
Chip multiprocessors (CMPs): two or more complete processors on a single chip; every functional unit of a processor is duplicated
Simultaneous multithreading (SMT): store multiple contexts in different register sets; functional units multiplexed between threads; instructions of different contexts executed simultaneously

A chip multiprocessor
[Diagram: two chips, each with CPU0 and CPU1 holding private L1I/L1D caches over a shared L2 cache, both chips sharing an L3 cache in front of main memory]

6 Chip Multi-Processors (CMP)
Example: IBM Power4, Power5: two cores, shared L2, highly variable memory latency
Speedup: OLTP 3x, DSS 2.3x on Piranha [BGM00]

Simultaneous Multithreading (SMT)
Implements threads in a superscalar processor; keeps hardware state for multiple threads
E.g.: Intel Pentium 4 (SMT), IBM Power5 (SMT & CMP: 2 cores * 2 threads)
Speedup: OLTP 3x, DSS 0.5x (simulated) [LBE98]

Multithreaded processors
Simultaneous multithreading provides several program counters (and usually several register sets) on chip, with fast context switching to another thread of control
[Diagram: threads 1-4, each with its own register set, program counter, and PSR]
CMP + SMT = immense parallelism opportunity

7 Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS: Query processing: time breakdowns and bottlenecks; Data placement; Hiding latencies; Query processing algorithms and instruction cache misses; Chip multiprocessor DB architectures
QUERY co-processing: NETWORK PROCESSORS
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS

DB Execution Time Breakdown [ADH99,BGB98,BGN00,KPH98]
[Chart: execution time split into computation, memory, branch misprediction, and resource stalls for sequential scan (TPC-D), index scan, and TPC-C; PII Xeon, NT 4.0, four DBMSs: A, B, C, D]
At least 50% of cycles are spent on stalls; memory is the major bottleneck; branch mispredictions increase cache misses!

DSS/OLTP basics: Memory [ADH99,ADH01]
PII Xeon running NT 4.0, measured with performance counters; four commercial database systems: A, B, C, D
[Charts: memory stall time broken into L1 data, L2 data, L1 instruction, and L2 instruction stalls for table scan, clustered index scan, non-clustered index scan, and join (no index), per DBMS]
Bottlenecks: data in L2, instructions in L1; random access (OLTP) is L1I-bound

8 Why Not Increase L1I Size? [HA04]
Problem: a larger cache is typically a slower cache
Not a big problem for L2, but L1I is in the critical execution path: a slower L1I means a slower clock
Trends: [Chart: cache size (10KB to 10MB) by year introduced; max on-chip L2/L3 cache grows steadily while the L1-I cache stays flat]
L1I size is stable; L2 size keeps increasing. What is the effect on performance?

Increasing L2 Cache Size [BGB98,KPH98]
DSS: performance improves as the L2 cache grows
Not as clear a win for OLTP on multiprocessors: reducing cache size causes more capacity/conflict misses, while increasing cache size causes more coherence misses
[Chart: % of L2 cache misses to dirty data in another processor's cache (0-25%) for 256KB, 512KB, and 1MB caches on 1, 2, and 4 processors]
Larger L2: a trade-off for OLTP

Summary: Time breakdowns
Database workloads spend more than 50% of their time in stalls, mostly due to memory delays; stalls cannot always be reduced by increasing cache size
Crucial bottlenecks: data accesses to the L2 cache (esp. for DSS) and instruction accesses to the L1 cache (esp. for OLTP)
Direction 1: eliminate unnecessary misses. Direction 2: hide the latency of cold misses

9 Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS: Query processing: time breakdowns and bottlenecks; Data placement; Hiding latencies; Query processing algorithms and instruction cache misses; Chip multiprocessor DB architectures
QUERY co-processing: NETWORK PROCESSORS
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS

Classic Data Layout on Disk Pages
NSM (N-ary Storage Model, or slotted pages)
[Diagram: relation R with attributes EID, Name, Age (Jane, John, Jim, Susan, Leon, Dan) stored as a slotted page: PAGE HEADER followed by whole records (HDR 1237 Jane 30, HDR 4322 John 45, HDR 1563 Jim 20, HDR 7658 Susan 52)]
Records are stored sequentially and the attributes of a record are stored together: good for full-record access, poor for partial-record access

NSM in Memory Hierarchy
select name from R where age > 50
[Diagram: the NSM page travels from disk to main memory unchanged; the query then drags entire records through the CPU cache block by block, even though it needs only name and age]
Optimized for full-record access; slows down partial-record access at all levels
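
A minimal C++ sketch of an NSM-style page (field names and sizes are illustrative, not those of any specific DBMS): because all attributes of a tuple sit together, a scan that needs only one attribute still walks over every record's cache lines.

    #include <cstdint>
    #include <vector>

    struct Record {                // attributes of one tuple stored together (NSM)
        uint32_t eid;
        char     name[16];
        uint32_t age;
    };

    struct NsmPage {
        static constexpr size_t kPageSize = 8192;
        uint16_t num_records = 0;
        Record   records[(kPageSize - sizeof(uint16_t)) / sizeof(Record)];
    };

    // Partial-record access: touches every record's cache lines anyway.
    std::vector<const char*> names_older_than(const NsmPage& p, uint32_t lo) {
        std::vector<const char*> out;
        for (uint16_t i = 0; i < p.num_records; ++i)
            if (p.records[i].age > lo) out.push_back(p.records[i].name);
        return out;
    }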

10 Partition Attributes Across (PAX) [ADH01]
[Diagram: the NSM page reorganized into a PAX page: PAGE HEADER followed by one minipage per attribute (1237 4322 1563 7658 | Jane John Jim Susan | 30 45 20 52)]
Idea: partition data within the page for spatial locality

PAX in Memory Hierarchy [ADH01]
select name from R where age > 50
[Diagram: only the age minipage travels through the CPU cache, one block at a time; the transfer is cheap]
Optimizes CPU cache-to-memory communication; retains NSM's I/O behavior (page contents do not change)

PAX Performance Results (Shore)
[Charts: stall cycles per record (L1/L2 data stalls) and execution time breakdown (hardware resource, branch mispredictions, memory, computation) for NSM vs. PAX page layouts]
PII Xeon, Windows NT4, 16KB L1-I&D, 512KB L2, 512MB RAM
Query: select avg(a_i) from R where a_j >= Lo and a_j <= Hi
Validation with microbenchmarks: 70% less data stall time (only compulsory misses left); better use of the processor's superscalar capability
TPC-H performance: 15%-2x speedup in queries
Dynamic PAX: Data Morphing [HP03]
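
A minimal sketch of the PAX idea (layout simplified to fixed-size columns): each attribute lives in its own minipage, so the predicate column is scanned as a dense array and almost no fetched cache line is wasted.

    #include <cstdint>
    #include <vector>

    struct PaxPage {
        static constexpr size_t kSlots = 256;    // illustrative capacity
        uint16_t num_records = 0;
        uint32_t eid[kSlots];                    // one minipage per attribute
        char     name[kSlots][16];
        uint32_t age[kSlots];
    };

    std::vector<const char*> pax_names_older_than(const PaxPage& p, uint32_t lo) {
        std::vector<const char*> out;
        for (uint16_t i = 0; i < p.num_records; ++i)  // dense scan of the age minipage
            if (p.age[i] > lo) out.push_back(p.name[i]);
        return out;
    }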

11 B-trees: Fewer Pointers, Greater Fanout
Cache-Sensitive B+ Trees (CSB+ Trees) [RR00]
Lay out child nodes contiguously; eliminate all but one child pointer; integer keys double the fanout of nonleaf nodes
[Diagram: a B+ Tree node holding keys K1..K4 with one pointer per child vs. a CSB+ Tree node holding K1..K8 with a single pointer to a contiguous group of children]
35% faster tree lookups; update performance is 30% worse (splits)

Data Placement: Summary
Smart data placement increases spatial locality; research targets table (relation) data; goal: reduce the number of non-cold cache misses
Techniques focus on grouping attributes into cache lines for quick access: PAX, Data Morphing, Fates (an automatically-tuned DB storage manager), CSB+ Trees
Also, Fractured Mirrors [RDS02]: cache-and-disk optimization with replication

Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS: Query processing: time breakdowns and bottlenecks; Data placement; Hiding latencies; Query processing algorithms and instruction cache misses; Chip multiprocessor DB architectures
QUERY co-processing: NETWORK PROCESSORS
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS
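
A minimal sketch following the CSB+ Tree idea from [RR00] (node sizes and types simplified): children are stored contiguously, so one base pointer plus an index replaces the per-child pointers, and the freed space holds more keys per cache line.

    #include <cstdint>

    struct CsbNode {
        static constexpr int kMaxKeys = 14;  // illustrative: roughly one or two cache lines
        uint16_t num_keys;
        int32_t  keys[kMaxKeys];
        CsbNode* first_child;                // the ONLY child pointer; children are contiguous
    };

    // Descend one level: child i of n is simply first_child + i.
    const CsbNode* descend(const CsbNode* n, int32_t key) {
        int i = 0;
        while (i < n->num_keys && key >= n->keys[i]) ++i;
        return n->first_child + i;           // pointer arithmetic into the contiguous group
    }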

12 What about cold misses?
Answer: hide latencies using prefetching
Prefetching is enabled by non-blocking cache technology and prefetch assembly instructions (SGI R10000, Alpha 21264, Intel Pentium4), e.g.: pref 0(r2) pref 4(r7) pref 0(r3) pref 8(r9)
Prefetching hides cold cache miss latency; efficiently used in pointer-chasing lookups!

Prefetching B+ Trees (pB+ Trees) [CGM01]
Idea: larger nodes; node size = multiple cache lines (e.g. 8 lines); later corroborated by [HP03a]
Prefetch all lines of a node before searching it: the cost of a node access only increases slightly, while the tree becomes much shallower; no other changes required
>2x better search AND update performance; approach complementary to CSB+ Trees!

Prefetching B+ Trees [CGM01]
Goal: faster range scan
Leaf parent nodes contain the addresses of all leaves; link leaf parent nodes together; use this structure for prefetching leaf nodes
pB+ Trees: 8x speedup over B+ Trees
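
A minimal sketch of the pB+ Tree trick (GCC/Clang __builtin_prefetch; node layout illustrative): issue prefetches for every cache line of a wide node before searching it, so the line fetches overlap instead of arriving one demand miss at a time.

    #include <cstddef>
    #include <cstdint>

    constexpr size_t kLineSize = 64;

    struct WideNode {                        // one node spanning several cache lines
        int32_t num_keys;
        int32_t keys[63];
        WideNode* children[64];
    };

    constexpr int kLines = (sizeof(WideNode) + kLineSize - 1) / kLineSize;

    const WideNode* search_step(const WideNode* n, int32_t key) {
        const char* base = reinterpret_cast<const char*>(n);
        for (int i = 0; i < kLines; ++i)     // fetch all of the node's lines in parallel
            __builtin_prefetch(base + i * kLineSize, 0, 3);
        int i = 0;                           // the search below then mostly hits in cache
        while (i < n->num_keys && key >= n->keys[i]) ++i;
        return n->children[i];
    }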

13 Fractal Prefetching B+ Trees (fpB+ Trees) [CGM02]
What if the B+ Tree does not fit in memory?
Idea: combine memory & disk trees: embed cache-optimized trees in disk tree nodes
fpB+ Trees optimize both cache AND disk; key compression to increase fanout [BMR01]
Compared to disk-based B+ Trees: 80% faster in-memory searches with similar disk performance

Bulk lookups [ZR03a]
Optimize data cache performance, like computation regrouping [PMH02]
Idea: increase temporal locality by delaying (buffering) node probes until a group is formed
Example: NLJ probe stream (r1, 10), (r2, 80), (r3, 15): (r1, 10) is buffered before accessing child B; (r2, 80) is buffered before accessing child C; when the buffer fills, the entries are divided among the children and each node is visited once for the whole group
3x speedup with enough concurrency

Hiding latencies: Summary
Optimize B+ Tree pointer-chasing cache behavior: reduce node size to a few cache lines; reduce pointers for larger fanout (CSB+); next-pointers at the lowest non-leaf level for easy prefetching (pB+); simultaneously optimize cache and disk (fpB+); bulk searches: buffer index accesses
Additional work: CR-tree, a cache-conscious R-tree that compresses MBR keys [KCK01]; cache-oblivious B-Trees [BDF00] with an optimal bound on the number of memory transfers, regardless of # of memory levels, block size, or level speed; a survey of techniques for B-Tree cache performance [GL01]
Lots more to be done in this area: consider existing, heretofore-folkloric knowledge such as key normalization/compression, alignment, and separating keys from pointers
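
A minimal sketch of buffered bulk lookups (simplified to a recursive pass, not the [ZR03a] implementation): probes are queued per child instead of chasing pointers immediately, so each node is visited once per batch while its cache lines are hot.

    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Node {
        std::vector<int32_t> keys;           // illustrative node layout
        std::vector<Node*>   children;       // empty => leaf
    };

    using Probe = std::pair<int, int32_t>;   // (rid, key)

    void bulk_probe(const Node* n, std::vector<Probe> batch, std::vector<int>& matches) {
        if (n->children.empty()) {
            for (const Probe& p : batch)     // leaf is cache-resident: answer the whole group
                for (int32_t k : n->keys)
                    if (k == p.second) matches.push_back(p.first);
            return;
        }
        std::vector<std::vector<Probe>> buf(n->children.size());
        for (const Probe& p : batch) {       // divide buffered entries among the children
            size_t i = 0;
            while (i < n->keys.size() && p.second >= n->keys[i]) ++i;
            buf[i].push_back(p);
        }
        for (size_t i = 0; i < n->children.size(); ++i)
            if (!buf[i].empty()) bulk_probe(n->children[i], std::move(buf[i]), matches);
    }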

14 Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS: Query processing: time breakdowns and bottlenecks; Data placement; Hiding latencies; Query processing algorithms and instruction cache misses; Chip multiprocessor DB architectures
QUERY co-processing: NETWORK PROCESSORS
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS

Query Processing Algorithms
Idea: adapt query processing algorithms to caches
Related work includes improving data cache performance (sorting, join) and improving instruction cache performance (DSS applications)

Query processing: directions
AlphaSort: use quicksort and key-prefix pointers
Monet: a main-memory DBMS using aggressive DSM; optimize partitioning with hierarchical radix-clustering; optimize post-projection with radix-declustering; many other optimizations
Traditional hash joins: aggressive prefetching efficiently hides data cache misses, with robust performance under future long latencies; Inspector Joins
DSS I-misses: group computation (new operator)
B-tree concurrency control: reduce readers' latching
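
A minimal sketch of one radix-partitioning pass (the hierarchical, multi-pass scheme in Monet is more involved): splitting tuples on a few key bits at a time keeps the number of active partitions small enough that their target cache lines stay resident.

    #include <cstdint>
    #include <vector>

    std::vector<std::vector<uint32_t>> radix_pass(const std::vector<uint32_t>& keys,
                                                  int shift, int bits) {
        std::vector<std::vector<uint32_t>> parts(1u << bits);
        const uint32_t mask = (1u << bits) - 1;
        for (uint32_t k : keys)
            parts[(k >> shift) & mask].push_back(k);  // few partitions => cache-friendly writes
        return parts;
    }

Repeating the pass on each partition with the next group of bits (hierarchical radix-clustering) yields cache-sized partitions that can then be joined or aggregated without misses.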

15 Instruction-Related Stalls
25-40% of execution time [KPH98, HA04]
Importance of the instruction cache: the L1 I-cache sits in the critical execution path, feeding the execution pipeline; it is impossible to overlap I-cache delays

Call graph prefetching for DB apps [APD03]
Goal: improve DSS I-cache performance
Idea: predict the next function call using a small cache
Example: create_rec always calls find_, lock_, update_, and unlock_ page in the same order
Experiments: Shore on the SimpleScalar simulator running the Wisconsin Benchmark
Beneficial for predictable DSS streams

DB operators using SIMD [ZR02]
SIMD: Single Instruction Multiple Data; in modern CPUs it targets multimedia apps
Example: Pentium 4, where a 128-bit SIMD register holds four 32-bit values: (X3 X2 X1 X0) OP (Y3 Y2 Y1 Y0) = (X3 OP Y3, X2 OP Y2, X1 OP Y1, X0 OP Y0)
Assume data stored columnwise as a contiguous array of fixed-length numeric values (e.g., PAX)
Scan example: the original scan code is "if x[n] > 10 then result[pos++] = x[n]"; the SIMD first phase compares x[n+3], x[n+2], x[n+1], x[n] to the constant in parallel, producing a bitmap vector with 4 comparison results
Superlinear speedup relative to the degree of parallelism, but the code needs rewriting to use SIMD
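
A minimal sketch with SSE2 intrinsics (x86; constant 10 taken from the slide's example): the first phase compares four 32-bit values at once and extracts a 4-bit result bitmap, replacing four branchy scalar comparisons.

    #include <cstdint>
    #include <emmintrin.h>   // SSE2

    // Bit i of the result is set iff x[i] > 10, for one vector of 4 values.
    inline int compare4_gt10(const int32_t* x) {
        __m128i vals = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x));
        __m128i gt   = _mm_cmpgt_epi32(vals, _mm_set1_epi32(10));  // per-lane all-ones / zero
        return _mm_movemask_ps(_mm_castsi128_ps(gt));              // compress to 4 bits
    }

    // Full scan (n assumed to be a multiple of 4 for brevity).
    int scan_gt10(const int32_t* x, int n, int32_t* result) {
        int pos = 0;
        for (int i = 0; i + 4 <= n; i += 4) {
            int bits = compare4_gt10(x + i);
            for (int b = 0; b < 4; ++b)     // second phase: gather the qualifying values
                if (bits & (1 << b)) result[pos++] = x[i + b];
        }
        return pos;
    }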

16 STEPS: Cache-Resident OLTP [HA04]
Optimize OLTP instruction-cache performance; exploit high transaction concurrency
Synchronized Transactions through Explicit Processor Scheduling: multiplex concurrent transactions to exploit common code paths
[Diagram: before, threads A and B each run the whole code path and thrash the I-cache; after, the CPU context-switches between A and B whenever the executed code segment fills the I-cache capacity window, so B reuses the instructions A just loaded]
Full-system TPC-C implementation: 65% fewer L1-I misses, 40% speedup

Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS: Query processing: time breakdowns and bottlenecks; Data placement; Hiding latencies; Query processing algorithms and instruction cache misses; Chip multiprocessor DB architectures
QUERY co-processing: NETWORK PROCESSORS
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS

Evolution of hardware design
Past: HW = CPU + Memory + Disk; in the 80s, memory was 1 cycle away from the CPU
Today, CPUs run faster than they can access data
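
A minimal sketch of the scheduling idea (illustrative only, not the STEPS implementation): run every transaction in a batch through the same small code stage before advancing, so that stage's instructions stay I-cache resident across all transactions.

    #include <functional>
    #include <vector>

    struct Txn { int state = 0; };
    using Stage = std::function<void(Txn&)>;     // one cache-sized code segment

    void steps_like_schedule(std::vector<Txn>& txns, const std::vector<Stage>& stages) {
        for (const Stage& stage : stages)        // outer loop: code segment
            for (Txn& t : txns)                  // inner loop: reuse the hot instructions
                stage(t);                        // "context switch" = moving to the next txn
    }

The first transaction through a stage pays the I-cache misses; every follower executes the same instructions essentially for free.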

17 CMP, SMT, and memory
Goals: operation-level parallelism and CPU-memory affinity; DBMS core design contradicts the above goals

Staged Database Systems [HA03]
A staged software design allows cohort scheduling of queries to amortize loading time, and suspending at module boundaries to maintain context
Break the DBMS into stages; stages act as independent servers; queries exist in the form of packets
Proposed query scheduling algorithms address the locality/wait-time tradeoff [HA02]

Staged Database Systems [HA03,HSA05]
[Diagram: a query flows IN through connect, parser, optimizer, and operator stages (ISCAN, FSCAN, SORT, JOIN, AGGR), then OUT via send-results; each stage works against its own L1/L2 slice of the memory hierarchy]
Optimize instruction/data cache locality; naturally enable multi-query processing; highly scalable, fault-tolerant, trustworthy
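
A minimal sketch of the staged structure (illustrative; no relation to the actual StagedDB code base): each stage owns a queue of query "packets" and drains a cohort at a time, so the stage's instructions and data stay cache-resident while it runs.

    #include <deque>
    #include <functional>

    struct Packet { int query_id; };

    struct StageServer {
        std::deque<Packet> inbox;
        std::function<void(Packet&)> work;       // the stage's operator code
        StageServer* next = nullptr;             // downstream stage

        void run_cohort(size_t max_batch) {      // drain a batch, then yield the core
            for (size_t i = 0; i < max_batch && !inbox.empty(); ++i) {
                Packet p = inbox.front();
                inbox.pop_front();
                work(p);                         // hot code path reused across the cohort
                if (next) next->inbox.push_back(p);  // forward the packet downstream
            }
        }
    };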

18 NUCA: memory hierarchy abstraction
[Diagram: eight processors P, each with a private cache, in front of a large shared on-chip cache]

Data movement on the CMP hierarchy
Stencil computation vs. OLTP (data from Beckmann & Wood, Micro 04)
Traditional DBMS: shared information sits in the middle of the hierarchy; StagedDB: exposed data movement

StagedCMP: StagedDB on Multicore
µEngines run independently on cores; a dispatcher routes incoming tasks to cores
[Diagram: dispatcher feeding µEngines on CPU 1 and CPU 2]
Tradeoff: work sharing at each µEngine vs. load on each µEngine
Potential: better work sharing and load balance on CMP

19 Summary: Bridging the Gap
Cache-aware data placement eliminates unnecessary trips to memory and minimizes conflict/capacity misses; Fates: decouple memory layout from storage layout
What about compulsory (cold) misses? They can't be avoided, but their latency can be hidden with prefetching: techniques for B-trees, hash joins
Addressing instruction stalls: DSS: call graph prefetching, SIMD, group operator; OLTP: STEPS, a promising direction for any platform
Staged Database Systems: a scalable future

Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS
QUERY co-processing: NETWORK PROCESSORS: TLP and network processors; Programming model; Methodology & results
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS

Modern Architecture & DBMS
Instruction-Level Parallelism (ILP): out-of-order (OoO) execution window; cache hierarchies exploit spatial/temporal locality
DBMS memory system characteristics: limited locality (e.g., sequential scan); random access patterns (e.g., hash join); pointer-chasing (e.g., index scan, hash join)
DBMS needs Memory-Level Parallelism (MLP)

20 Opportunities on NPUs
DB operators on thread-parallel architectures: expose parallel misses to memory; leverage intra-operator parallelism
Evaluation using network processors: designed for packet processing, with abundant thread-level parallelism (64+ contexts)
Speedups of 2.0x-2.5x on common operators
Early insight on multi-core, multi-threaded architectures and DBMS execution

Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS
QUERY co-processing: NETWORK PROCESSORS: TLP and network processors; Programming model; Methodology & results
QUERY co-processing: GRAPHICS PROCESSORS
CONCLUSIONS AND FUTURE DIRECTIONS

TLP in database operators
Sequential or index scan: fetch attributes in parallel, with threads A and B working on different tuples
Hash join: probe tuples in parallel, one outstanding probe per thread
[Diagram: two threads concurrently probing a table (Name, ID, Grade: Smith 0 B; Chen 1 A)]
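
A minimal sketch of intra-operator TLP (std::thread stands in for the NPU's hardware contexts): many threads probing a hash table keep several cache misses outstanding at once, exposing memory-level parallelism across independent tuples.

    #include <atomic>
    #include <cstdint>
    #include <thread>
    #include <vector>

    bool probe(const std::vector<std::vector<int32_t>>& table, int32_t key) {
        const auto& bucket = table[static_cast<uint32_t>(key) % table.size()];
        for (int32_t k : bucket) if (k == key) return true;  // likely a miss per bucket
        return false;
    }

    int64_t parallel_probe(const std::vector<std::vector<int32_t>>& table,
                           const std::vector<int32_t>& outer, unsigned nthreads) {
        std::atomic<int64_t> matches{0};
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < nthreads; ++t)
            pool.emplace_back([&, t] {
                int64_t local = 0;
                for (size_t i = t; i < outer.size(); i += nthreads)  // independent probes
                    local += probe(table, outer[i]);
                matches += local;
            });
        for (auto& th : pool) th.join();
        return matches.load();
    }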

21 Network Processors
[Diagram: host CPU and DRAM attached over PCI to the NP, which couples 8 processing cores with scratch memory, a PCI controller, and a DRAM controller]
Intel IXP2400: 8 cores, each with 8 thread contexts
Dedicated DDR SDRAM (up to 1GB); < 20W power dissipation

Multi-threaded Core
Simple processing core: 5-stage, single-issue, 600MHz, 2.5KB local cache
One active context per core plus waiting contexts; the core switches contexts at the programmer's discretion

Programming Model
ISA supports thread switching: wait for a specific hardware signal; 4-cycle context switch (non-preemptive)
Software-managed memory access: separate instructions for DRAM, local, and scratch memories; the programmer controls data access
No OS or virtual memory: sensible for simple, long-running code

22 Multithreading in Action
Thread states: active, waiting, ready
[Diagram: thread contexts T0-T7 on one core over time; while T0 waits on its DRAM request, T1, T2, ... execute in turn, so the aggregate keeps the core busy across the DRAM latency]

Methodology
IXP2400 on a prototype PCI card with 256MB PC2100 SDRAM, separated from the host CPU
Host: Pentium 4 Xeon 2.8GHz, 8KB L1D, 512KB L2, 4 pending misses, 3GB PC2100 SDRAM
Workloads: TPC-H orders and lineitem tables (250MB); sequential scan, hash join

Sequential Scan
Uses a slotted page layout (8KB): record headers and attributes located through a slot array
Network processor implementation: each page is scanned by the threads on one core, overlapping individual record accesses within the core

23 Sequential Scan Performance
[Chart: scan throughput vs. # threads/core for 1, 2, 4, and 8 cores]
Performance limited by the DRAM controller

Hash Join Setup
Model the probe phase: hash on the key attributes
Assign pages of the outer relation to one core; each thread context issues one probe; overlap dependent accesses within the core

Hash Join Performance
[Chart: probe throughput vs. # threads/core for 1, 2, 4, and 8 cores]
The network processor finds MLP across tuples

24 Conclusions
DBMSs on modern processors need to exploit memory parallelism, which requires thread-level parallelism
Multi-threaded, multi-core architectures hide memory latency
Evaluation using network processors: simple hardware, lots of threads, highly programmable; beat a Pentium 4 by 2x-2.5x on DB operators

Implications
In our work, we found substantial intra-operator parallelism and a promising architecture: the network processor
Other trends of interest: high-bandwidth peripherals (PCI Express); large DRAM banks on NP-based cards; direct connections from the NP card to SATA disks
These suggest off-loading work to the NP as a query co-processor

Query Coprocessing with NPs
Front-end (CPU): query compilation, results reporting
Back-end (network processors + disk arrays): query execution, raw data access
Use the right 'tool' for the job!

25 Outline
INTRODUCTION AND OVERVIEW
DBs on CONVENTIONAL PROCESSORS
QUERY co-processing: NETWORK PROCESSORS
QUERY co-processing: GRAPHICS PROCESSORS: Graphics processor overview; Mapping computation to GPUs; Database and data mining applications
CONCLUSIONS AND FUTURE DIRECTIONS

Query Processing Architectures
[Diagram: conventional: CPU (3 GHz) with system memory (2 GB); using GPUs: the CPU additionally drives, over an AGP or PCI-E bus (4 GB/s), one or two GPUs (500 MHz) with their own video memory (512 MB each)]

The GPU pipeline
The CPU sends a stream of vertices; the vertex processing engines emit transformed vertices; the setup engine converts them into setup commands and state and draws the stream of triangles; the pixel processing engines decide visibility via the alpha, stencil, and depth tests and return the stream of visible pixels, plus a count of visible pixels, to the CPU
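
A minimal sketch of how that visible-pixel count becomes a query primitive (OpenGL 1.5+ occlusion queries assumed available in the headers/loader; draw_screen_quad_at_depth is an assumed helper, and the attribute values are assumed pre-loaded into the depth buffer): the depth test evaluates the predicate and only a count comes back to the CPU.

    #include <GL/gl.h>

    void draw_screen_quad_at_depth(float depth);   // assumed helper, one fragment per tuple

    GLuint count_tuples_less_than(float constant) {
        GLuint query, passed = 0;
        glGenQueries(1, &query);
        glEnable(GL_DEPTH_TEST);
        glDepthFunc(GL_GREATER);                   // fragment passes iff constant > stored attr
        glDepthMask(GL_FALSE);                     // count only; leave the depth buffer intact
        glBeginQuery(GL_SAMPLES_PASSED, query);
        draw_screen_quad_at_depth(constant);
        glEndQuery(GL_SAMPLES_PASSED);
        glGetQueryObjectuiv(query, GL_QUERY_RESULT, &passed);  // # tuples with attr < constant
        glDeleteQueries(1, &query);
        return passed;
    }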

26 Data Parallelism (G70)
[Diagram: the same GPU pipeline, annotated: the vertex processing engines, setup engine, and pixel processing engines all operate on 32-bit floating-point data]

CPU vs. GPU
Pentium EE (dual core): 230M transistors, 2 x 1MB cache, price: $1,040
GeForce 7800 GTX: 430 MHz, 302M transistors, 512MB onboard memory, price: $450

27 CPU vs. GPU (Henry Moreton: NVIDIA, Aug. 2005)
[Table: Pentium EE (PEE) vs. GeForce 7800 GTX and the GPU/CPU ratio, over graphics GFLOPs, shader GFLOPs, die area (mm2, absolute and normalized), transistors (M), power (W), GFLOPS/mm2, GFLOPS/transistor, and GFLOPS/W]

Exploiting Technology Moving Faster than Moore's Law
[Chart: GPU growth rate vs. CPU growth rate over time]

Graphics Processors
Optimized for graphics applications; dedicated memory interface; offloads the CPU; provides peak I/O performance

28 GPGPU: GPU-based Algorithms
Database computations; data streaming; scientific computations; collision and proximity computations; motion planning and navigation; physically-based modeling
More at

Graphics Pipeline (courtesy: David Kirk, Chief Scientist, NVIDIA)
vertex: programmable vertex processing (fp32)
setup/rasterizer: polygon setup, culling, rasterization
pixel: programmable per-pixel math (fp32)
texture: per-pixel texture, fp16 blending
image: Z-buffer, fp16 blending, anti-aliasing (MRT)
memory

NON-Graphics Pipeline (the same stages viewed as a compute engine)
data: programmable MIMD processing (fp32)
lists: SIMD rasterization
data: programmable SIMD processing (fp32)
data fetch: fp16 blending
predicated write: fp16 blend, multiple output
memory
