Query co-processing on Commodity Processors
1  Query co-processing on Commodity Processors
Anastassia Ailamaki, Carnegie Mellon University
Naga K. Govindaraju and Dinesh Manocha, University of North Carolina at Chapel Hill

Processor performance over time
[Figure: Intel SPECint2000 performance by year of introduction for Alpha, Sparc, Mips, HP PA, PowerPC, and AMD processors; eras labeled Scalar, Superscalar, Out-of-order, SMT]
- Technology shift towards multi-cores (multiple processors on the same chip)

Focus of this tutorial: DB workload execution on a modern computer
[Figure: processor BUSY vs. IDLE fractions of execution time (0-100%) for an ideal workload, sequential scan, index scan, DSS, and OLTP]
- DBMSs can run MUCH faster if they use available hardware efficiently
2  Detailed Seminar Outline
- INTRODUCTION AND OVERVIEW
  - How computer architecture trends affect database workload behavior
  - CPUs, NPUs, and GPUs: opportunity for architectural study
- DBs on CONVENTIONAL PROCESSORS
  - Query processing: time breakdowns, bottlenecks, and current directions
  - Architecture-conscious data management: limitations and opportunities
- QUERY co-processing: NETWORK PROCESSORS
  - TLP and network processors
  - Programming model
  - Methodology & results
- QUERY co-processing: GRAPHICS PROCESSORS
  - Graphics processor overview
  - Mapping computation to GPUs
  - Database and data mining applications
- CONCLUSIONS AND FUTURE DIRECTIONS

Outline — now: INTRODUCTION AND OVERVIEW

Processor speed
- Scale the number of transistors, add innovative microarchitecture
- Moore's Law holds (despite technological hurdles!): processor speed doubles every 18 months
3  Memory speed and capacity
- Memory capacity increases exponentially; DRAM fabrication primarily targets density
- Memory speed increases only linearly
[Figure: DRAM size by year of introduction, 64KB through 4GB; DRAM speed trends (access and cycle times, RAS/CAS latencies in ns) for 64Kbit through 64Mbit parts]
- Result: larger, but not much faster, memories

The Memory Wall
[Figure: processor cycles per instruction vs. cycles per access to DRAM, VAX/1980 to PPro]
- A trip to memory costs thousands of instructions!

Memory hierarchies
- Caches trade off capacity for speed
- Exploit instruction/data locality
- Demand fetch / wait for data
- Typical hierarchy: L1 (64K, ~1 cycle), L2 (2M, ~10 cycles), L3 (32M, ~100 cycles), memory (GBs to 1TB, ~1000 cycles)
- [ADH99]: running the top 4 database systems yields at most 50% CPU utilization
- But wait a minute: isn't I/O the bottleneck???
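The capacity-for-speed and locality points above can be made concrete with a toy direct-mapped cache model (a sketch with invented sizes, not a model of any real CPU): a sequential scan reuses every fetched line, while random accesses miss almost every time.

```python
import random

LINE = 64       # bytes per cache line (assumed)
LINES = 1024    # 1024 lines x 64B = a toy 64KB "L1"

def hit_rate(addresses):
    tags = [None] * LINES            # direct-mapped: one tag per slot
    hits = 0
    for addr in addresses:
        line = addr // LINE
        slot = line % LINES
        if tags[slot] == line:
            hits += 1
        else:
            tags[slot] = line        # fill the slot on a miss
    return hits / len(addresses)

seq = list(range(0, 1_000_000, 8))   # sequential 8-byte reads
rnd = [random.randrange(0, 100_000_000, 8) for _ in range(len(seq))]
print(hit_rate(seq))   # 0.875: 7 of every 8 reads hit the just-fetched line
print(hit_rate(rnd))   # close to 0: nearly every read misses
```

On real hardware the same contrast shows up as the sequential-scan vs. random-access stall differences in the time breakdowns later in the tutorial.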
4  Modern storage managers
- Several decades of work to hide I/O:
  - Asynchronous I/O + prefetch & post-write: overlap I/O latency with useful computation
  - Parallel data access: partition data on modern disk arrays [PAT88]
  - Smart data placement / clustering: improve data locality, maximize parallelism, exploit hardware characteristics
- Larger main memories fit more data: 1MB in the 80s, 10GB today, TBs coming soon
- DB storage managers efficiently hide I/O latencies

Why should we (databasers) care?
[Figure: cycles per instruction for the theoretical minimum, desktop/engineering (SPECInt), DB decision support (TPC-H), and DB online transaction processing (TPC-C)]
- Database workloads under-utilize the hardware

Near-future multiprocessors
- Typical platforms: (1) chips with multiple cores, (2) servers with multiple chips, (3) memory shared across processors
- Memory accesses traverse multiple hierarchies with large, non-uniform latencies
- The programmer/software must hide or tolerate latency
5  Chip Multiprocessors are Here!
- AMD Opteron, IBM Power5, Intel Yonah
- 2 cores now; soon 4, 8, 16, or 32
- Multiple threads per core
- How do we best use them?

Towards coarse-grain parallelism
- Chip multiprocessors (CMPs): two or more complete processors on a single chip; every functional unit of a processor is duplicated
- Simultaneous multithreading (SMT): multiple contexts stored in different register sets; functional units multiplexed between threads; instructions of different contexts executed simultaneously

A chip multiprocessor
[Figure: two chips, each with two CPUs; per-CPU L1 instruction and data caches, a shared on-chip L2 cache per chip, and a shared L3 cache in front of main memory]
6  Chip Multi-Processors (CMP)
- Example: IBM Power4, Power5 — two cores, shared L2
- Highly variable memory latency
- Speedup: OLTP 3x, DSS 2.3x on Piranha [BGM00]

Simultaneous Multithreading (SMT)
- Implements threads in a superscalar processor; keeps hardware state for multiple threads
- E.g. Intel Pentium 4 (SMT); IBM Power5 (SMT & CMP: 2 cores x 2 threads)
- Speedup: OLTP 3x, DSS 0.5x (simulated) [LBE98]

Multithreaded processors
- Simultaneous multithreading provides several program counters and registers (usually several register sets) on chip
- Fast context switching to another thread of control
[Figure: four thread contexts, each with its own register set, program counter, and PSR]
- CMP + SMT = an immense parallelism opportunity
7  Outline — now: DBs on CONVENTIONAL PROCESSORS (query processing: time breakdowns and bottlenecks; data placement; hiding latencies; query processing algorithms and instruction cache misses; chip multiprocessor DB architectures)

DB execution time breakdown [ADH99,BGB98,BGN00,KPH98]
- PII Xeon, NT 4.0; four DBMSs: A, B, C, D
[Figure: execution time split into computation, memory stalls, branch mispredictions, and resource stalls for sequential scan (TPC-D) and index scan (TPC-C)]
- At least 50% of cycles are spent on stalls
- Memory is the major bottleneck
- Branch mispredictions increase cache misses!

DSS/OLTP basics: memory [ADH99,ADH01]
- PII Xeon running NT 4.0, measured with performance counters; four commercial database systems: A, B, C, D
[Figure: memory stall time broken into L1 data, L2 data, L1 instruction, and L2 instruction stalls for table scan, clustered index scan, non-clustered index scan, and join (no index), per DBMS]
- Bottlenecks: data misses in L2, instruction misses in L1
- Random access (OLTP): L1I-bound
8  Why Not Increase L1I Size? [HA04]
- Problem: a larger cache is typically a slower cache
- Not a big problem for L2, but the L1I is in the critical execution path: a slower L1I means a slower clock
[Figure: cache size trends by year of introduction — maximum on-chip L2/L3 cache grows toward 10MB while the L1-I cache stays near 10-100KB]
- L1I size is stable; what is the effect of growing L2 sizes on performance?

Increasing L2 cache size [BGB98,KPH98]
- DSS: performance improves as the L2 cache grows
- Not as clear a win for OLTP on multiprocessors:
  - Reducing cache size causes more capacity/conflict misses
  - Increasing cache size causes more coherence misses
[Figure: % of L2 cache misses to dirty data in another processor's cache rises with cache size (256KB, 512KB, 1MB) and processor count (1P, 2P, 4P)]
- A larger L2 is a trade-off for OLTP

Summary: time breakdowns
- Database workloads: more than 50% stalls, mostly due to memory delays
- Stalls cannot always be reduced by increasing cache size
- Crucial bottlenecks: data accesses to the L2 cache (esp. DSS) and instruction accesses to the L1 cache (esp. OLTP)
- Direction 1: eliminate unnecessary misses
- Direction 2: hide the latency of cold misses
9  Outline — now: Data Placement

Classic data layout on disk pages: NSM (N-ary Storage Model, or slotted pages)
- Records stored sequentially; all attributes of a record stored together
[Figure: relation R(EID, Name, Age) in a slotted page — page header, then whole records (1237 Jane 30, 4322 John 45, 1563 Jim 20, 7658 Susan 52), supporting both full-record and partial-record access]

NSM in the memory hierarchy
- Query: select name from R where age > 50
[Figure: the page travels from disk to main memory to the CPU cache; cache blocks contain whole records even though only age and name are needed]
- Optimized for full-record access; slows down partial-record access at all levels
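The slotted-page layout described above can be sketched in a few lines of Python. The byte layout here (a 4-byte record-count header, fixed 16-byte records growing from the front, 2-byte slot entries growing from the back) is invented for illustration; real storage managers differ in the details.

```python
import struct

PAGE_SIZE = 8192

def make_page(records):
    """records: list of (eid:int, name:str up to 8 chars, age:int)."""
    page = bytearray(PAGE_SIZE)
    slots, offset = [], 4                      # 4-byte header: record count
    for eid, name, age in records:
        # NSM: all attributes of a record stored together
        rec = struct.pack("<i8si", eid, name.ljust(8)[:8].encode(), age)
        page[offset:offset + len(rec)] = rec
        slots.append(offset)
        offset += len(rec)
    struct.pack_into("<i", page, 0, len(records))
    for i, off in enumerate(slots):            # slot array grows from the end
        struct.pack_into("<H", page, PAGE_SIZE - 2 * (i + 1), off)
    return page

def get_record(page, n):
    (off,) = struct.unpack_from("<H", page, PAGE_SIZE - 2 * (n + 1))
    eid, name, age = struct.unpack_from("<i8si", page, off)
    return eid, name.decode().rstrip(), age

page = make_page([(1237, "Jane", 30), (4322, "John", 45)])
print(get_record(page, 1))   # (4322, 'John', 45)
```

Note that fetching just the age of record n still decodes (and, on real hardware, drags into the cache) the whole record — exactly the partial-record-access cost the slide points out.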
10  Partition Attributes Across (PAX) [ADH01]
- Idea: partition data within the page for spatial locality
[Figure: an NSM page with interleaved whole records vs. a PAX page where each attribute is grouped in its own minipage]

PAX in the memory hierarchy [ADH01]
- Query: select name from R where age > 50
[Figure: only the needed minipages travel from main memory to the CPU cache; the disk page is unchanged, and the extra partitioning work is cheap]
- Optimizes CPU cache-to-memory communication
- Retains NSM's I/O behavior (page contents do not change)

PAX performance results (Shore)
- PII Xeon, Windows NT4; 16KB L1-I&D, 512KB L2, 512MB RAM
- Query: select avg(a_i) from R where a_j >= Lo and a_j <= Hi
[Figure: data stall cycles per record (L1/L2 data stalls) and execution time breakdown (computation, memory, branch misprediction, resource) for NSM vs. PAX page layouts]
- Validation with microbenchmarks: 70% less data stall time (only compulsory misses left); better use of the processor's superscalar capability
- TPC-H performance: 15%-2x speedup in queries
- Dynamic PAX: Data Morphing [HP03]
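A minimal sketch of the PAX idea, with minipages modeled as plain Python lists (names invented): the predicate touches only the contiguous age minipage, and the name minipage is consulted only for qualifying positions.

```python
def pax_page(records):
    """Group each attribute of (eid, name, age) records into its own
    'minipage' — contiguous per-attribute storage within the page."""
    return {
        "eid":  [r[0] for r in records],
        "name": [r[1] for r in records],
        "age":  [r[2] for r in records],
    }

def select_names_where_age_gt(page, lo):
    # scan only the age minipage; fetch names just for qualifying rows
    return [page["name"][i] for i, a in enumerate(page["age"]) if a > lo]

page = pax_page([(1237, "Jane", 30), (4322, "John", 45),
                 (1563, "Jim", 20), (7658, "Susan", 52)])
print(select_names_where_age_gt(page, 50))   # ['Susan']
```

Because each minipage is contiguous, every cache line fetched during the scan is full of age values — the spatial locality the slide describes — while a row-wise page would fetch whole records to test one attribute.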
11  B+-trees: fewer pointers, greater fanout
Cache-Sensitive B+-Trees (CSB+-Trees) [RR00]
- Lay out the child nodes of each node contiguously
- Eliminate all but one child pointer; with integer keys, this doubles the fanout of nonleaf nodes
[Figure: a B+-tree node with one child pointer per key vs. a CSB+-tree node with a single pointer to a contiguous group of children]
- 35% faster tree lookups; update performance is 30% worse (splits)

Data placement: summary
- Smart data placement increases spatial locality; research targets table (relation) data
- Goal: reduce the number of non-cold cache misses
- Techniques focus on grouping attributes into cache lines for quick access:
  - PAX, Data Morphing
  - Fates: an automatically-tuned DB storage manager
  - CSB+-trees
- Also, Fractured Mirrors [RDS02]: a cache-and-disk optimization with replication

Outline — next: Hiding Latencies
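The CSB+-tree trick can be sketched as follows — a simplified two-level, read-only version, not the [RR00] implementation. Children of a node sit contiguously in one array, so a node keeps its keys plus a single first-child index, and the correct child is found by arithmetic rather than by a stored per-key pointer.

```python
class Node:
    def __init__(self, keys, first_child=None):
        self.keys = keys                  # sorted keys (separators if internal)
        self.first_child = first_child    # index of first child in `nodes`

# contiguous "node group" storage: children of the root are adjacent
nodes = [Node([1, 2]), Node([3, 4]), Node([5, 6])]
root = Node([3, 5], first_child=0)        # separators 3 and 5

def search(root, key):
    node = root
    while node.first_child is not None:
        # child i = first_child + (number of separators <= key):
        # arithmetic replaces the per-key child pointers
        i = sum(1 for k in node.keys if k <= key)
        node = nodes[node.first_child + i]
    return key in node.keys

print(search(root, 4))   # True
print(search(root, 7))   # False
```

Dropping the per-key pointers is what doubles the fanout of nonleaf nodes for integer keys; the cost shows up on splits, where a whole contiguous child group must be reorganized, matching the 30% worse update performance above.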
12  What about cold misses?
- Answer: hide their latency using prefetching
- Prefetching is enabled by non-blocking cache technology and prefetch assembly instructions (SGI R10000, Alpha 21264, Intel Pentium 4), e.g. pref 0(r2)
- Prefetching hides cold cache miss latency, and can be used efficiently even in pointer-chasing lookups!

Prefetching B+-Trees (pB+-Trees) [CGM01]
- Idea: larger nodes — node size = multiple cache lines (e.g. 8 lines); later corroborated by [HP03a]
- Prefetch all lines of a node before searching it
- The cost to access a node increases only slightly; trees become much shallower; no structural changes required
- >2x better search AND update performance
- Approach complementary to CSB+-Trees!

Prefetching B+-Trees: faster range scans [CGM01]
- Leaf-parent nodes contain the addresses of all their leaves
- Link the leaf-parent nodes together and use this structure to prefetch leaf nodes
- pB+-Trees: 8x speedup over B+-Trees
13  Fractal Prefetching B+-Trees (fpB+-Trees) [CGM02]
- What if the B+-tree does not fit in memory?
- Idea: combine memory and disk trees — embed cache-optimized trees inside disk tree nodes
- fpB+-Trees optimize both cache AND disk; key compression increases fanout [BMR01]
- Compared to disk-based B+-Trees: 80% faster in-memory searches with similar disk performance

Bulk lookups [ZR03a]
- Optimize data cache performance, like computation regrouping [PMH02]
- Idea: increase temporal locality by delaying (buffering) node probes until a group is formed
- Example: NLJ probe stream (r1, 10), (r2, 80), (r3, 15) — (r1, 10) is buffered before B is accessed; (r2, 80) is buffered before C is accessed; when B is accessed, the buffered entries are divided among its children
- 3x speedup with enough concurrency

Hiding latencies: summary
- Optimize B+-tree pointer-chasing cache behavior:
  - Reduce node size to a few cache lines
  - Reduce pointers for larger fanout (CSB+)
  - Next-pointers at the lowest non-leaf level for easy prefetching (pB+)
  - Simultaneously optimize cache and disk (fpB+)
  - Bulk searches: buffer index accesses
- Additional work:
  - CR-tree: a cache-conscious R-tree [KCK01] that compresses MBR keys
  - Cache-oblivious B-trees [BDF00]: an optimal bound on memory transfers regardless of the number of memory levels, block size, or level speed
  - Survey of techniques for B-tree cache performance [GL01]
- Lots more to be done in this area — consider existing, heretofore-folkloric knowledge: key normalization/compression, alignment, separating keys from pointers
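The buffered-probe example above can be sketched as a batch-wise walk (a toy two-level tree with invented keys, not the [ZR03a] implementation): probes accumulate in a buffer per node, each buffer is divided among the node's children, and every node is therefore touched once per batch instead of once per probe.

```python
def bulk_probe(root, probes):
    """probes: list of (rid, key). Returns (node visits, results)."""
    visits, results = 0, {}
    frontier = [(root, probes)]          # (node, buffered probes for it)
    while frontier:
        node, buf = frontier.pop()
        visits += 1                      # node accessed once for the batch
        if "leaf" in node:
            for rid, key in buf:
                results[rid] = key in node["leaf"]
        else:
            groups = {}                  # divide buffer entries among children
            for rid, key in buf:
                child = 0 if key < node["sep"] else 1
                groups.setdefault(child, []).append((rid, key))
            for c, g in groups.items():
                frontier.append((node["children"][c], g))
    return visits, results

tree = {"sep": 50, "children": [{"leaf": {10, 15}}, {"leaf": {80}}]}
visits, res = bulk_probe(tree, [("r1", 10), ("r2", 80), ("r3", 15)])
print(visits)   # 3: root + each leaf touched once (per-probe walks
                # would touch the root three times)
print(res)      # {'r1': True, 'r2': True, 'r3': True}
```

The temporal-locality win is that a node's cache lines are reused by every buffered probe before the walk moves on, instead of being evicted between independent probes.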
14  Outline — now: Query processing algorithms and instruction cache misses

Query processing algorithms
- Idea: adapt query processing algorithms to caches
- Related work includes improving data cache performance (sorting, joins) and improving instruction cache performance (DSS applications)

Query processing: directions
- AlphaSort: quicksort with key-prefix pointers
- MonetDB: a main-memory DBMS using aggressive DSM; optimizes partitioning with hierarchical radix-clustering and post-projection with radix-declustering, among many other optimizations
- Traditional hash joins: aggressive prefetching efficiently hides data cache misses, with robust performance even under future, longer latencies; Inspector Joins
- DSS instruction misses: group computation (a new operator)
- B-tree concurrency control: reduce readers' latching
15  Instruction-related stalls
- 25-40% of execution time [KPH98, HA04]
- The instruction cache matters because it is in the critical execution path: I-cache delays are impossible to overlap
[Figure: execution pipeline fed directly by the L1 I-cache, with the L1 D-cache and L2 cache behind it]

Call graph prefetching for DB apps [APD03]
- Goal: improve DSS I-cache performance
- Idea: predict the next function call using a small cache
- Example: create_rec always calls find_page, lock_page, update_page, and unlock_page in the same order
- Experiments: Shore on the SimpleScalar simulator, running the Wisconsin Benchmark
- Beneficial for predictable DSS call streams

DB operators using SIMD [ZR02]
- SIMD: Single Instruction, Multiple Data; in modern CPUs it targets multimedia apps
- Example: Pentium 4 — a 128-bit SIMD register holds four 32-bit values, and one operation computes X3 OP Y3, X2 OP Y2, X1 OP Y1, X0 OP Y0 at once
- Assume data stored columnwise as a contiguous array of fixed-length numeric values (e.g., PAX)
- Scan example: the original scan code tests if x[n] > 10 then result[pos++] = x[n]; the SIMD first phase instead compares x[n+3], x[n+2], x[n+1], x[n] against the constant in parallel, producing a bitmap vector with 4 comparison results
- Superlinear speedup relative to the degree of parallelism
- The code must be rewritten to use SIMD
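The SIMD scan rewrite can be emulated in plain Python to show its shape (the four-wide "compare" below is simulated, not real SSE; an actual implementation uses 128-bit registers and avoids per-element branches): each step compares four column values against the constant and emits a 4-bit mask into the bitmap vector.

```python
def simd_scan(x, lo):
    """Emulated SIMD scan: one 4-wide compare per step, then a second
    phase that turns the bitmap vector into qualifying positions."""
    masks = []
    for n in range(0, len(x), 4):
        lane = x[n:n + 4]
        # one "instruction": four comparisons producing a 4-bit mask
        mask = sum(1 << i for i, v in enumerate(lane) if v > lo)
        masks.append(mask)
    # second phase: expand the bitmap vector into result positions
    return [n * 4 + i
            for n, m in enumerate(masks)
            for i in range(4) if (m >> i) & 1]

data = [3, 12, 7, 25, 9, 11, 40, 2]
print(simd_scan(data, 10))   # [1, 3, 5, 6] -> values 12, 25, 11, 40
```

The columnwise, fixed-length layout assumed on the slide (e.g., a PAX minipage) is what makes the four adjacent values x[n]..x[n+3] land in one SIMD load.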
16  STEPS: Cache-Resident OLTP [HA04]
- Optimize OLTP instruction-cache performance by exploiting high transaction concurrency
- Synchronized Transactions through Explicit Processor Scheduling: multiplex concurrent transactions to exploit common code paths
- Before: each thread's code exceeds the I-cache capacity, so threads A and B evict each other's instructions; after: the CPU context-switches at points chosen so each code window fits in the I-cache and is reused by the next thread
- Full-system TPC-C implementation: 65% fewer L1-I misses, 40% speedup

Outline — next: Chip multiprocessor DB architectures

Evolution of hardware design
- Past: HW = CPU + Memory + Disk
- In the 80s, memory was about 1 cycle away from the CPU; today, CPUs run far faster than they can access data
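A toy model of the STEPS scheduling effect (the segment names and the one-segment "I-cache" are invented; this is an illustration, not the [HA04] system): running transactions back to back re-fetches the common code path for every transaction, while switching threads at each code segment fetches it once and reuses it.

```python
def run(schedule):
    """Count instruction fetches under a toy I-cache that holds
    exactly one code segment at a time."""
    cached, loads = None, 0
    for seg in schedule:
        if seg != cached:            # I-cache miss: load the segment
            cached, loads = seg, loads + 1
    return loads

CODE = ["begin", "probe", "update", "commit"]   # common transaction code path
N = 10                                          # concurrent transactions

serial = [s for _ in range(N) for s in CODE]    # run each txn to completion
steps  = [s for s in CODE for _ in range(N)]    # STEPS: switch at each segment

print(run(serial))   # 40: the code path is re-fetched by every transaction
print(run(steps))    # 4: each segment is fetched once, reused by all 10 threads
```

The context-switch points in real STEPS are placed so that the code between switches fits in the L1-I cache, which is what the one-segment cache above caricatures.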
17  CMP, SMT, and memory
- Goals: operation-level parallelism and CPU-memory affinity
- DBMS core design contradicts the above goals

Staged Database Systems [HA03]
- A staged software design allows cohort scheduling of queries to amortize loading time, and suspending work at module boundaries to maintain context
- Break the DBMS into stages; stages act as independent servers; queries exist in the form of packets
- Proposed query scheduling algorithms address the locality/wait-time tradeoffs [HA02]

Staged Database Systems [HA03,HSA05]
[Figure: a query flows IN through connect, parser, and optimizer stages, then through operator stages (ISCAN, FSCAN, AGGR, JOIN, SORT) to OUT/send results; each stage maps onto its own L1/L2 slice over shared memory]
- Optimize instruction/data cache locality
- Naturally enable multi-query processing
- Highly scalable, fault-tolerant, trustworthy
18  NUCA: a memory hierarchy abstraction
[Figure: many cores, each with a private cache, in front of a large shared on-chip cache]

Data movement in the CMP hierarchy
- Stencil computation vs. OLTP (data from Beckmann & Wood, MICRO'04)
- Traditional DBMS: shared information ends up in the middle of the hierarchy
- StagedDB: exposed, explicit data movement

StagedCMP: StagedDB on multicore
- µEngines run independently on cores; a dispatcher routes incoming tasks to cores
- Tradeoff: work sharing at each µengine vs. the load on each µengine
- Potential: better work sharing and load balance on CMPs
19  Summary: Bridging the Gap
- Cache-aware data placement eliminates unnecessary trips to memory and minimizes conflict/capacity misses; Fates decouples the memory layout from the storage layout
- Compulsory (cold) misses can't be avoided, but their latency can be hidden with prefetching — techniques exist for B-trees and hash joins
- Addressing instruction stalls: DSS — call graph prefetching, SIMD, the group operator; OLTP — STEPS, a promising direction for any platform
- Staged Database Systems: a scalable future

Outline — next: QUERY co-processing: NETWORK PROCESSORS (TLP and network processors; programming model; methodology & results)

Modern architecture & DBMS
- Instruction-level parallelism (ILP): out-of-order (OoO) execution windows; cache hierarchies exploiting spatial/temporal locality
- DBMS memory-system characteristics: limited locality (e.g., sequential scan), random access patterns (e.g., hash join), pointer-chasing (e.g., index scan, hash join)
- DBMSs need memory-level parallelism (MLP)
20  Opportunities on NPUs
- Run DB operators on thread-parallel architectures: expose parallel misses to memory and leverage intra-operator parallelism
- Evaluation using network processors: designed for packet processing, with abundant thread-level parallelism (64+ contexts)
- Speedups of 2.0x-2.5x on common operators
- Early insight into multi-core, multi-threaded architectures and DBMS execution

Outline — now: TLP and network processors

TLP in database operators
- Sequential or index scan: threads fetch attributes in parallel
- Hash join: threads probe tuples in parallel
[Figure: threads A and B dividing work over a (Name, ID, Grade) table — e.g. rows (Smith, 0, B) and (Chen, 1, A)]
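The intra-operator TLP idea for the hash-join probe can be sketched with Python threads (toy data; note that Python's GIL means no true parallel speedup here — the point is how the probe stream is divided among contexts, the way an NPU core divides it among its hardware threads).

```python
from concurrent.futures import ThreadPoolExecutor

build = {0: "Smith", 1: "Chen"}           # hash table on the inner relation

def probe(chunk):
    # each context probes independently; on an NPU the memory waits of
    # the contexts on one core overlap with each other
    return [(k, build[k]) for k in chunk if k in build]

outer = [1, 3, 0, 1, 7, 0]                # probe stream of join keys
chunks = [outer[i::4] for i in range(4)]  # deal keys to 4 "thread contexts"
with ThreadPoolExecutor(max_workers=4) as pool:
    matches = [m for part in pool.map(probe, chunks) for m in part]
print(sorted(matches))   # [(0, 'Smith'), (0, 'Smith'), (1, 'Chen'), (1, 'Chen')]
```

Because each probe is independent, many hash-table lookups (each a likely cache miss) can be outstanding at once — the memory-level parallelism the previous slide says DBMSs need.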
21  Network processors
- Intel IXP2400: 8 cores, each with 8 thread contexts
- Dedicated DDR SDRAM (up to 1GB); host CPU attached via a PCI controller; on-chip scratch memory
- < 20W power dissipation

Multi-threaded core
- Simple processing core: 5-stage, single-issue pipeline at 600MHz; 2.5KB local cache; per-context register files
- One active context plus waiting contexts; the core switches contexts at the programmer's discretion

Programming model
- The ISA supports thread switching: wait for a specific hardware signal; 4-cycle, non-preemptive context switch
- Software-managed memory access: separate instructions for DRAM, local, and scratch memories; the programmer controls data access
- No OS or virtual memory — sensible for simple, long-running code
22  Multithreading in action
[Figure: timeline of thread contexts T0-T7 on one core, each cycling through Active / Waiting / Ready states around a DRAM request and its returning data; in aggregate, the DRAM latency is hidden]

Methodology
- IXP2400 on a prototype PCI card with 256MB PC2100 SDRAM, separate from the host CPU
- Host: Pentium 4 Xeon 2.8GHz; 8KB L1D, 512KB L2, 4 pending misses; 3GB PC2100 SDRAM
- Workloads: TPC-H orders and lineitem tables (250MB); sequential scan, hash join

Sequential scan
- Uses a slotted page layout (8KB): page header, records with their attributes, slot array
- Network processor implementation: each page is scanned by the threads of one core, overlapping individual record accesses within the core
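The aggregate timeline above suggests a simple back-of-the-envelope model (illustrative numbers, not IXP2400 measurements): if each context computes for C cycles and then waits L cycles on DRAM, a core with n contexts is busy roughly min(1, n·C/(C+L)) of the time.

```python
def busy_fraction(contexts, compute=4, latency=120):
    # fraction of time the core has a runnable context; the latency is
    # fully hidden once contexts * compute >= compute + latency
    return min(1.0, contexts * compute / (compute + latency))

print(busy_fraction(1))    # ~0.03: one context leaves the core mostly idle
print(busy_fraction(8))    # ~0.26: the 8 contexts of one core overlap 8 misses
print(busy_fraction(31))   # 1.00: enough contexts to hide the full latency
```

With these made-up numbers, full utilization needs 31 contexts — a rough intuition for why the measured scan and join curves that follow keep climbing with threads per core and core count until the DRAM controller saturates.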
23  Sequential scan performance
[Figure: scan throughput vs. threads per core, for 1, 2, 4, and 8 cores]
- Performance is limited by the DRAM controller

Hash join setup
- Models the probe phase: hash on the key attributes
- Pages of the outer relation are assigned to one core; each thread context issues one probe
- Dependent accesses are overlapped within the core

Hash join performance
[Figure: probe throughput vs. threads per core, for 1, 2, 4, and 8 cores]
- The network processor finds MLP across tuples
24  Conclusions
- DBMSs on modern processors need to exploit memory parallelism, which requires thread-level parallelism
- Multi-threaded, multi-core architectures hide memory latency
- Evaluation using network processors: simple hardware, lots of threads, highly programmable; beat a Pentium 4 by 2x-2.5x on DB operators

Implications
- In our work, we found substantial intra-operator parallelism; the network processor is a promising architecture
- Other trends of interest: high-bandwidth peripherals (PCI Express), large DRAM banks on NP-based cards, direct connections from NP cards to SATA disks
- These suggest off-loading work to the NP as a query co-processor

Query co-processing with NPs
- Front-end (CPU): query compilation, results reporting
- Back-end (network processors + disk arrays): query execution, raw data access
- Use the right tool for the job!
25  Outline — now: QUERY co-processing: GRAPHICS PROCESSORS (graphics processor overview; mapping computation to GPUs; database and data mining applications)

Query processing architectures
[Figure: conventional — CPU (3GHz) with system memory (2GB) and AGP memory (512MB); using GPUs — the same, plus GPUs (500MHz) with video memory (512MB) attached over a PCI-E bus (4GB/s)]

GPU pipeline
- The CPU sends a stream of vertices; the GPU draws the stream of triangles
- Vertex processing engines emit a stream of transformed vertices; the setup engine produces setup commands and state; pixel processing engines apply the alpha, stencil, and depth tests
- The GPU reports triangle visibility back to the CPU as a count of visible pixels / a stream of visible pixels
26  Data parallelism (G70)
[Figure: the GPU pipeline again, annotated — the vertex processing engines, setup engine, and pixel processing engines all operate on 32-bit floating point]

CPU vs. GPU
- Pentium EE, dual core: 230M transistors, 2 x 1MB cache; price $1,040
- GeForce 7800 GTX: 430MHz, 302M transistors, 512MB onboard memory; price $450
27  CPU vs. GPU (Henry Moreton, NVIDIA, Aug. 2005)
[Table: Pentium EE vs. GeForce 7800 GTX vs. the GPU/CPU ratio on graphics GFLOPs, shader GFLOPs, die area (mm² and normalized), transistors (M), power (W), GFLOPS/mm², GFLOPS/transistor, and GFLOPS/W; the numeric values are not recoverable from this transcription]

Exploiting technology moving faster than Moore's Law
[Figure: GPU growth rate vs. CPU growth rate]

Graphics processors
- Optimized for graphics applications, with a dedicated memory interface
- Offload the CPU and provide peak I/O performance
28  GPGPU: GPU-based algorithms
- Database computations
- Data streaming
- Scientific computations
- Collision and proximity computations
- Motion planning and navigation
- Physically-based modeling
- More at

Graphics pipeline (courtesy: David Kirk, Chief Scientist, NVIDIA)
- vertex: programmable vertex processing (fp32) -> setup: polygon setup, culling, rasterization -> rasterizer -> pixel: programmable per-pixel math (fp32) -> texture: per-pixel texture, fp16 blending -> image: Z-buffer, fp16 blending, anti-aliasing (MRT) -> memory

The same pipeline, non-graphics view
- data setup -> programmable MIMD processing (fp32) -> SIMD rasterization -> programmable SIMD processing (fp32) -> data fetch, fp16 blending -> predicated write, fp16 blend, multiple output -> memory
More informationQuery co-processing on Commodity Processors. Processor Performance over Time *
Query co-processing on Commodity Processors Anastassia Ailamaki Carnegie Mellon University Naga K. Govindaraju Dinesh Manocha University of North Carolina at Chapel Hill Processor Performance over Time
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationData Page Layouts for Relational Databases on Deep Memory Hierarchies
Data Page Layouts for Relational Databases on Deep Memory Hierarchies Anastassia Ailamaki David J. DeWitt Mark D. Hill Carnegie Mellon University natassa@cmu.edu University of Wisconsin - Madison {dewitt,
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationParallel Systems I The GPU architecture. Jan Lemeire
Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationLecture 1: Introduction
Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline
More informationChallenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008
Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated
More informationMainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation
Mainstream Computer System Components CPU Core 2 GHz - 3.0 GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation One core or multi-core (2-4) per chip Multiple FP, integer
More informationChapter 5. Large and Fast: Exploiting Memory Hierarchy
Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per
More informationMainstream Computer System Components
Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600) 200 MHz (internal base chip clock) 8-way interleaved
More informationMemory Hierarchy. Slides contents from:
Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationEECS 470. Lecture 18. Simultaneous Multithreading. Fall 2018 Jon Beaumont
Lecture 18 Simultaneous Multithreading Fall 2018 Jon Beaumont http://www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi,
More informationFundamentals of Quantitative Design and Analysis
Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature
More informationMain-Memory Databases 1 / 25
1 / 25 Motivation Hardware trends Huge main memory capacity with complex access characteristics (Caches, NUMA) Many-core CPUs SIMD support in CPUs New CPU features (HTM) Also: Graphic cards, FPGAs, low
More informationMultilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823
More informationComputer Architecture Spring 2016
Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationIntel released new technology call P6P
P6 and IA-64 8086 released on 1978 Pentium release on 1993 8086 has upgrade by Pipeline, Super scalar, Clock frequency, Cache and so on But 8086 has limit, Hard to improve efficiency Intel released new
More informationCS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it
Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More informationMemory Hierarchies && The New Bottleneck == Cache Conscious Data Access. Martin Grund
Memory Hierarchies && The New Bottleneck == Cache Conscious Data Access Martin Grund Agenda Key Question: What is the memory hierarchy and how to exploit it? What to take home How is computer memory organized.
More informationCache-Aware Database Systems Internals Chapter 7
Cache-Aware Database Systems Internals Chapter 7 1 Data Placement in RDBMSs A careful analysis of query processing operators and data placement schemes in RDBMS reveals a paradox: Workloads perform sequential
More informationMultithreading: Exploiting Thread-Level Parallelism within a Processor
Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced
More informationI/O Buffering and Streaming
I/O Buffering and Streaming I/O Buffering and Caching I/O accesses are reads or writes (e.g., to files) Application access is arbitary (offset, len) Convert accesses to read/write of fixed-size blocks
More informationIntroduction to GPU computing
Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU
More informationColumn Stores - The solution to TB disk drives? David J. DeWitt Computer Sciences Dept. University of Wisconsin
Column Stores - The solution to TB disk drives? David J. DeWitt Computer Sciences Dept. University of Wisconsin Problem Statement TB disks are coming! Superwide, frequently sparse tables are common DB
More informationCache-Aware Database Systems Internals. Chapter 7
Cache-Aware Database Systems Internals Chapter 7 Data Placement in RDBMSs A careful analysis of query processing operators and data placement schemes in RDBMS reveals a paradox: Workloads perform sequential
More information45-year CPU Evolution: 1 Law -2 Equations
4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationA Data-Parallel Genealogy: The GPU Family Tree. John Owens University of California, Davis
A Data-Parallel Genealogy: The GPU Family Tree John Owens University of California, Davis Outline Moore s Law brings opportunity Gains in performance and capabilities. What has 20+ years of development
More informationWindowing System on a 3D Pipeline. February 2005
Windowing System on a 3D Pipeline February 2005 Agenda 1.Overview of the 3D pipeline 2.NVIDIA software overview 3.Strengths and challenges with using the 3D pipeline GeForce 6800 220M Transistors April
More informationThis Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 12: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital Circuits
More information4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.
Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design Edited by Mansour Al Zuair 1 Introduction Programmers want unlimited amounts of memory with low latency Fast
More informationLecture 25: Board Notes: Threads and GPUs
Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel
More informationProcessor (IV) - advanced ILP. Hwansoo Han
Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle
More informationLecture 14: Multithreading
CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw
More informationComputer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:
More informationSOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS
SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CSAIL IAP MEETING MAY 21, 2013 Research Agenda Lack of technology progress Moore s Law still alive Power
More informationMulti-core Architectures. Dr. Yingwu Zhu
Multi-core Architectures Dr. Yingwu Zhu Outline Parallel computing? Multi-core architectures Memory hierarchy Vs. SMT Cache coherence What is parallel computing? Using multiple processors in parallel to
More informationMo Money, No Problems: Caches #2...
Mo Money, No Problems: Caches #2... 1 Reminder: Cache Terms... Cache: A small and fast memory used to increase the performance of accessing a big and slow memory Uses temporal locality: The tendency to
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationECE 574 Cluster Computing Lecture 16
ECE 574 Cluster Computing Lecture 16 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 26 March 2019 Announcements HW#7 posted HW#6 and HW#5 returned Don t forget project topics
More informationParallel Processing. Computer Architecture. Computer Architecture. Outline. Multiple Processor Organization
Computer Architecture Computer Architecture Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr nizamettinaydin@gmail.com Parallel Processing http://www.yildiz.edu.tr/~naydin 1 2 Outline Multiple Processor
More informationGPUs and GPGPUs. Greg Blanton John T. Lubia
GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware
More informationComputer Architecture. Memory Hierarchy. Lynn Choi Korea University
Computer Architecture Memory Hierarchy Lynn Choi Korea University Memory Hierarchy Motivated by Principles of Locality Speed vs. Size vs. Cost tradeoff Locality principle Temporal Locality: reference to
More informationMaster Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading
More informationUnit 11: Putting it All Together: Anatomy of the XBox 360 Game Console
Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture
More informationKeywords and Review Questions
Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain
More informationChapter 5B. Large and Fast: Exploiting Memory Hierarchy
Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationExploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.
More informationThe NVIDIA GeForce 8800 GPU
The NVIDIA GeForce 8800 GPU August 2007 Erik Lindholm / Stuart Oberman Outline GeForce 8800 Architecture Overview Streaming Processor Array Streaming Multiprocessor Texture ROP: Raster Operation Pipeline
More informationCSE 544 Principles of Database Management Systems
CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 5 - DBMS Architecture and Indexing 1 Announcements HW1 is due next Thursday How is it going? Projects: Proposals are due
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More information