Optimizing Graphics Drivers with Streaming SIMD Extensions. Copyright 1999, Intel Corporation. All rights reserved.
1 Optimizing Graphics Drivers with Streaming SIMD Extensions
2 Agenda
- Features of Pentium III processor
- Graphics Stack Overview
- Driver Architectures
- Streaming SIMD Impact on D3D
- Streaming SIMD Impact on OpenGL*
- Summary
- Other Tips
- Resources
*Third-party brands and names are the property of their respective owners.
3 Features of Streaming SIMD Extensions
- SIMD FP instructions
  - SIMD FP for basic math & square root
  - Fast approximations for reciprocal and reciprocal square root
- Cache-ability instructions/features
  - Pre-fetching
  - Non-temporal storage cache
  - Streaming stores
- New integer SIMD instructions
4 Graphics Stack
[Diagram: two driver stacks side by side. DirectX*: D3D graphics application; D3D "Pipeline" (Retained Mode API, Immediate Mode API, Transformation and Lighting); HAL/HEL; hardware. OpenGL*: graphics application; OpenGL ICD; hardware.]
5 Online vs Offline Drivers
- Traditional "Offline" Driver has at least two passes
  [Diagram: D3D Pipeline; vertices in D3D "TL Format"; buffers in HAL/HEL; vertices in proprietary format; buffers in AGP]
- "Online" Driver uses a single pass for each "batch"
  [Diagram: HAL/HEL; D3D Pipeline; vertices in D3D "TL Format"; buffers in AGP]
- This example uses DirectX*, but Online Driver is currently more applicable to OpenGL* and other pipelines
6 On-line Driver Advantages
- Better bus utilization
  - Interleaved read/process/store
  - More CPU/Graphics concurrency
  - Higher TPS (unless "setup bound")
- Greater efficiency
  - Eliminates a copy operation
- Less code to maintain
  - HAL becomes smaller and simpler
7 On-line Driver Caveats
- 3D chip must support vertex format
  - DX: TL_Vertex, flexible vertex formats
- DX 6.1: Clipping done after pipeline
  - reads vertices back from memory
  - AGP memory is fast to write to, but slow to read from (not cached)
- OpenGL*: Clip without AGP read
  - re-transform vertices of clipped triangles
  - save vertices in AGP + cached temp buf
- Don't use On-line Driver for DX 6.1
8 Streaming SIMD Impact on D3D
- Optimizations already in DX6.1
  - Geometry and lighting pipeline
- IHVs can only optimize in HAL/HEL
  - 3D data setup (e.g., color conversions)
  - Moving data
  - Texture wrap
  - HEL for missing HW feature emulation
- A thick HEL can greatly benefit
9 Streaming SIMD Impact on OpenGL*
- Optimize for CPU and graphics chip
  - Use single-pass, small batch pipeline
  - Run a batch of vertices through all steps
  - Better SIMD, cache usage, less overhead
- Intel GE OpenGL add-on for Pentium II and III, Microsoft* or SGI
- You can optimize ICD for CPU and GC
10 ICD Streaming SIMD Opportunities
- Transformation and lighting
- Triangle data setup
- Back-face culling
- Clipcode calculations
- Clipping
- Color conversions
- Moving data
11 Triangle Setup
- Division via RCPPS and/or Newton-Raphson
  - Reciprocal area
  - Reciprocal Z
- Float to fixed-point conversion
  - e.g. XYZ or texture coordinates
- Other XY modifications
  - e.g. add 0.5 to all X and Y coordinates
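The RCPPS-plus-Newton-Raphson division on this slide can be sketched as below. This is a minimal illustration, not the presenters' code: the names `recip_nr` and `recip_array` are hypothetical, and the single refinement step x1 = x0 * (2 - a * x0) is the standard Newton-Raphson iteration for a reciprocal, applied to the roughly 12-bit RCPPS estimate.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Approximate 1/a for four floats at once: RCPPS gives a coarse estimate x0,
   then one Newton-Raphson step x1 = x0 * (2 - a * x0) refines it. */
static __m128 recip_nr(__m128 a)
{
    __m128 x0  = _mm_rcp_ps(a);
    __m128 two = _mm_set1_ps(2.0f);
    return _mm_mul_ps(x0, _mm_sub_ps(two, _mm_mul_ps(a, x0)));
}

/* Hypothetical helper: reciprocals of n values, n a multiple of 4. */
static void recip_array(const float *in, float *out, int n)
{
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4)
        _mm_storeu_ps(out + i, recip_nr(_mm_loadu_ps(in + i)));
}
```

Note that an input of zero produces NaN after the refinement step (infinity times zero), so zero-area triangles would need to be rejected before the reciprocal area is computed.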
12 Back-face Culling
- Some 3D hardware can cull, but HAL usually has time to do it
- Cross product of two triangle edges
  - Sign indicates front/back facing
  - Result is measure of area
- Often can discard zero-area and tiny tris
  - Some apps often generate zero-area tris!
  - E.g. discard if area is under 1/16th pixel area
  - In theory could leave pinholes - rarely seen
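A scalar sketch of the edge cross-product test follows. The `Vec2` type, the function names, and the convention that counter-clockwise triangles have positive area (which holds for a y-up coordinate system and flips for y-down screen space) are assumptions made for illustration.

```c
#include <assert.h>

typedef struct { float x, y; } Vec2;   /* screen-space position, in pixels */

/* Twice the signed area: the z component of the cross product of
   edges (v1 - v0) and (v2 - v0). */
static float signed_area2(Vec2 v0, Vec2 v1, Vec2 v2)
{
    return (v1.x - v0.x) * (v2.y - v0.y) - (v2.x - v0.x) * (v1.y - v0.y);
}

/* Returns 1 if the triangle should be drawn, 0 if it is back-facing or
   smaller than min_area (in pixels, e.g. 1/16). */
static int keep_triangle(Vec2 v0, Vec2 v1, Vec2 v2, int front_is_ccw, float min_area)
{
    float a2   = signed_area2(v0, v1, v2);
    float area = 0.5f * (a2 < 0.0f ? -a2 : a2);
    assert(min_area >= 0.0f);
    if (area < min_area)
        return 0;                       /* zero-area or tiny triangle */
    return front_is_ccw ? (a2 > 0.0f) : (a2 < 0.0f);
}
```

The same value serves both purposes named on the slide: its sign gives the facing and its magnitude gives the area used for the small-triangle discard.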
13 Copying on Pentium III Processor
- 450 MHz system, BX chipset
- Copy TL_VERTEX - 32 bytes
  - Memory to memory with MOVNTPS 50 CPU clocks
  - L1 cache to memory using MOVNTPS CPU clocks
- Numbers should approximate writing to USWC
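A copy of 32-byte vertices with MOVNTPS might look like the sketch below, using the `_mm_stream_ps` intrinsic. The `TLVertex` field layout is an assumption modelled on a 32-byte transformed-and-lit vertex, and `stream_copy_vertices` is a hypothetical name; the only hard requirement shown is that the streaming-store destination is 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* 32-byte vertex; this field layout is an assumption for illustration. */
typedef struct {
    float    x, y, z, rhw;
    uint32_t color, specular;
    float    tu, tv;
} TLVertex;

/* Copies count vertices using non-temporal (streaming) stores so the
   destination does not displace useful data from the caches. */
static void stream_copy_vertices(TLVertex *dst, const TLVertex *src, size_t count)
{
    assert(((uintptr_t)dst & 15) == 0);   /* MOVNTPS needs a 16-byte aligned target */
    for (size_t i = 0; i < count; ++i) {
        const float *s = (const float *)&src[i];
        float       *d = (float *)&dst[i];
        __m128 lo = _mm_loadu_ps(s);      /* first 16 bytes of the vertex  */
        __m128 hi = _mm_loadu_ps(s + 4);  /* second 16 bytes of the vertex */
        _mm_stream_ps(d, lo);
        _mm_stream_ps(d + 4, hi);
    }
    _mm_sfence();   /* make the streaming stores visible before returning */
}
```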
14 Impact of Prefetch
- Useful on large (many line) loops
  - compute bound or poor CPU/Bus overlap
- Can prefetch input AND output buffers
- Usually won't benefit D3D HALs
  - data often passed to HAL in cache
  - might prefetch L2 to L1 for large buffers
- Can benefit OpenGL* pipeline
  - prefetch vertices and normals
  - prefetch large temp buffer
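The idea of prefetching an input buffer ahead of the loop that consumes it can be sketched as follows. The function name, the prefetch distance of 64 floats, and the choice of the non-temporal hint are all tuning assumptions rather than values from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

#define PREFETCH_AHEAD 64   /* distance in floats; a tuning assumption */

/* Scales n floats (n a multiple of 4), requesting input data a fixed
   distance ahead of the element currently being processed. */
static void scale_with_prefetch(const float *in, float *out, size_t n, float s)
{
    __m128 scale = _mm_set1_ps(s);
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        if (i + PREFETCH_AHEAD < n)       /* stay inside the buffer */
            _mm_prefetch((const char *)(in + i + PREFETCH_AHEAD), _MM_HINT_NTA);
        _mm_storeu_ps(out + i, _mm_mul_ps(_mm_loadu_ps(in + i), scale));
    }
}
```

A real loop would typically issue one prefetch per cache line rather than one per four floats; the per-iteration form is kept here only for brevity.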
15 Branches Bad, SIMD Good
- Branches cause stalling, and are not SIMD
- Some branches can be avoided
  - Conditional move: CMOV, MASKMOVQ
  - Average: PAVGx
  - Sum Absolute Differences: PSADBW
  - Clamp/Saturate: MINPS/MAXPS (FP), PMAXxx/PMINxx (INT)
  - Select values: CMPPS, MOVMSKPS, ANDNPS, ORPS (FP), PCMPxxx, MOVQ, PANDN, POR (INT)
- Reduce branching
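Two of the branch-free idioms listed above, clamping with MINPS/MAXPS and selecting with a CMPPS mask, can be sketched with intrinsics as follows. The function names are hypothetical.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Clamp each lane to [lo, hi] with MAXPS/MINPS instead of two branches. */
static __m128 clamp_ps(__m128 v, float lo, float hi)
{
    assert(lo <= hi);
    return _mm_min_ps(_mm_max_ps(v, _mm_set1_ps(lo)), _mm_set1_ps(hi));
}

/* Per lane: (a < b) ? x : y, built from CMPPS, ANDPS, ANDNPS and ORPS. */
static __m128 select_lt_ps(__m128 a, __m128 b, __m128 x, __m128 y)
{
    __m128 mask = _mm_cmplt_ps(a, b);              /* all ones where a < b */
    return _mm_or_ps(_mm_and_ps(mask, x),          /* keep x where mask set */
                     _mm_andnot_ps(mask, y));      /* keep y elsewhere      */
}
```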
16 Minimizing the Negative Effect of Branches
- Try to move branches outside loops
- Some branches OK
  - Well predicted branches
  - Know the static branch prediction rules
    - Forward conditional branches are predicted as not taken
    - Backward conditional branches are predicted as taken
  - Branch to avoid large block of code
- SIMD condition check: MOVMSKPS and PMOVMSKB
- Make necessary branches cheaper
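A SIMD condition check with MOVMSKPS, used to branch around work that a whole group of four values does not need, might look like this. The function `clamp_if_needed` and its return value are hypothetical, chosen only to make the skipped path observable.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Clamps values to an upper limit, but skips the store for groups of four
   that are already in range. Returns how many groups needed the slow path. */
static int clamp_if_needed(float *v, int n, float limit)
{
    __m128 lim = _mm_set1_ps(limit);
    int slow = 0;
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(v + i);
        /* MOVMSKPS packs the four comparison results into bits 0..3. */
        int mask = _mm_movemask_ps(_mm_cmpgt_ps(x, lim));
        if (mask != 0) {                  /* one branch per four values */
            _mm_storeu_ps(v + i, _mm_min_ps(x, lim));
            ++slow;
        }
    }
    return slow;
}
```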
17 Branch example: Culling
- Non-SIMD:
  - Culling mode (CW, CCW, None) branch is well predicted (no need to avoid)
  - Triangle facing is not so well predicted
- Test triangle facing, then:
  - CMOV pointer to cached memory to replace pointer to AGP memory
  - CMOV zero to count of bytes written to AGP
  - Write using pointer, add count to total
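The pointer-and-count trick described above can be sketched in C as follows. The `Tri` type and the function name are hypothetical, and whether the compiler actually emits CMOV for the two selects depends on the compiler and its options; the point is only that the write is unconditional and the output position advances by zero for culled triangles.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x[3], y[3]; } Tri;   /* hypothetical screen-space triangle */

/* Writes front-facing triangles to agp_out without branching on the
   poorly predicted facing test. Returns the number of triangles written. */
static size_t emit_front_facing(const Tri *in, size_t n, Tri *agp_out)
{
    Tri scratch;                  /* cached dummy target for culled triangles */
    size_t written = 0;
    assert(in != NULL && agp_out != NULL);
    for (size_t i = 0; i < n; ++i) {
        const Tri *t = &in[i];
        float a2 = (t->x[1] - t->x[0]) * (t->y[2] - t->y[0])
                 - (t->x[2] - t->x[0]) * (t->y[1] - t->y[0]);
        int front = a2 > 0.0f;
        Tri *dst = front ? &agp_out[written] : &scratch;  /* CMOV candidate */
        *dst = *t;                     /* unconditional write               */
        written += (size_t)front;      /* adds zero when the triangle is culled */
    }
    return written;
}
```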
18 Better Culling in OpenGL*
- Ordered primitives
  - Test in model space (before transform)
  - reduces transform/lighting work
- Indexed primitives
  - Test after transform in view or screen space
  - must transform all vertices but not as many as with ordered primitives
19 Better Clipping in OpenGL*
- Clip code generation is often used
- gl_ext_clip_volume - useful optimization
  - works especially well for indexed primitives
  - only generate code once per vertex
- Implement clip-hint extension
  - app hints when objs fully in view
20 Clip-code generation for OpenGL*
- Test XY in screen space
  - must transform all vertices but not as many as with ordered primitives
  - works well for indexed primitives
- Test XYZ in model space vs. view frustum
  - drop out-of-view triangles before Xform
  - less transform/lighting to do
  - best for ordered triangles
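As an illustration of clip-code generation for four vertices at once, the sketch below compares clip-space coordinates against -w and +w with packed FP compares and gathers the results with MOVMSKPS. This is one possible formulation, not the method timed in the slides (which mention integer SIMD); the bit assignments and the name `clip_codes4` are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

enum { CLIP_LEFT = 1, CLIP_RIGHT = 2, CLIP_BOTTOM = 4,
       CLIP_TOP = 8, CLIP_NEAR = 16, CLIP_FAR = 32 };

/* Four vertices in SoA clip space (x, y, z, w registers); writes one
   6-bit clip code per vertex. A vertex is inside when -w <= c <= w. */
static void clip_codes4(__m128 x, __m128 y, __m128 z, __m128 w,
                        unsigned char codes[4])
{
    __m128 neg_w = _mm_sub_ps(_mm_setzero_ps(), w);
    int m[6];
    assert(codes != NULL);
    m[0] = _mm_movemask_ps(_mm_cmplt_ps(x, neg_w));   /* left   */
    m[1] = _mm_movemask_ps(_mm_cmpgt_ps(x, w));       /* right  */
    m[2] = _mm_movemask_ps(_mm_cmplt_ps(y, neg_w));   /* bottom */
    m[3] = _mm_movemask_ps(_mm_cmpgt_ps(y, w));       /* top    */
    m[4] = _mm_movemask_ps(_mm_cmplt_ps(z, neg_w));   /* near   */
    m[5] = _mm_movemask_ps(_mm_cmpgt_ps(z, w));       /* far    */
    for (int v = 0; v < 4; ++v) {
        unsigned code = 0;
        for (int p = 0; p < 6; ++p)
            code |= (unsigned)((m[p] >> v) & 1) << p;  /* plane p, vertex v */
        codes[v] = (unsigned char)code;
    }
}
```

With codes in this form, a triangle whose three codes AND to a non-zero value lies entirely outside one plane and can be dropped, and one whose codes OR to zero needs no clipping.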
21 ClipCode Generation
- Scalar Integer, No prefetch: 60 clocks/vert
- Scalar Integer, Prefetch: 30 clocks/vert
- SIMD Integer, Prefetch: 20 clocks/vert
- All timings from a transform/clip func
  - measure full time per vertex
  - disable clip codes to get Xform time
  - clip code time = total - Xform
22 OpenGL* Geometry
- Vertex Transform: to swizzle, or not?
- For large vertex sets, transpose to SoA
  - get four vertices into xxxx, yyyy, zzzz
  - Use MOVLPS/MOVHPS for faster transpose
- For smaller sets, use SIMD but AoS
  - load xyz, shuffle to xxxx, yyyy, zzzz, MULPS by matrix rows, ADDPS to get XYZW result
  - somewhat slower per vertex than SoA
- X87 and SIMD FP have different precision
- Use scalar & packed (SS/PS)
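A sketch of the SoA transform follows: four vertices held as xxxx, yyyy, zzzz registers are multiplied by broadcast matrix elements and accumulated with MULPS/ADDPS. The row-major matrix layout, the assumption that the input w is 1, and the name `transform4_soa` are choices made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Transforms four vertices stored as SoA registers by a row-major 4x4
   matrix m, assuming input w = 1. out[0..3] receive XXXX, YYYY, ZZZZ, WWWW. */
static void transform4_soa(const float m[16], __m128 x, __m128 y, __m128 z,
                           __m128 out[4])
{
    assert(m != NULL && out != NULL);
    for (int r = 0; r < 4; ++r) {
        __m128 acc = _mm_set1_ps(m[4 * r + 3]);   /* translation term (w = 1) */
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(m[4 * r + 0]), x));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(m[4 * r + 1]), y));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(m[4 * r + 2]), z));
        out[r] = acc;
    }
}
```

In a tuned loop the sixteen broadcast matrix elements would be prepared once outside the per-batch loop rather than rebuilt for every group of four vertices.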
23 Fast Transposed-Load Macro
#define AosLoad( in, stride, x, y, z, w ) \
{ __m128 tmp ; \
  /* x and y: the first 8 bytes of each of four vertices */ \
  x = _mm_loadl_pi( x, (__m64 *)(in) ); \
  x = _mm_loadh_pi( x, (__m64 *)((stride) + (char *)(in)) ); \
  y = _mm_loadl_pi( y, (__m64 *)(2*(stride) + (char *)(in)) ); \
  y = _mm_loadh_pi( y, (__m64 *)(3*(stride) + (char *)(in)) ); \
  tmp = _mm_shuffle_ps( x, y, _MM_SHUFFLE( 2, 0, 2, 0 ) ); \
  y = _mm_shuffle_ps( x, y, _MM_SHUFFLE( 3, 1, 3, 1 ) ); \
  x = tmp ; \
  /* z and w: the next 8 bytes (offset 8) of the same four vertices */ \
  z = _mm_loadl_pi( z, (__m64 *)(8 + (char *)(in)) ); \
  z = _mm_loadh_pi( z, (__m64 *)((stride) + 8 + (char *)(in)) ); \
  w = _mm_loadl_pi( w, (__m64 *)(2*(stride) + 8 + (char *)(in)) ); \
  w = _mm_loadh_pi( w, (__m64 *)(3*(stride) + 8 + (char *)(in)) ); \
  tmp = _mm_shuffle_ps( z, w, _MM_SHUFFLE( 2, 0, 2, 0 ) ); \
  w = _mm_shuffle_ps( z, w, _MM_SHUFFLE( 3, 1, 3, 1 ) ); \
  z = tmp ; }
Note that this example uses intrinsics rather than assembler
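For checking the transposed load in isolation, the same technique can be written as a function. This is a sketch under stated assumptions: the name `aos_load4` is hypothetical, each vertex is assumed to start with four consecutive floats (x, y, z, w), and the z/w loads are assumed to sit 8 bytes into each vertex.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Loads four AoS vertices (each starting with x, y, z, w floats, spaced
   stride bytes apart) and returns them transposed as xxxx, yyyy, zzzz, wwww. */
static void aos_load4(const void *in, size_t stride,
                      __m128 *x, __m128 *y, __m128 *z, __m128 *w)
{
    const char *p = (const char *)in;
    __m128 a = _mm_setzero_ps(), b = _mm_setzero_ps();
    assert(stride >= 16);

    a = _mm_loadl_pi(a, (const __m64 *)(p));               /* x0 y0 -- -- */
    a = _mm_loadh_pi(a, (const __m64 *)(p + stride));      /* x0 y0 x1 y1 */
    b = _mm_loadl_pi(b, (const __m64 *)(p + 2 * stride));  /* x2 y2 -- -- */
    b = _mm_loadh_pi(b, (const __m64 *)(p + 3 * stride));  /* x2 y2 x3 y3 */
    *x = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0));    /* x0 x1 x2 x3 */
    *y = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1));    /* y0 y1 y2 y3 */

    a = _mm_loadl_pi(a, (const __m64 *)(p + 8));               /* z0 w0 -- -- */
    a = _mm_loadh_pi(a, (const __m64 *)(p + stride + 8));      /* z0 w0 z1 w1 */
    b = _mm_loadl_pi(b, (const __m64 *)(p + 2 * stride + 8));  /* z2 w2 -- -- */
    b = _mm_loadh_pi(b, (const __m64 *)(p + 3 * stride + 8));  /* z2 w2 z3 w3 */
    *z = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0));        /* z0 z1 z2 z3 */
    *w = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1));        /* w0 w1 w2 w3 */
}
```

Because MOVLPS/MOVHPS move 8 bytes at a time, the input only needs natural float alignment, which is part of why this form suits vertex data with arbitrary strides.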
24 Geometry on Pentium III Processor
- Graph shows only the transformation step
  - x87 to/from L2, copy to mem: 155 clks/vert
  - x87 to/from L2: 52 clks/vert
  - SIMD FP to/from L2: 45 clks/vert
  - SIMD FP from NTPS to AGP: 40 clks/vert
- When lighting and projection are considered, the improvement is even bigger
- Accelerates T & L, improves cache usage
25 OpenGL* RGBA Color Packing
- Given single precision R, G, B and A, we want to convert to packed 8 bit ints
  - MULPS - multiply all elements by
  - MINPS/MAXPS - clamp/saturate
  - CVTTPS2PI, MOVHLPS, CVTTPS2PI
  - MMX pack to bytes, store packed RGBA
  - EMMS when done with all vertices
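The packing sequence listed above can be sketched with intrinsics as follows. The scale factor of 255 is an assumption (the multiplier is missing from the slide text), as are the function names and the output byte order with R in the low byte. The MMX intrinsics used (`_mm_cvttps_pi32`, `_mm_packs_pi32`, `_mm_packs_pu16`, `_mm_empty`) correspond to CVTTPS2PI, PACKSSDW, PACKUSWB and EMMS.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE; also brings in the MMX intrinsics */

/* Packs one float RGBA color (lanes r, g, b, a) into 8 bits per channel.
   Leaves the MMX state active; the caller issues EMMS when finished. */
static unsigned pack_rgba8_noemms(__m128 rgba)
{
    __m128 v = _mm_mul_ps(rgba, _mm_set1_ps(255.0f));   /* assumed scale */
    v = _mm_min_ps(_mm_max_ps(v, _mm_setzero_ps()), _mm_set1_ps(255.0f));
    __m64 rg  = _mm_cvttps_pi32(v);                     /* r, g as int32 */
    __m64 ba  = _mm_cvttps_pi32(_mm_movehl_ps(v, v));   /* b, a as int32 */
    __m64 w16 = _mm_packs_pi32(rg, ba);                 /* r g b a as int16 */
    __m64 b8  = _mm_packs_pu16(w16, w16);               /* r g b a as bytes */
    return (unsigned)_mm_cvtsi64_si32(b8);
}

/* Packs count colors (4 floats each) and issues EMMS once at the end. */
static void pack_rgba8_array(const float *rgba, unsigned *out, int count)
{
    assert(count >= 0 && rgba != NULL && out != NULL);
    for (int i = 0; i < count; ++i)
        out[i] = pack_rgba8_noemms(_mm_loadu_ps(rgba + 4 * i));
    _mm_empty();   /* EMMS: restore the x87 state after all vertices */
}
```

Truncating conversion is used, as on the slide, so 0.5 maps to 127 rather than 128; adding 0.5 before the conversion would round instead.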
26 Summary
- You can use the Pentium III processor's Streaming SIMD Extensions and AGP to deliver more triangles-per-second to the graphics subsystem
- Wherever there is...
  - vectorizable math or logic
  - or large amounts of data
- there are big opportunities to run faster with the Pentium III processor
27 More Tips
- Unaligned: use MOVHPS and MOVLPS
  - Cache line split less often than MOVUPS
  - May save values for you (colors)
- Measure improvement with:
  - Time stamps, frame counters
  - Intel tools: VTune performance tool and IPEAK
  - Geometry subtests of major benchmarks:
    - ZD's WinBench 3D geometry subtest (D3D)
    - Sunset's Indy 3D (OpenGL*)
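The first tip, loading unaligned data as two 8-byte halves with MOVLPS/MOVHPS rather than one MOVUPS, can be sketched as below. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Loads four floats from an address that need not be 16-byte aligned,
   as two 8-byte halves (MOVLPS + MOVHPS). */
static __m128 load4_unaligned(const float *p)
{
    __m128 v = _mm_setzero_ps();
    assert(p != NULL);
    v = _mm_loadl_pi(v, (const __m64 *)p);         /* lanes 0 and 1 */
    v = _mm_loadh_pi(v, (const __m64 *)(p + 2));   /* lanes 2 and 3 */
    return v;
}
```

Each 8-byte half can cross a cache-line boundary only if it starts in the last few bytes of a line, which is the sense in which these loads split cache lines less often than a single 16-byte access.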
28 Resources
- Pentium III processor information on the web:
- Numega's* SoftICE* debugger
  - New version supports Pentium III processor
29 More Information
30 Intrinsics
- Pros
  - Does all the register management for you
  - Compiler interprets them, continues to fully optimize
  - Easier to read for some people
  - Some replace several steps
- Cons
  - About 6% less efficient than the best asm language
  - Must use Intel compiler for now
    - takes a little longer to compile
    - better at some optimizations, worse at others
    - critical code coach (pro?)
31 Intel IPEAK Graphics Performance Toolkit
- Can analyze DirectX* 6.1 apps for:
  - frames/sec
  - triangle usage (#, pixels, # by pixels)
  - texture utilization
- Helps detect performance limiting factors
- OpenGL* and D3D supported on Windows* 9x and Windows NT* 4.0 (5.0 soon)
More informationOptimization of Lattice QCD codes for the AMD Opteron processor
Optimization of Lattice QCD codes for the AMD Opteron processor Miho Koma (DESY Hamburg) ACAT2005, DESY Zeuthen, 26 May 2005 We report the current status of the new Opteron cluster at DESY Hamburg, including
More informationAdvanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University
Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:
More informationLecture 25: Board Notes: Threads and GPUs
Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationConvolution Soup: A case study in CUDA optimization. The Fairmont San Jose Joe Stam
Convolution Soup: A case study in CUDA optimization The Fairmont San Jose Joe Stam Optimization GPUs are very fast BUT Poor programming can lead to disappointing performance Squeaking out the most speed
More informationProcessors, Performance, and Profiling
Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationMaximizing Face Detection Performance
Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount
More informationHPC VT Machine-dependent Optimization
HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler
More informationCS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III
CS 152 Computer Architecture and Engineering Lecture 8 - Memory Hierarchy-III Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste
More informationCaches Concepts Review
Caches Concepts Review What is a block address? Why not bring just what is needed by the processor? What is a set associative cache? Write-through? Write-back? Then we ll see: Block allocation policy on
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More information! Readings! ! Room-level, on-chip! vs.!
1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads
More informationLecture 18: DRAM Technologies
Lecture 18: DRAM Technologies Last Time: Cache and Virtual Memory Review Today DRAM organization or, why is DRAM so slow??? Lecture 18 1 Main Memory = DRAM Lecture 18 2 Basic DRAM Architecture Lecture
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More information