SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL
Matthias Bach and David Rohr
Frankfurt Institute for Advanced Studies, Goethe University of Frankfurt
I: INTRODUCTION
Scaling DGEMM to Multiple Cayman GPUs and Interlagos Many-core CPUs for HPL, June 15, 2011
LOEWE-CSC
An AMD based supercomputer
- 786 GPU compute nodes
  - 2 x 2.1 GHz 12-core AMD Opteron
  - 1 AMD Radeon 5870 GPU
- 40 high density compute nodes
  - 4 x 2.1 GHz 12-core AMD Opteron
- 32 GiB RAM / CPU
- 16.5 TB/s bisectional network bandwidth
- 1.62 PB local storage
- New HPL version and own DGEMM kernel
  - 299 Tflop/s overall
  - #22 Top 500 November 2010
  - #8 Green 500 November
DGEMM AND LINPACK
Optimizing Step by Step
The given optimization problems
- LINPACK
  - Usually HPL
  - Solves A * x = b
  - Gaussian elimination with partial pivoting
  - Performance dominated by DGEMM
- DGEMM
  - General Matrix Multiply
  - C = alpha * A * B + beta * C
The optimization steps
- GPU DGEMM kernel
- DGEMM on a single-GPU system
- DGEMM on a multi-GPU system
- Moving from Cypress to Cayman
- Tuning HPL for a GPU DGEMM
- Using Interlagos in the future
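For reference, the DGEMM operation named on this slide can be written as a plain triple loop. This is only a minimal sketch of the mathematical definition, not the GPU kernel discussed in the talk; the function name `dgemm_reference` and the list-of-rows matrix layout are assumptions made for illustration.

```python
def dgemm_reference(alpha, a, b, beta, c):
    """Return alpha * A * B + beta * C for matrices stored as lists of rows.

    A is n x k, B is k x m, C is n x m (hypothetical helper, for illustration).
    """
    n, k, m = len(a), len(b), len(b[0])
    result = []
    for i in range(n):
        row = []
        for j in range(m):
            # Scalar product of row i of A and column j of B (length k).
            dot = 0.0
            for l in range(k):
                dot += a[i][l] * b[l][j]
            row.append(alpha * dot + beta * c[i][j])
        result.append(row)
    return result


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
print(dgemm_reference(2.0, a, b, 0.5, c))
```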
II: DGEMM KERNELS
DGEMM KERNELS
Matrix Blocking
- DGEMM = General Matrix Multiply: C = alpha * A * B + beta * C
- A * B on GPU
- A: n x k matrix, B: k x m matrix, C: n x m matrix
- Complexity: 2 * m * n * k floating point operations, 2 * m * n * k memory fetches
- Blocking: A_il = a x 1 matrix; B_lj = 1 x b matrix; C_ij = a x b matrix
- Caching A_i and B_j reduces memory fetches
- One block per thread
- More registers required for larger blocking (e.g. 41 for 8x8)
Block size vs. wavefront count
- Larger blocking reduces memory fetches
- Smaller blocking increases number of wavefronts
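The effect of blocking on memory fetches can be illustrated with a small counting model. The sketch below is a CPU-side illustration under simple assumptions (every element of A and B read by a block counts as one fetch, and nothing is cached between blocks); `blocked_matmul` is a hypothetical helper, not the kernel from the talk.

```python
def blocked_matmul(a, b, block_rows, block_cols):
    """Multiply A (n x k) by B (k x m) with block_rows x block_cols blocks of C.

    Returns (C, fetches), where fetches counts the elements of A and B read.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    fetches = 0
    for i0 in range(0, n, block_rows):
        for j0 in range(0, m, block_cols):
            rows = range(i0, min(i0 + block_rows, n))
            cols = range(j0, min(j0 + block_cols, m))
            for l in range(k):
                # One a x 1 slice of A and one 1 x b slice of B per step;
                # each fetched value is reused for a whole row or column
                # of the C block.
                a_col = [a[i][l] for i in rows]
                b_row = [b[l][j] for j in cols]
                fetches += len(a_col) + len(b_row)
                for bi, i in enumerate(rows):
                    for bj, j in enumerate(cols):
                        c[i][j] += a_col[bi] * b_row[bj]
    return c, fetches


size = 8
a = [[float(i + l) for l in range(size)] for i in range(size)]
b = [[float(l - j) for j in range(size)] for l in range(size)]
_, unblocked = blocked_matmul(a, b, 1, 1)
_, blocked = blocked_matmul(a, b, 4, 4)
print(unblocked, blocked)
```

With 1 x 1 blocks the model reproduces the 2 * m * n * k fetch count from the slide; with 4 x 4 blocks the count drops by a factor of four in this model.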
KERNEL TUNING
Parameters of an optimal kernel
- Assume square C matrix for simplicity
- Optimal kernel parameters determined experimentally
  - 4 x 4 blocking
  - B matrix transposed, A matrix not transposed
  - Pixel shader kernel, not compute shader
  - Output using color buffers, not MemExport
  - Output buffer located either in host or GPU memory (depends on GPU and chipset)
  - Loop unrolled with an unrolling factor of two
  - Texture cache is used in tiled mode
  - Hardcoded K = 1024
- Identical for Cypress and Cayman
MATRIX SIZE DEPENDENCE OF THE KERNEL
Cypress
- Kernel performance depends on:
  - K = length of scalar product
  - H = number of lines / rows in C matrix calculated per kernel launch
- Cypress prefers large values for both
  - Good: K = 1024, H >= 1024 (best >= 3072)
- H can be chosen depending on M and N of A and B.
- Kernel peak performance: 494 Gflop/s
MATRIX SIZE DEPENDENCE OF THE KERNEL
Cayman
- Kernel performance depends on:
  - K = length of scalar product
  - H = number of lines / rows in C matrix calculated per kernel launch
- Cypress prefers large values for both
  - Good: K = 1024, H >= 1024 (best >= 3072)
- Cayman prefers smaller values for both
  - Larger matrices must be cut into blocks
- Kernel peak performance: 617 Gflop/s
III: DGEMM SYSTEM PERFORMANCE
FULL DGEMM
Feeding work to the GPU
- C matrix does not fit in GPU memory
  - Divide problem into blocks
- Choose perfect problem sizes for GPU
  - Remainder from block construction processed by CPU
- For each A-submatrix, all B-submatrices are iterated over this fixed A-submatrix.
- The B-submatrices stay on the GPU; submatrices of A are not necessarily stored (halves memory need).
- Still, each matrix is transferred exactly once.
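The block construction described above can be sketched as follows. This is a minimal illustration, assuming square GPU tiles of a fixed edge length; the function `split_for_gpu`, the tile size, and the way the remainder is cut into two strips are assumptions for the example and are not taken from the slides.

```python
def split_for_gpu(rows, cols, tile):
    """Split a rows x cols C matrix into full GPU tiles and a CPU remainder.

    Returns (tiles, remainder). Each tile is (row_offset, col_offset); each
    remainder strip is (row_offset, col_offset, height, width).
    """
    gpu_rows = (rows // tile) * tile
    gpu_cols = (cols // tile) * tile
    # Outer loop over A-submatrices (row blocks), inner loop over
    # B-submatrices (column blocks): all B blocks are visited for a fixed A.
    tiles = [(r, c)
             for r in range(0, gpu_rows, tile)
             for c in range(0, gpu_cols, tile)]
    remainder = []
    if cols > gpu_cols:
        # Right strip, full height.
        remainder.append((0, gpu_cols, rows, cols - gpu_cols))
    if rows > gpu_rows:
        # Bottom strip below the GPU tiles.
        remainder.append((gpu_rows, 0, rows - gpu_rows, gpu_cols))
    return tiles, remainder


tiles, remainder = split_for_gpu(2500, 3300, 1024)
print(len(tiles), remainder)
```

The GPU only ever sees tiles of the size it handles best, and the two strips that do not fill a complete tile are left to the CPU.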
GPU SCHEDULING
A simple approach
- GPU only calculates A * B; the addition of C is done on the host
- GPU requires special memory layout
- Pre- and postprocessing required
  - DivideBuffer: transform A and B as required
  - MergeBuffer: add GPU result (A * B) and C
- Performance loss due to GPU idle time
GPU SCHEDULING
Building a pipeline
- Pipeline minimizes GPU idle time
- One divide and one merge DMA must be started before the kernel
- Iterate blocks of B over fixed A
- Minimize PCIe transfers
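Why a pipeline reduces GPU idle time can be seen in a small timing model. The sketch below assumes three stages per tile (divide and transfer, kernel, merge), each handled by its own resource with a constant duration; the stage durations and function names are made up for illustration and are not measurements from the talk.

```python
def sequential_time(num_tiles, stage_times):
    """Total time when each tile runs all stages before the next tile starts."""
    return num_tiles * sum(stage_times)


def pipelined_time(num_tiles, stage_times):
    """Total time when every stage can work on a different tile concurrently."""
    # finish[s] is the time at which stage s finished its most recent tile.
    finish = [0.0] * len(stage_times)
    for _ in range(num_tiles):
        previous_stage_done = 0.0
        for s, duration in enumerate(stage_times):
            # A stage starts once the tile has left the previous stage and
            # the stage itself is free again.
            start = max(previous_stage_done, finish[s])
            finish[s] = start + duration
            previous_stage_done = finish[s]
    return finish[-1]


stages = [1.0, 2.0, 1.0]  # divide + transfer, kernel, merge (arbitrary units)
print(sequential_time(4, stages), pipelined_time(4, stages))
```

In this model the kernel stage is the bottleneck, so once the pipeline is filled the GPU runs back to back and only the first divide and the last merge remain exposed.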
DGEMM SYSTEM PERFORMANCE
Hiding PCIe
DGEMM KERNELS
Matrix Tiling
- Using the CPU only for the remainder part of the tiling wastes CPU power.
- Distribute workload among GPU/CPU
  - Phase 1a: Large rectangular block (size chosen such that CPU finishes slightly before GPU)
  - Phase 1b: Remainder part
  - Phase 2: Second rectangular block
  - Phase 3: Single tiles (work stealing)
- Size of rectangular blocks chosen based on vacant matrix size and performance estimations.
- Processing larger rectangular blocks increases CPU DGEMM performance compared to single tiles.
- CPU / GPU DGEMM performance continuously monitored.
- Monitored performance data used to improve estimations for next DGEMM call.
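The idea of sizing the CPU block from performance estimates, and refining those estimates after each call, can be sketched as below. This is an illustration only: the safety factor, the exponential smoothing, and the rate values (465 Gflop/s for the GPU, 160 Gflop/s for the CPU) are assumptions chosen for the example and do not describe the actual estimator.

```python
def cpu_share(gpu_rate, cpu_rate, safety=0.95):
    """Fraction of the work given to the CPU.

    A safety factor below 1 makes the CPU block slightly smaller than the
    balanced split, so the CPU finishes slightly before the GPU.
    """
    return safety * cpu_rate / (gpu_rate + cpu_rate)


def update_estimate(old_rate, measured_rate, weight=0.5):
    """Blend the previous estimate with the rate measured in the last call."""
    return (1.0 - weight) * old_rate + weight * measured_rate


gpu_rate, cpu_rate = 465.0, 160.0  # assumed estimates in Gflop/s
share = cpu_share(gpu_rate, cpu_rate)
cpu_time = share / cpu_rate            # time per unit of total work
gpu_time = (1.0 - share) / gpu_rate
print(share, cpu_time < gpu_time)

# After the call, the monitored CPU rate feeds the next estimate.
cpu_rate = update_estimate(cpu_rate, 150.0)
print(cpu_rate)
```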
DGEMM SYSTEM PERFORMANCE
Matrix size effects
IV: MULTI-GPU DGEMM
MATRIX DISTRIBUTION
Matrix distribution among n GPUs
- B matrix split in n parts.
- Each GPU processes only one part of B.
- Buffer requirement reduced.
- After the first GPU has processed its part, the remaining tiles are processed in a round-robin fashion.
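A minimal sketch of this distribution, assuming B is cut into numbered tiles: each GPU first receives one contiguous part, and whatever is still unprocessed when the first GPU runs out of work is dealt out round-robin. The helper names and the tile numbering are assumptions for illustration.

```python
def split_b_tiles(num_tiles, num_gpus):
    """Split tile indices 0..num_tiles-1 into num_gpus contiguous parts."""
    base, extra = divmod(num_tiles, num_gpus)
    parts, start = [], 0
    for gpu in range(num_gpus):
        # The first `extra` GPUs take one additional tile.
        size = base + (1 if gpu < extra else 0)
        parts.append(list(range(start, start + size)))
        start += size
    return parts


def round_robin(remaining_tiles, num_gpus):
    """Deal the remaining tiles out to the GPUs one at a time."""
    return [remaining_tiles[gpu::num_gpus] for gpu in range(num_gpus)]


print(split_b_tiles(10, 3))
print(round_robin([5, 6, 8, 9], 3))
```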
DUAL-GPU PERFORMANCE
Analysis of dual-GPU version
- Accumulated max. perf. is the accumulated DGEMM performance of all contributing processing elements.
- The accumulated max. perf. is corrected for the CPU cores used for GPU pre- and postprocessing, to approximate the performance of a best-case implementation.
- The efficiency is the ratio of the achieved performance to this best-case performance.
MULTI-GPU BANDWIDTH REQUIREMENTS
- Multi-GPU DGEMM has tremendous memory and PCI-Express throughput requirements
- Reading from and writing to the C matrix requires at least the following PCI-Express throughput p(k) and memory throughput m(k)
  (g: performance in Gflop/s, s: size of an element in bytes, i.e. 8 for double precision floating point, n: number of GPUs):
  - p(k) = g * s / (2 * k)
  - m(k) = 2 * g * s / k
  - p(1024) = 1.82 * n [GB/s]
  - p(2048) = 0.91 * n [GB/s]
  - m(1024) = 7.27 * n [GB/s]
  - m(2048) = 3.63 * n [GB/s]
- Additional throughput required for concurrent CPU DGEMM.
- PCI performance sufficient, even for two GPUs via PCI-Express switch.
- Memory performance possibly insufficient, depends on k.
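The per-GPU numbers on this slide can be reproduced from the two formulas. The sketch below assumes g = 465 Gflop/s per GPU, which is the GPU-only 5870 figure reported on the DGEMM performance slide; the slide itself does not state which g was used, so this value is an inference, and the function names are hypothetical.

```python
def pcie_gb_per_s(g, k, s=8):
    """Required PCI-Express throughput per GPU: p(k) = g * s / (2 * k)."""
    return g * s / (2 * k)


def memory_gb_per_s(g, k, s=8):
    """Required host memory throughput per GPU: m(k) = 2 * g * s / k."""
    return 2 * g * s / k


g = 465.0  # assumed per-GPU DGEMM performance in Gflop/s
for k in (1024, 2048):
    print(k, round(pcie_gb_per_s(g, k), 2), round(memory_gb_per_s(g, k), 2))
```

Multiplying each value by the number of GPUs n gives the totals listed above; doubling k halves both requirements.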
MULTI-GPU PERFORMANCE
Multi-GPU performance depends greatly on k
- Larger k decreases memory bandwidth requirements.
- Performance becomes constant for large k.
Predictions for quad-GPU and more
- The k parameter gets even more critical.
- GPU memory requirement scales linearly with k.
- Bounded k range due to GPU memory limitations.
TRIPLE-GPU PERFORMANCE
- Speed scales almost linearly to three GPUs
- The faster 5870 GPU core scales worse than the two slower 9350 and
- The 5970 works like two
CPU UTILIZATION
CPU utilization per core during DGEMM and HPL
- Core 0 performs preprocessing and management of DMA transfer.
- One core sufficient for preprocessing and management of up to two GPUs.
- Postprocessing multithreaded for slow CPU / dual GPU
Optimized implementation
- Preprocessing multithreaded for more than two GPUs.
- Threads for pre- and postprocessing pinned to the CPU die with closest connection to the GPU.
DGEMM PERFORMANCE
Maximum DGEMM performance achieved:
- 5870 GPU, GPU only: 465 Gflop/s
- Two Magny-Cours 6172 CPUs, 5870 GPU: 625 Gflop/s
- 8 Nehalem cores 2.26 GHz, 5970 GPU: 832 Gflop/s
- Two Magny-Cours 6174 CPUs, three 5870 GPUs: 1432 Gflop/s
V: GOING FROM CYPRESS TO CAYMAN
CAYMAN & DMA
- Multiple buffers for B matrix
- Two buffers for A matrix (round-robin use)
- Three buffers for output (round-robin use)
- Transfer to GPU performed by DMA engine
- Two additional page-locked buffers per input matrix on host side (round-robin use)
- Addition step of DGEMM done by host
- PROBLEM: Cayman shows poor performance when using DMA engine.
CAYMAN & DMA
DMA performance (CPU to GPU)
- Cayman shows full DMA speed for 64 bit transfers.
- Kernel must read from 128 bit buffer for full performance.
- Different DMA path (1b) introduced.
DMA performance (GPU to CPU)
- Kernel must write to 128 bit buffer.
- Kernel can write directly to host memory in 128 bit format, bypassing the DMA engine.
- Using 64 bit conversion infeasible.
CAYMAN & DMA
Multi-GPU
- All buffers replicated on each device.
- Processing split along B matrix.
- Each GPU caches only a part of the B matrix.
VI: HPL
HPL (HIGH PERFORMANCE LINPACK)
- Linpack iteratively factorizes a dense system of linear equations.
- Each iteration consists of:
  - Panel factorization
  - Panel broadcast
  - Line swapping & U-broadcast
  - U-matrix update (DTRSM)
  - C-matrix update (DGEMM)
- HPL utilizes BLAS routines
- Major contribution to workload by DGEMM (95.6% of total execution time)
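The steps of one iteration can be made concrete with a single-process blocked LU factorization. The sketch below is a textbook right-looking variant written for illustration; it omits the broadcasts (there is only one process) and is not the HPL implementation. The function name `blocked_lu` and the in-place list-of-rows storage are assumptions.

```python
def blocked_lu(a, nb):
    """Factorize the square matrix a in place with block size nb.

    Afterwards the strict lower triangle holds L (unit diagonal implied) and
    the upper triangle holds U. Returns perm, where perm[i] is the original
    index of the row now stored at position i.
    """
    n = len(a)
    perm = list(range(n))
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # Panel factorization with partial pivoting (columns k0 .. k1-1).
        for j in range(k0, k1):
            pivot = max(range(j, n), key=lambda i: abs(a[i][j]))
            if pivot != j:
                # Line swapping: exchange complete rows.
                a[j], a[pivot] = a[pivot], a[j]
                perm[j], perm[pivot] = perm[pivot], perm[j]
            for i in range(j + 1, n):
                a[i][j] /= a[j][j]
                for c in range(j + 1, k1):
                    a[i][c] -= a[i][j] * a[j][c]
        # U-matrix update (DTRSM): solve L11 * U12 = A12, L11 unit lower.
        for j in range(k0, k1):
            for i in range(j + 1, k1):
                for c in range(k1, n):
                    a[i][c] -= a[i][j] * a[j][c]
        # C-matrix update (DGEMM): A22 -= L21 * U12.
        for i in range(k1, n):
            for c in range(k1, n):
                dot = 0.0
                for l in range(k0, k1):
                    dot += a[i][l] * a[l][c]
                a[i][c] -= dot
    return perm


matrix = [[1.0, 2.0, 9.0], [8.0, 1.0, 2.0], [2.0, 7.0, 1.0]]
print(blocked_lu(matrix, 2))
```

The innermost triple loop of the C-matrix update is the part that grows with the square of the trailing matrix size times nb, which is why DGEMM dominates the run time.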
NAIVE HPL-GPU IMPLEMENTATION
- First naive HPL-GPU implementation offloads the large DGEMM for the C-matrix update to the GPU.
- GPU DGEMM makes up 78% of total execution time.
- The GPU idles 22% of the time.
- As only DGEMM contributes considerably to the overall calculation effort, the HPL performance is limited to 78% of DGEMM performance.
HIDING FACTORIZATION AND BROADCAST TIME (LOOKAHEAD)
- Factorization requires only the first NB columns of the previous iteration to finish
- Factorization and broadcast for the next iteration can be processed in parallel with the DGEMM.
- Two issues occur:
  - Running pre- and postprocessing in parallel with factorization leads to memory congestion.
  - During broadcast, only one CPU core is active.
IMPROVED LOOKAHEAD
- CPU DGEMM split in two parts to improve utilization during broadcast.
- CPU cores intentionally left idle during factorization to avoid memory congestion and ensure full GPU DGEMM performance.
- Binary patch of the AMD driver changes memory policies and decreases page fault rate.
HIDING PIVOTING TIME (LOOKAHEAD 2)
- Pivoting time can also be hidden
- HPL performance improves significantly with lookahead.
HPL PERFORMANCE
- Multi-node and single-node performance is consistently higher with lookahead.
SINGLE GPU PEAK PERFORMANCES
- Peak performance achieved at LOEWE-CSC.
- Two 6172 Magny-Cours CPU 2.1 GHz, GPU.
- Multi-node per-node performance achieves 93.6% of single-node performance.
Table columns: Discipline, Performance, Peak, Efficiency
Table rows: DGEMM Kernel, GPU DGEMM System, GPU/CPU DGEMM System, Single-node HPL, Multi-node HPL
MULTI-GPU HPL
Concept
- More threads used for pre- and postprocessing.
- Dynamic factorization thread count improves performance, especially for k = NB >=
MULTI-GPU HPL
Benchmarks
- First multi-GPU benchmarks (2 * 6174 CPU, 3 * 5870 GPU)
  - HPL: 1114 Gflop/s
  - DGEMM: 1432 Gflop/s
- Multi-GPU efficiency (2 * 6174 CPU, 3 * V7800 GPU)
  - HPL: 1230 Gflop/s / Watt
- Efficiency optimized version offloads as much workload as possible to the GPU, even if this leaves the CPU idling.
VII: INTERLAGOS
INTERLAGOS
Looking ahead
Expectations
- Optimized BLAS libraries achieve peak performance for any CPU.
- Since two cores of a module share a floating point unit, DGEMM may not have to run on all cores.
- These cores can be used for GPU pre-/postprocessing.
- No big changes necessary; CPU contribution to overall performance is small, at least for multi-GPU.
Problems
- Support for 3DNow! dropped.
- GotoBLAS no longer adapted to new CPUs.
- Patches to GotoBLAS for CPU core reservation need to be ported to another BLAS library.
HETEROGENEOUS SYSTEMS
Work Distribution
- Interlagos as update option
  - For LOEWE-CSC as additional nodes
- Traditional distribution targets homogeneous systems
- LOEWE-CSC already heterogeneous
  - Quad nodes without GPU
HETEROGENEOUS SYSTEMS
Work Distribution
- Group nodes into performance classes
- Size submatrices according to performance
- Skip process rows during allocation
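One way to picture "skipping process rows during allocation" is a cyclic block allocation in which slower process rows are passed over in some rounds, so that the number of blocks each row receives matches its relative speed. The greedy rule below is an assumption chosen for illustration, not the allocation scheme used in the modified HPL; the function name and the speed values are hypothetical.

```python
def allocate_blocks(num_blocks, speeds):
    """Assign block indices to process rows in proportion to their speed.

    Each block goes to the row that would finish it earliest, i.e. the row
    with the smallest (blocks_assigned + 1) / speed. Slower rows are
    therefore skipped in some rounds. Returns the row index for each block.
    """
    counts = [0] * len(speeds)
    owner = []
    for _ in range(num_blocks):
        row = min(range(len(speeds)),
                  key=lambda r: (counts[r] + 1) / speeds[r])
        counts[row] += 1
        owner.append(row)
    return owner


# Two fast process rows and one row of half the speed (assumed values).
owner = allocate_blocks(10, [1.0, 1.0, 0.5])
print(owner)
print([owner.count(r) for r in range(3)])
```

In this example the slow row ends up with half as many blocks as each fast row, which keeps all rows busy for roughly the same time.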
HETEROGENEOUS SYSTEMS
Benchmarks
Benchmarks on 6 node setup 2 Quads MHz MHz
Original version speed ~ 6 times Quad speed ~ 25% granularity loss
Optimized version ~ 3% granularity loss
SUMMARY
- High performance DGEMM kernels
  - 494 Gflops % of peak
- High performance CPU/GPU DGEMM
  - Automatic load balancing
  - 625 Gflops, 84% of peak
- Modified HPL
  - Keeping GPU busy
  - 563 Gflops, 76% of peak
- Scaling to multi-GPU nodes
- Scaling to many nodes
- Minimum granularity loss on heterogeneous systems
- Open Source:
QUESTIONS
Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD's positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.
More informationHPCA 18. Reliability-aware Data Placement for Heterogeneous memory Architecture
HPCA 18 Reliability-aware Data Placement for Heterogeneous memory Architecture Manish Gupta Ψ, Vilas Sridharan*, David Roberts*, Andreas Prodromou Ψ, Ashish Venkat Ψ, Dean Tullsen Ψ, Rajesh Gupta Ψ Ψ *
More informationD3D12 & Vulkan: Lessons learned. Dr. Matthäus G. Chajdas Developer Technology Engineer, AMD
D3D12 & Vulkan: Lessons learned Dr. Matthäus G. Chajdas Developer Technology Engineer, AMD D3D12 What s new? DXIL DXGI & UWP updates Root Signature 1.1 Shader cache GPU validation PIX D3D12 / DXIL DXBC
More informationAMD EPYC CORPORATE BRAND GUIDELINES
AMD EPYC CORPORATE BRAND GUIDELINES VERSION 1 MAY 2017 CONTACT Address Advanced Micro Devices, Inc 7171 Southwest Pkwy Austin, Texas 78735 United States Phone 1-512-602-1000 Online Email: Brand.Team@amd.com
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationLinux Network Tuning Guide for AMD EPYC Processor Based Servers
Linux Network Tuning Guide for AMD EPYC Processor Application Note Publication # 56224 Revision: 1.00 Issue Date: November 2017 Advanced Micro Devices 2017 Advanced Micro Devices, Inc. All rights reserved.
More information1 Presentation Title Month ##, 2012
1 Presentation Title Month ##, 2012 Malloc in OpenCL kernels Why and how? Roy Spliet Bsc. (r.spliet@student.tudelft.nl) Delft University of Technology Student Msc. Dr. A.L. Varbanescu Prof. Dr. Ir. H.J.
More informationDR. LISA SU
CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to AMD s strategy and focus, expected datacenter total
More informationIntroducing NVDIMM-X: Designed to be the World s Fastest NAND-Based SSD Architecture and a Platform for the Next Generation of New Media SSDs
, Inc. Introducing NVDIMM-X: Designed to be the World s Fastest NAND-Based SSD Architecture and a Platform for the Next Generation of New Media SSDs Doug Finke Director of Product Marketing September 2016
More informationKVM CPU MODEL IN SYSCALL EMULATION MODE ALEXANDRU DUTU, JOHN SLICE JUNE 14, 2015
KVM CPU MODEL IN SYSCALL EMULATION MODE ALEXANDRU DUTU, JOHN SLICE JUNE 14, 2015 AGENDA Background & Motivation Challenges Native Page Tables Emulating the OS Kernel 2 KVM CPU MODEL IN SYSCALL EMULATION
More informationAMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016
AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING BILL.BRANTLEY@AMD.COM, FELLOW 3 OCTOBER 2016 AMD S VISION FOR EXASCALE COMPUTING EMBRACING HETEROGENEITY CHAMPIONING OPEN SOLUTIONS ENABLING LEADERSHIP
More informationROCm: An open platform for GPU computing exploration
UCX-ROCm: ROCm Integration into UCX {Khaled Hamidouche, Brad Benton}@AMD Research ROCm: An open platform for GPU computing exploration 1 JUNE, 2018 ISC ROCm Software Platform An Open Source foundation
More informationAMD S X86 OPEN64 COMPILER. Michael Lai AMD
AMD S X86 OPEN64 COMPILER Michael Lai AMD CONTENTS Brief History AMD and Open64 Compiler Overview Major Components of Compiler Important Optimizations Recent Releases Performance Applications and Libraries
More informationHigh Performance Graphics 2010
High Performance Graphics 2010 1 Agenda Radeon 5xxx Product Family Highlights Radeon 5870 vs. 4870 Radeon 5870 Top-Level Radeon 5870 Shader Core References / Links / Screenshots Questions? 2 ATI Radeon
More information1 HiPEAC January, 2012 Public TASKS, FUTURES AND ASYNCHRONOUS PROGRAMMING
1 HiPEAC January, 2012 Public TASKS, FUTURES AND ASYNCHRONOUS PROGRAMMING TASK-PARALLELISM OpenCL, CUDA, OpenMP (traditionally) and the like are largely data-parallel models Their core unit of parallelism
More informationDesktop Telepresence Arrived! Sudha Valluru ViVu CEO
Desktop Telepresence Arrived! Sudha Valluru ViVu CEO 3 Desktop Telepresence Arrived! Video Collaboration market Telepresence Telepresence Cost Expensive Expensive HW HW Legacy Apps Interactivity ViVu CONFIDENTIAL
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationAMD HD3D Technology. Setup Guide. 1 AMD HD3D TECHNOLOGY: Setup Guide
AMD HD3D Technology Setup Guide 1 AMD HD3D TECHNOLOGY: Setup Guide Contents AMD HD3D Technology... 3 Frame Sequential Displays... 4 Supported 3D Display Hardware... 5 AMD Display Drivers... 5 Configuration
More informationLIQUIDVR TODAY AND TOMORROW GUENNADI RIGUER, SOFTWARE ARCHITECT
LIQUIDVR TODAY AND TOMORROW GUENNADI RIGUER, SOFTWARE ARCHITECT Bootstrapping the industry for better VR experience Complimentary to HMD SDKs It s all about giving developers the tools they want! AMD LIQUIDVR
More informationEPYC VIDEO CUG 2018 MAY 2018
AMD UPDATE CUG 2018 EPYC VIDEO CRAY AND AMD PAST SUCCESS IN HPC AMD IN TOP500 LIST 2002 TO 2011 2011 - AMD IN FASTEST MACHINES IN 11 COUNTRIES ZEN A FRESH APPROACH Designed from the Ground up for Optimal
More informationRegMutex: Inter-Warp GPU Register Time-Sharing
RegMutex: Inter-Warp GPU Register Time-Sharing Farzad Khorasani* Hodjat Asghari Esfeden Amin Farmahini-Farahani Nuwan Jayasena Vivek Sarkar *farkhor@gatech.edu The 45 th International Symposium on Computer
More informationSequential Consistency for Heterogeneous-Race-Free
Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K. REINHARDT, DAVID A. WOOD JUNE 12, 2013 EXECUTIVE
More informationMicrosoft Windows 2016 Mellanox 100GbE NIC Tuning Guide
Microsoft Windows 2016 Mellanox 100GbE NIC Tuning Guide Publication # 56288 Revision: 1.00 Issue Date: June 2018 2018 Advanced Micro Devices, Inc. All rights reserved. The information contained herein
More informationOptimizing the operations with sparse matrices on Intel architecture
Optimizing the operations with sparse matrices on Intel architecture Gladkikh V. S. victor.s.gladkikh@intel.com Intel Xeon, Intel Itanium are trademarks of Intel Corporation in the U.S. and other countries.
More informationAMD 780G. Niles Burbank AMD. an x86 chipset with advanced integrated GPU. Hot Chips 2008
AMD 780G an x86 chipset with advanced integrated GPU Hot Chips 2008 Niles Burbank AMD Agenda Evolving PC expectations AMD 780G Overview Design Challenges Video Playback Support Display Capabilities Power
More informationLinux Network Tuning Guide for AMD EPYC Processor Based Servers
Linux Network Tuning Guide for AMD EPYC Processor Application Note Publication # 56224 Revision: 1.10 Issue Date: May 2018 Advanced Micro Devices 2018 Advanced Micro Devices, Inc. All rights reserved.
More informationAMD RYZEN CORPORATE BRAND GUIDELINES
AMD RYZEN CORPORATE BRAND GUIDELINES VERSION 4 - JULY 2017 CONTACT Address Advanced Micro Devices, Inc 7171 Southwest Pkwy Austin, Texas 78735 United States Phone Phone: 1-512-602-1000 Online Email: Brand.Team@amd.com
More informationNEXT-GENERATION MATRIX 3D IMMERSIVE USER INTERFACE [ M3D-IUI ] H Raghavendra Swamy AMD Senior Software Engineer
NEXT-GENERATION MATRIX 3D IMMERSIVE USER INTERFACE [ M3D-IUI ] H Raghavendra Swamy AMD Senior Software Engineer SESSION AGENDA Quick Keywords Abstract and Scope Introduction Current User Interface [ UI
More informationPROTECTING VM REGISTER STATE WITH AMD SEV-ES DAVID KAPLAN LSS 2017
PROTECTING VM REGISTER STATE WITH AMD SEV-ES DAVID KAPLAN LSS 2017 BACKGROUND-- HARDWARE MEMORY ENCRYPTION AMD Secure Memory Encryption (SME) / AMD Secure Encrypted Virtualization (SEV) Hardware AES engine
More informationCilk Plus: Multicore extensions for C and C++
Cilk Plus: Multicore extensions for C and C++ Matteo Frigo 1 June 6, 2011 1 Some slides courtesy of Prof. Charles E. Leiserson of MIT. Intel R Cilk TM Plus What is it? C/C++ language extensions supporting
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationIntel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager
Intel Math Kernel Library (Intel MKL) BLAS Victor Kostin Intel MKL Dense Solvers team manager Intel MKL BLAS/Sparse BLAS Original ( dense ) BLAS available from www.netlib.org Additionally Intel MKL provides
More informationオープンソ プンソース技術者のための AMD 最新テクノロジーアップデート 日本 AMD 株式会社 マーケティング ビジネス開発本部 エンタープライズプロダクトマーケティング部 山野 洋幸
AMD AMD CPU 2 Happy 6 th Birthday AMD Opteron Processor 3 6コア Istanbul : 完全な進捗状況 Executing months ahead of schedule In collaboration with GLOBALFOUNDRIES: first tapeout to production World s only six-core
More informationPattern-based analytics to estimate and track yield risk of designs down to 7nm
DAC 2017 Pattern-based analytics to estimate and track yield risk of designs down to 7nm JASON CAIN, MOUTAZ FAKHRY (AMD) PIYUSH PATHAK, JASON SWEIS, PHILIPPE HURAT, YA-CHIEH LAI (CADENCE) INTRODUCTION
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationPerformance of the AMD Opteron LS21 for IBM BladeCenter
August 26 Performance Analysis Performance of the AMD Opteron LS21 for IBM BladeCenter Douglas M. Pase and Matthew A. Eckl IBM Systems and Technology Group Page 2 Abstract In this paper we examine the
More informationclarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018
clarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018 ANECDOTE DISCOVERING A BUFFER OVERFLOW CPU GPU MEMORY MEMORY Data Data Data Data Data 2 clarmor: A
More informationVulkan (including Vulkan Fast Paths)
Vulkan (including Vulkan Fast Paths) Łukasz Migas Software Development Engineer WS Graphics Let s talk about OpenGL (a bit) History 1.0-1992 1.3-2001 multitexturing 1.5-2003 vertex buffer object 2.0-2004
More informationMULTIMEDIA PROCESSING Real-time H.264 video enhancement by using AMD APP SDK
MULTIMEDIA PROCESSING Real-time H.264 video enhancement by using AMD APP SDK Wei-Lien Hsu AMD SMTS Gongyuan Zhuang AMD MTS OUTLINE Motivation OpenDecode Video deblurring algorithms Acceleration by clamdfft
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationCache-oblivious Programming
Cache-oblivious Programming Story so far We have studied cache optimizations for array programs Main transformations: loop interchange, loop tiling Loop tiling converts matrix computations into block matrix
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More informationAMD AIB Partner Guidelines. Version February, 2015
AMD AIB Partner Guidelines Version 1.0 - February, 2015 The Purpose of This Document These guidelines provide direction for our Add-in-Board (AIB) partners and customers to market the benefits of AMD products
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationAccelerating Linpack with CUDA on heterogeneous clusters
Accelerating Linpack with CUDA on heterogeneous clusters Massimiliano Fatica NVIDIA Corporation 2701 San Tomas Expressway Santa Clara CA 95050 mfatica@nvidia.com ABSTRACT This paper describes the use of
More informationAMD Radeon ProRender plug-in for Unreal Engine. Installation Guide
AMD Radeon ProRender plug-in for Unreal Engine Installation Guide This document is a guide on how to install and configure AMD Radeon ProRender plug-in for Unreal Engine. DISCLAIMER The information contained
More information1401 HETEROGENEOUS HPC How Fusion Designs Can Advance Science
1401 HETEROGENEOUS HPC How Fusion Designs Can Advance Science Ben Bergen Los Alamos National Laboratory Research Scientist Marcus Daniels Los Alamos National Laboratory Research Scientist VPIC Team Brian
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More information