Array transposition in CUDA shared memory


Mike Giles
February 19, 2014

Abstract

This short note is inspired by some code written by Jeremy Appleyard for the transposition of data through shared memory. I had some difficulty getting my head around it, and decided it would be helpful to have a few figures to explain it. I've also extended it slightly to cover more general cases.

[Figure 1: Illustration of array to be written into, and read from]

1 Objective

As illustrated in the figure above, we want to work with a shared memory array which is mathematically of width I and height 32 (equal to the warp size), indexed by a column index i (0 <= i < I) and a row index j (0 <= j < 32). We want to effectively transpose some data by writing into it with the threads in the thread block filling it row-wise, working across the first row, then the second, and so on in ascending order of i + jI, and then (after a synchronisation) reading data out of it column-wise, working up the first column, then the second, and so on in ascending order of j + 32i. Or, vice versa, we might want to fill it by columns and then read it out by rows.

The main application for this is loading in (or storing) data which is stored as an Array-of-Structs, each of size I. To achieve a coalesced read from device memory using a single warp, the threads can load in contiguous vectors from device memory and fill the shared memory array by rows. Then they can read it back into registers from columns so that each thread gets its required struct data. The process would be reversed for writing back to device memory. In both cases, the array index in device memory would correspond to i + jI.

The second application arises in the ADI solvers which Jeremy and I are working on. Here, for part of the calculation, in order to maximise coalesced memory transfers it is efficient for threads to work on the array row-wise initially, but there is then a middle section in which it is necessary to work on columns with a separate warp for each column, before finally reverting to the original thread mapping for the final stage.

The challenge is to come up with a mapping (i, j) -> k to an index k in the linear shared memory array so that there are no memory bank conflicts when accessing the data in either direction.
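To make the two access orders concrete, here is a small Python sketch (not part of the original note) that simulates a single warp doing the Array-of-Structs load described above, using the plain index i + jI. The function name `transpose_through_shared` and the test data are illustrative assumptions; in a CUDA kernel the list `shared` would be a `__shared__` array and the two phases would be separated by a synchronisation.

```python
WARP = 32  # warp size, equal to the height of the array

def transpose_through_shared(device, I):
    """Simulate one warp loading 32 structs of I fields each (hypothetical helper)."""
    shared = [None] * (WARP * I)
    # Row-wise fill: in pass p, thread t copies the contiguous element n = 32*p + t,
    # where n = i + j*I in the note's notation.
    for p in range(I):
        for t in range(WARP):
            n = WARP * p + t
            shared[n] = device[n]
    # (a synchronisation would go here in a real kernel)
    # Column-wise read: thread j collects field i of struct j from index i + j*I.
    return [[shared[i + j * I] for i in range(I)] for j in range(WARP)]

I = 5
device = list(range(WARP * I))  # stand-in for the Array-of-Structs in device memory
structs = transpose_through_shared(device, I)
print(structs[0])   # [0, 1, 2, 3, 4]
print(structs[31])  # [155, 156, 157, 158, 159]
```

Each thread j ends up with the I consecutive device-memory values that make up its struct, even though every pass of the fill touched 32 contiguous addresses.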

    j = 2:   10 11 12 13 14
    j = 1:    5  6  7  8  9
    j = 0:    0  1  2  3  4

Figure 2: Shared memory indices when I = 5.

2 I odd

When I is odd we can define k = i + jI. This naturally gives no bank conflicts when reading row-wise, since each warp gets 32 contiguous addresses and current NVIDIA GPUs have 32 shared memory banks. Furthermore, there are no bank conflicts in each column, because j = 32 is the smallest strictly positive integer such that jI mod 32 = 0, which would lead to a bank conflict with the element j = 0.
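The claim can be checked by brute force. The following Python sketch is an illustration rather than anything from the original note: it assumes 32 banks that are each one 4-byte word wide, so that the bank of index k is k mod 32, and the helper names `worst_conflict` and `unpadded` are hypothetical. It reports the largest number of threads in any single warp access that land in the same bank.

```python
from collections import Counter

WARP = 32   # threads per warp, and rows in the array
BANKS = 32  # shared memory banks, assumed one 4-byte word wide

def worst_conflict(index, I):
    """Largest number of threads in any one warp access that hit the same bank."""
    worst = 1
    # Row-wise: each warp touches 32 consecutive values of n = i + j*I.
    for start in range(0, WARP * I, WARP):
        banks = Counter(index(n % I, n // I, I) % BANKS
                        for n in range(start, start + WARP))
        worst = max(worst, max(banks.values()))
    # Column-wise: one warp per column i, with thread j touching element (i, j).
    for i in range(I):
        banks = Counter(index(i, j, I) % BANKS for j in range(WARP))
        worst = max(worst, max(banks.values()))
    return worst

def unpadded(i, j, I):
    return i + j * I

print(worst_conflict(unpadded, 5))  # 1
print(worst_conflict(unpadded, 4))  # 4
```

A result of 1 means conflict-free. The odd width 5 passes, while the even width 4 gives a 4-way conflict in every column, which motivates the padding schemes for even I.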

    j = 9:   37 38 39 40
    j = 8:   33 34 35 36    (padded by 1)
    j = 7:   28 29 30 31
      ...
    j = 2:    8  9 10 11
    j = 1:    4  5  6  7
    j = 0:    0  1  2  3

Figure 3: Shared memory indices when I = 4.

3 I a power of 2

When I is a power of 2, the definition k = i + jI would lead to bank conflicts along each column. With I = 4 as in the figure, the first bank conflict is when i = 0, j = 8, k = 32. This suggests the idea of padding by 1 after every 32 elements, giving the mapping k = i + jI + (i + jI)/32, where the division is interpreted in the integer sense (i.e. discarding the remainder).
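As a quick check of the padded mapping, the Python sketch below (an illustration, with the hypothetical function name `padded_pow2`) reproduces the indices of Figure 3 and confirms that a column then covers each of the 32 banks exactly once. It assumes the bank of index k is k mod 32 and only exercises powers of 2 up to 32.

```python
def padded_pow2(i, j, I):
    n = i + j * I       # position in the unpadded array
    return n + n // 32  # skip one slot after every 32 elements

I = 4
for j in (9, 8, 7):
    print(j, [padded_pow2(i, j, I) for i in range(I)])
# 9 [37, 38, 39, 40]
# 8 [33, 34, 35, 36]
# 7 [28, 29, 30, 31]

# Banks touched by the 32 threads working up column 0.
banks = sorted(padded_pow2(0, j, I) % 32 for j in range(32))
print(banks == list(range(32)))  # True
```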

    j = 16:  97 98 99 ...          (padded by 1)
    j = 15:  90 91 92 93 94 95
      ...
    j = 0:    0  1  2  3  4  5

Figure 4: Shared memory indices when I = 6.

4 General I

Having handled the two extreme cases, we now consider the general case in which I = P F, where P is a power of 2 and F is odd. In this case k = i + jI leads to the first bank conflict when i = 0 and jI mod 32 = 0.

If P <= 32 then this implies jF mod (32/P) = 0, which happens first when j = 32/P, and hence jI = 32F. Thus, the padded definition to avoid conflicts is k = i + jI + (i + jI)/(32F). (Note: when I = F, then (i + jI)/(32F) = 0, which correctly gives us back the unpadded version.)

Alternatively, if P > 32 then the first bank conflict occurs when j = 1, and an appropriate padding is k = i + jI + j.
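The general rule is easy to get wrong by one, so a brute-force check is useful. The Python sketch below is an illustration under the same assumption as before (32 banks, bank of index k equal to k mod 32); the names `split_pow2`, `general_index` and `conflict_free` are hypothetical. It implements both branches of the mapping and tests every width from 1 to 128 in both access directions.

```python
def split_pow2(I):
    """Return (P, F) with I = P * F, P a power of 2 and F odd."""
    P = I & -I  # largest power of 2 dividing I
    return P, I // P

def general_index(i, j, I):
    P, F = split_pow2(I)
    n = i + j * I
    if P <= 32:
        return n + n // (32 * F)  # pad by 1 after every 32*F elements
    return n + j                  # P > 32: pad by 1 after every row

def conflict_free(I):
    # Row-wise: each warp touches 32 consecutive values of n = i + j*I.
    rows = all(
        len({general_index(n % I, n // I, I) % 32 for n in range(s, s + 32)}) == 32
        for s in range(0, 32 * I, 32))
    # Column-wise: one warp per column i, thread j touching element (i, j).
    cols = all(
        len({general_index(i, j, I) % 32 for j in range(32)}) == 32
        for i in range(I))
    return rows and cols

print(general_index(0, 16, 6))                       # 97
print(all(conflict_free(I) for I in range(1, 129)))  # True
```

For 0 <= i < I the padding term (i + jI)/(32F) equals j/(32/P) in integer arithmetic, so in the P <= 32 branch the padding depends only on the row index j.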

Implementation notes

The simplest thing is to define the (i, j) pairs for loading and storing, and then compute the k for each. At worst, the padding increases the shared memory requirements by approximately 3%. The computation of k requires 2 additional integer operations, a bit-shift and an addition.

An alternative, when I is even, is to use the mapping k = i + j(I + 1). This avoids the 2 additional integer operations, but at the expense of using more shared memory.
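To put numbers on the storage trade-off, the following Python sketch (an illustration; `words_needed`, `padded` and `row_padded` are hypothetical names) counts how many extra shared memory words each mapping needs beyond the unpadded 32 I. It compares storage only and does not examine bank conflicts.

```python
def words_needed(index, I):
    """Shared memory words used: largest index touched, plus one."""
    return 1 + max(index(i, j, I) for j in range(32) for i in range(I))

def padded(i, j, I):
    P = I & -I  # largest power of 2 dividing I
    F = I // P  # odd factor
    n = i + j * I
    return n + n // (32 * F) if P <= 32 else n + j

def row_padded(i, j, I):
    return i + j * (I + 1)  # the alternative mapping for even I

for I in (4, 6, 32):
    base = 32 * I
    print(I, words_needed(padded, I) - base, words_needed(row_padded, I) - base)
# 4 3 31
# 6 1 31
# 32 31 31
```

For these widths the alternative mapping always costs 31 extra words, which is a large fraction of the array when I is small, while the padded mapping costs far fewer. At I = 32 the two coincide at 31 extra words on 1024, about 3%, in line with the figure quoted above.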