A GPGPU Algorithm for c-approximate r-nearest Neighbor Search in High Dimensions


Lee A. Carraher, Philip A. Wilsey, and Fred S. Annexstein
School of Electronic and Computing Systems, University of Cincinnati
May 24, 2013

Parallel Nearest Neighbor Search
- An approximate nearest neighbor algorithm in parallel
- GPGPU implementation of the Leech lattice decoder for LSH NN
- Other non-sequential parts of LSH NN in parallel (projection, sorting)
- Optimized for the CUDA architecture

Motivations
- NN is an intuitive, powerful basis for machine learning algorithms
- Approximate answers are good enough for big data
- The GPU provides many benefits and challenges:
  - High ALU-to-control allocation ratio
  - Remapping for data-intensive processing
  - Memory transfer bottlenecks (no MMU or cache)
  - Getting the working set to fit in shared memory

NN, KNN, c-approx NN, cr-approx NN
- NN problem: given a query vector, return the nearest vector in a set of vectors by minimizing some distance function
- KNN is NN where the K nearest vectors are returned (NN is KNN with K = 1)
- c-approx NN returns a neighbor whose distance is within a constant factor (1 + ɛ) of the optimal NN's distance
- cr-approx NN extends c-approx NN by returning all the vectors within a radius r of the query vector
A minimal formal sketch of these definitions follows.
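These variants can be written compactly in standard notation (the symbols D, V, q, and ɛ below are ours; they are not named on the slide):

    NN(q) = argmin_{v in V} D(q, v)
    KNN_K(q) = the K points v in V with the smallest D(q, v)
    c-approx NN: return any v with D(q, v) <= (1 + ɛ) * min_{u in V} D(q, u)
    cr-approx NN: report the vectors within radius r of q, allowing reported vectors out to distance c * r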

Problems With Other Parallel Search
GPU linear search:
- Fast, simple implementation
- Linear search complexity and speedup
- All vectors must be in main memory
GPU kd-tree search:
- Vectors do not have to be in main memory
- Works well for d < 20
- Parallel DFS has issues scaling

Locality Sensitive Hashing for NN
- Nearest neighbor search algorithm based on LSH functions
- Collision probability corresponds to closeness
- Collision probability is a pseudo-distance function
- Only consider colliding vectors
- Exhaustively search colliding vectors
The standard sensitivity condition behind the second point is sketched below.
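The usual way "collision probability corresponds to closeness" is formalized is as an (r, cr, p1, p2)-sensitive hash family (this standard definition is not spelled out on the slide):

    Pr[h(p) = h(q)] >= p1   whenever D(p, q) <= r
    Pr[h(p) = h(q)] <= p2   whenever D(p, q) >= c*r,   with p1 > p2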

LSH Query
The query pipeline runs multiple times against a hashtable of near-neighbor datapoint labels:
1. Query: randomly project the vectors down to the LSH dimension
2. Compute the lattice hash
3. Create the sequence of near-neighbor candidates from the hashtable
4. Compute real distances and sort
A host-side sketch of this flow is given below.
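A minimal host-side sketch of this query flow, assuming a single hash table. The names (randomProject, latticeHash, lshQuery) and the stand-in quantizing hash used here in place of the actual Leech lattice decoder are illustrative only, not the authors' code:

    // Host-side sketch of the LSH query flow (CUDA/C++ host code).
    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    using Vec = std::vector<float>;

    // Random projection from the input dimension down to the LSH (lattice) dimension.
    Vec randomProject(const Vec& v, const std::vector<Vec>& P) {
        Vec out(P.size(), 0.0f);
        for (size_t i = 0; i < P.size(); ++i)
            for (size_t j = 0; j < v.size(); ++j)
                out[i] += P[i][j] * v[j];
        return out;
    }

    // Stand-in bucket hash: quantize each projected coordinate and mix.
    // The real system decodes the projected point with the Leech lattice decoder instead.
    uint64_t latticeHash(const Vec& p) {
        uint64_t h = 1469598103934665603ULL;
        for (float x : p)
            h = (h ^ static_cast<uint64_t>(static_cast<int64_t>(std::floor(x)))) * 1099511628211ULL;
        return h;
    }

    float dist2(const Vec& a, const Vec& b) {
        float s = 0.0f;
        for (size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    // Query: project, hash, gather colliding labels, then rank by true distance.
    std::vector<size_t> lshQuery(const Vec& q, const std::vector<Vec>& db,
                                 const std::vector<Vec>& P,
                                 const std::unordered_map<uint64_t, std::vector<size_t>>& table) {
        uint64_t h = latticeHash(randomProject(q, P));
        std::vector<size_t> cand;
        auto it = table.find(h);
        if (it != table.end()) cand = it->second;
        std::sort(cand.begin(), cand.end(),
                  [&](size_t a, size_t b) { return dist2(q, db[a]) < dist2(q, db[b]); });
        return cand;  // candidate labels, nearest first
    }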

An Example Hash Family
Figure: random projection of R^2 onto R^1, with the projected axis cut into buckets of width w (labeled 0, 1, 2, 3 in the figure). The standard form of this hash family is given below.
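The random-projection family the figure illustrates is conventionally written as follows (the projection vector a and offset b are the usual parameters of this family; only the bucket width w appears on the slide):

    h_{a,b}(v) = floor( (a . v + b) / w ),   with a drawn from a Gaussian and b uniform in [0, w)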

ECC as Hash Functions
- The goal of an ECC is the same as that of a good LSH function
- Send data over a noisy channel and correct the errors
- Only a subset of the data space is valid (codes/hashes)
- Codes correct anything within an error-correcting radius
- Shannon's limit (optimal codes exist, but they are complex)
- Fast implementation (less critical here, since CPU clock >> channel capacity)
- Need to find a compromise between decoding complexity and the native dimensionality of the ECC
A sketch of the decoder-as-hash view follows.
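Viewed as a hash, the decoder maps a projected point to its nearest codeword, so points that fall within the same error-correcting radius collide. In symbols (P is the random projection and Dec the decoder; this composition is our shorthand for the pipeline on the LSH Query slide):

    h(v) = Dec(P v),   where Dec(x) = argmin_{c in C} ||x - c||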

Comparison of ECC Decoders
Figure: error-correcting performance over 3,072,000 samples; probability of bit error (-log10 Pb) versus Eb/N0 (dB) for the Leech lattice (hex, -2 dB, with QAM16), E8 with 2PSK, unencoded QAM64, and unencoded QAM16.
The Leech lattice partitions R^24, in 519 operations.

Parallel Λ24 Decoder Block Decoding
Figure: thread-block layout of the parallel Leech lattice (Λ24) decoder. The received vector r in global memory is split into halves r[0:12] and r[12:24], and each half is QAM-decoded against the A-point and B-point sets to produce distances d_ij. These feed the even/odd block-confidence stages (mu/preferable-representative values), MinH6 selection, H-parity and K-parity checks, and per-branch codeword weights (weight_AE, weight_AO, weight_BE, weight_BO), with a final weight comparison selecting the output codeword.

GPU Tweak: Avoid Branching
- Warps (sets of 16 threads) run as SIMD
- Branches cause a warp to be rescheduled and run again
- For a uniform boolean test the chance of divergence is 99.998% (1 - 2^-16)
- The conversion provides a 68% speedup for our algorithm

Conditional code:
    if distA < distB:
        d_ij[4i] = distA
        d_ijk[4i] = distB
        kparities[4i] = 0
    else:
        d_ij[4i] = distB
        d_ijk[4i] = distA
        kparities[4i] = 1

Sequential-access code:
    d = distA < distB
    d_ij[4i] = distA*d + distB*(!d)
    d_ijk[4i] = distB*d + distA*(!d)
    kparities[4i] = !d

A CUDA sketch of the same predication idea follows.
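A minimal CUDA device-side sketch of this branch-avoidance idea (the kernel name, array names, and launch shape here are illustrative, not the authors' decoder code):

    // Each thread selects the smaller/larger of two distances without branching,
    // so all threads in a warp follow the same instruction stream.
    __global__ void selectMinMax(const float* distA, const float* distB,
                                 float* dMin, float* dMax, int* parity, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;                 // tail-guard; uniform for full warps

        float a = distA[i];
        float b = distB[i];
        int d = (a < b);                    // predicate as a 0/1 integer
        dMin[i] = a * d + b * (1 - d);      // smaller distance
        dMax[i] = b * d + a * (1 - d);      // larger distance
        parity[i] = 1 - d;                  // 0 if a < b, else 1
    }

Launched as, e.g., selectMinMax<<<(n + 255) / 256, 256>>>(...), this performs the same selection as the if/else above using arithmetic predication instead of divergent control flow.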

Results For SIFT Vectors
Figure: comparison of average query time (s) for LSH search and linear search as a function of the number of images (0 to 800).

Result Extrapolation For Larger DBs
Figure: extrapolated comparison of time for LSH and linear search (semilog), average time versus number of images up to 100,000. The extrapolated linear-search times follow a linear fit in the number of images, while the extrapolated LSH times follow a sub-linear power-law fit, so the gap widens as the database grows.

Ratio of Sequential To Parallel Code
Figure: query database, parallel/total compute time for SIFT samples (300 samples); the parallel fraction averages 0.9340 (R^2 = 2.83e-05).

Scaled Speedup Results
Figure: query, total speedup as a function of parallel speedup (up to 200x), with Amdahl-style reference curves at 90%, 95%, and 97% alongside our theoretical speedup at a 93.4% parallel fraction. The speedup formula behind these curves is worked out below.
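For reference, the standard speedup relation these curves follow (Amdahl's law; the formula itself is not printed on the slide), evaluated at the measured parallel fraction f = 0.934:

    S(s) = 1 / ((1 - f) + f/s),   so   lim_{s -> inf} S = 1 / (1 - f) = 1 / 0.066 ≈ 15.2

Even with unbounded parallel speedup s on the GPU portions, the total query speedup is capped near 15x, which is why the remaining sequential parts of the algorithm matter.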

Conclusions
- GPU-parallel implementations of lattice hashing accelerate the NN algorithm
- The required sequential parts of the LSH algorithm will always cause bottlenecks
- The speedup scales within the processor-count range of some medium embedded applications (UAVs, cellphones, surveillance systems)
- Tweaking further parts of the LSH algorithm for the GPU:
  - Random projection calculation (bulk or per query)
  - Hash list sorting and intersection (per query)
  - Linear searching of candidate NNs (for exact NN)