A GPGPU Algorithm for c-approximate r-nearest Neighbor Search in High Dimensions

Size: px

Start display at page:

Download "A GPGPU Algorithm for c-approximate r-nearest Neighbor Search in High Dimensions"

Lester Bryan
6 years ago
Views:

1 A GPGPU Algorithm for c-approximate r-nearest Neighbor Search in High Dimensions Lee A Carraher, Philip A Wilsey, and Fred S Annexstein School of Electronic and Computing Systems University of Cincinnati May 24, 2013

2 2 / 17 Parallel Nearest Neighbor Search An approximate Nearest Neighbor algorithm in parallel GPGPU implementation of the Leech Lattice decoder for LSH NN Other non-sequential parts of LSH NN in parallel (projection, sorting) Optimized for the CUDA architecture

3 3 / 17 Motivations NN is an intuitive, powerful basis for machine learning algorithms Approximate answers are good enough for big data GPU provides many benefits and challenges High ALU to Control allocation ratio Remap for data intensive processing Memory transfer bottlenecks (no MMU or cache) Getting it to fit in shared memory

4 4 / 17 NN, KNN, c-approx NN, cr-approx NN NN problem: given a query vector, return the nearest vector in a set of vectors by minimizing some distance function KNN is NN but the K nearest vectors are returned (NN is KNN where K=1) c approx NN returns a NN within a constant ɛ-distances of the optimal NN cr approx NN extends the c-approx NN by returning all the vectors within a radius r of the query vector

5 5 / 17 Problems With Other Parallel Search GPU Linear Search Fast, simple implementation Linear search complexity and speedup All vectors must be in MM GPU Kd-Tree Search Vectors do not have to be in MM Work well for d<20 Parallel DFS has issues scaling

6 6 / 17 Locality Sensitive Hashing for NN Nearest Neighbor search algorithm based on LSH functions Collision probability corresponds to closeness Collision Probability is a pseudo-distance function Only consider colliding vectors Exhaustively search colliding vectors

7 LSH Query Multiple Times Hashtable of Near Neighbors Datapoint Labels Query R-project vectors to LSH dim Compute Lattice Hash Create Sequence of near neighbors Compute Real Distances and Sort 7 / 17

8 8 / 17 An Example Hash Family 0 w 1 w 2 3 w r Figure: Random Projection of R 2 R 1

9 9 / 17 ECC as Hash Functions The goal of ECCs are the same as a good LSH function Send data over a noisy channel and correct the errors Only a subset of data is valid (codes/hashes) Codes correct within an error correcting radius Shannon s Limit (optimal codes exist, but are complex) Fast implementation (not as important as cpu clock>>channel capacity) Need to find a compromise between decoding complexity and the native dimensionality of the ECC

10 Comparison of ECC Decoders Error Correcting Performance ( Samples) 1 Probability of Bit Error - log10(pb) Leech hex -2db and QAM16 E8 2PSK Unencoded QAM64 Unencoded QAM EB/NO (db) Leech Lattice partitions R 24, in 519 operations 10 / 17

11 11 / 17 Parallel Λ 24 Decoder Block Decoding Thread Block Global Mem r[0:12] r[12:24] r[0:12] r[12:24] QAM(A_pts) QAM(A_pts) QAM(B_pts) QAM(B_pts) d_ij[0:12] d_ij[12:24] d_ij[0:12] d_ij[12:24] Block Conf(even) Block Conf(odd) Block Conf(even) Block Conf(odd) mues prefrepe muos prefrepo mues prefrepe muos prefrepo MinH6 MinH6 MinH6 MinH6 weight y weight y weight y weight y H_parity H_parity H_parity H_parity weight Codeword weight Codeword weight Codeword weight Codeword K_parity K_parity K_parity K_parity weight_ae Codeword_AE weight_ao Codeword_AO weight_be Codeword_Be weight_bo Codeword_BO Compare weights

12 12 / 17 GPU tweak: Avoid Branching Warps (sets of 16 threads) run as SIMD Branches cause a warp to be rescheduled and run again 99998% chance for uniform boolean test, ( ) Conversion provides a 68% Speedup for our algorithm Conditional Code IF dista<distb: ELSE: d ij [4i]=distA d ijk [4i]=distB kparities[4i]=0 d ij [4i]=distB d ijk [4i]=distA kparities[4i]= 1 Sequential Access Code d = dista<distb d ij [4i]=distA*d+distB*( d) d ijk [4i]=distB*d+distA*( d) kparities[4i] = ( d)

13 13 / 17 Results For SIFT Vectors Comparison of Time for LSH and Linear Search Linear Search Times LSH Search Times 50 Average Time (s) Number of Images

14 14 / 17 Result Extrapolation For Larger DBs Extrapolated Linear Search Extrapolated Comparison of Time for LSH and Linear Search (Semilog) Average Time Extrapolated Linear Search y= 045x Number of Images Extrapolated LSH Search Average Time log-scale Average Time Extrapolated LSH y= 015x^ Number of Images 10-1 Extrapolated Linear Search Extrapolated LSH Number of Images

15 Ratio of Sequential To Parallel Code 094 Query Database: Parallel/Total Compute Time for SIFT Samples Ratio Par/Seq SIFT sample data samples avg=09340, R2= e Samples 15 / 17

16 Scaled Speedup Results 30 Query: Total Speedup as a function of Parallel Speedup 25 Total Speedup Parallelism Our Theoretical Speedup (934%) 90% Serial Code 95% Serial Code 97% Serial Code Theoretical Speedup Parallel Speedup 16 / 17

17 17 / 17 Conclusions GPU parallel implementations of lattice hashing accelerates the NN algorithm Required sequential parts of the LSH algorithm will always cause bottlenecks Within the processor count range of some medium embedded applications (UAVs, Cellphones, Surveillance Systems) speedup scales Tweaking parts of the LSH algorithm for the GPU Random Projection Calculation (bulk or per query) Hash list sorting and intersection (per-query) Linear searching of candidate NNs (for exact NN)

Geometric data structures:

Geometric data structures: Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade Sham Kakade 2017 1 Announcements: HW3 posted Today: Review: LSH for Euclidean distance Other