A GPGPU Algorithm for c-approximate r-nearest Neighbor Search in High Dimensions
Lee A. Carraher, Philip A. Wilsey, and Fred S. Annexstein
School of Electronic and Computing Systems, University of Cincinnati
May 24, 2013
Parallel Nearest Neighbor Search
- An approximate nearest neighbor algorithm in parallel
- GPGPU implementation of the Leech lattice decoder for LSH NN
- Other non-sequential parts of LSH NN in parallel (projection, sorting)
- Optimized for the CUDA architecture
Motivations
- NN is an intuitive, powerful basis for machine learning algorithms
- Approximate answers are good enough for big data
- The GPU provides many benefits and challenges:
  - High ALU-to-control-logic allocation ratio
  - Remapping for data-intensive processing
  - Memory-transfer bottlenecks (no MMU or cache)
  - Getting the working set to fit in shared memory
NN, KNN, c-approx NN, cr-approx NN
- NN problem: given a query vector, return the nearest vector in a set of vectors by minimizing some distance function
- KNN is NN where the K nearest vectors are returned (NN is KNN with K = 1)
- c-approx NN returns a neighbor within a constant factor c = 1 + ε of the optimal NN distance
- cr-approx NN extends c-approx NN by returning all the vectors within a radius r of the query vector
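In symbols, with dataset P, query q, and distance function D, one standard formalization of the two approximate variants is:

    p* = argmin_{p ∈ P} D(q, p)                                (exact NN)
    c-approx:  return p' ∈ P with D(q, p') ≤ c · D(q, p*),  c = 1 + ε
    cr-approx: return the points p ∈ P with D(q, p) ≤ r
               (misses and extras are tolerated out to distance c · r)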
Problems With Other Parallel Search
GPU linear search:
- Fast, simple implementation
- Linear search complexity, and only linear speedup
- All vectors must reside in GPU main memory
GPU kd-tree search:
- Vectors do not have to be in GPU main memory
- Works well for d < 20
- Parallel DFS has issues scaling
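For reference, a minimal CUDA sketch of the linear-search baseline (the kernel name and launch configuration are illustrative, not from the paper): one thread per database vector computes a squared Euclidean distance, and the host then reduces to the minimum.

    // One thread per database vector: squared L2 distance to the query.
    __global__ void l2_distances(const float *db, const float *query,
                                 float *dist, int n, int dim) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float acc = 0.0f;
        for (int d = 0; d < dim; ++d) {
            float diff = db[i * dim + d] - query[d];
            acc += diff * diff;
        }
        dist[i] = acc;   // host (or a reduction kernel) takes the min
    }

    // Launch with db, query, and dist already resident in device memory:
    // l2_distances<<<(n + 255) / 256, 256>>>(db, query, dist, n, dim);

This makes the slide's caveats concrete: db must fit entirely in device memory, and the work is Θ(n · dim) per query no matter how many threads run.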
Locality Sensitive Hashing for NN
- Nearest neighbor search algorithm based on LSH functions
- Collision probability corresponds to closeness
- Collision probability is a pseudo-distance function
- Only colliding vectors are considered
- Colliding vectors are searched exhaustively
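This is the standard locality-sensitive property: a family H is (r, cr, p1, p2)-sensitive for distance D when, for h drawn uniformly from H,

    D(p, q) ≤ r    ⇒  Pr[h(p) = h(q)] ≥ p1
    D(p, q) ≥ c·r  ⇒  Pr[h(p) = h(q)] ≤ p2,    with p1 > p2

The gap p1 > p2 is exactly what lets collision probability act as a pseudo-distance.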
LSH Query Pipeline
1. Randomly project the query vector down to the LSH dimension
2. Compute the lattice hash of the projection
3. Look up colliding datapoint labels in the hashtable of near neighbors (querying multiple hash tables)
4. Compute real distances to the candidates and sort
A host-side sketch of this flow follows.
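A hedged host-side C++ sketch of one query (step numbers match the list above). Here lattice_hash is a simplified one-dimensional stand-in for the paper's 24-dimensional random projection plus Leech decoding, and all names are illustrative.

    #include <cstdint>
    #include <cmath>
    #include <vector>
    #include <random>
    #include <algorithm>
    #include <unordered_map>

    using Vec = std::vector<float>;

    static float l2(const Vec &a, const Vec &b) {            // real distance (step 4)
        float s = 0.0f;
        for (size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(s);
    }

    // Stand-in hash: one Gaussian projection, buckets of width w. The paper
    // instead projects to R^24 and hashes with the GPU Leech lattice decoder.
    static uint64_t lattice_hash(const Vec &v, unsigned table, float w = 4.0f) {
        std::mt19937 gen(table);                             // per-table projection
        std::normal_distribution<float> g(0.0f, 1.0f);
        float dot = 0.0f;
        for (float x : v) dot += g(gen) * x;                 // steps 1-2
        return (uint64_t)(int64_t)std::floor(dot / w);
    }

    std::vector<int> query(const Vec &q, const std::vector<Vec> &db,
            const std::vector<std::unordered_multimap<uint64_t, int>> &tables) {
        std::vector<int> cand;
        for (unsigned t = 0; t < tables.size(); ++t) {       // query multiple times
            auto r = tables[t].equal_range(lattice_hash(q, t));
            for (auto it = r.first; it != r.second; ++it)
                cand.push_back(it->second);                  // step 3: colliding labels
        }
        std::sort(cand.begin(), cand.end());                 // dedupe across tables
        cand.erase(std::unique(cand.begin(), cand.end()), cand.end());
        std::sort(cand.begin(), cand.end(),                  // step 4: real distances
                  [&](int a, int b) { return l2(db[a], q) < l2(db[b], q); });
        return cand;
    }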
An Example Hash Family
Figure: Random projection of R^2 onto R^1; the projected line is divided into contiguous buckets of width w, labeled 0 through 3, with query radius r marked.
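The figure depicts what appears to be the standard p-stable construction (w and r are the figure's own labels; a and b are the usual extra parameters, an assumption here):

    h_{a,b}(v) = ⌊ (a · v + b) / w ⌋,   a ~ N(0, 1)^d,   b ~ U[0, w)

Each hash projects v onto a random line and reports which width-w segment it falls into, so points within distance r of each other usually share a bucket.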
ECCs as Hash Functions
- The goal of an ECC is the same as that of a good LSH function: send data over a noisy channel and correct the errors
- Only a subset of points are valid (codewords/hashes), so decoding any input to its nearby codeword acts as a hash
- Codes correct within an error-correcting radius
- Shannon's limit (optimal codes exist near it, but they are complex)
- Fast implementation (less important here, since CPU clock >> channel capacity)
- Need to find a compromise between decoding complexity and the native dimensionality of the ECC
Comparison of ECC Decoders
Figure: Error-correcting performance (3,072,000 samples), probability of bit error vs. Eb/N0 (dB), comparing the Leech hexacode decoder (−2 dB) with QAM16, E8 with 2-PSK, unencoded QAM64, and unencoded QAM16.
The Leech lattice partitions R^24 and decodes in 519 operations.
Parallel Λ24 Decoder Block
Diagram: one decoding thread block per input. The received vector r is read from global memory as halves r[0:12] and r[12:24], and four candidate decoders run side by side (A and B lattice points, each with even and odd K-parity). Each path computes QAM point distances d_ij for both halves, forms block confidence values (even/odd) with their preferable representations, minimizes over the hexacode (MinH6), and applies the H-parity and K-parity checks to yield a candidate codeword and weight (weight_ae/codeword_AE, weight_ao/codeword_AO, weight_be/codeword_BE, weight_bo/codeword_BO). A final stage compares the four weights, as sketched below.
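A hedged CUDA sketch of just that final "compare weights" stage, assuming the four quarter-decoders have already produced their candidate codewords and weights (array names and the one-thread-per-candidate layout are illustrative, not the paper's):

    __global__ void compare_weights(const float *w_in, const unsigned *cw_in,
                                    unsigned *out) {
        __shared__ float    weight[4];    // weight_ae, weight_ao, weight_be, weight_bo
        __shared__ unsigned codeword[4];  // codeword_AE, _AO, _BE, _BO
        int k = threadIdx.x;              // one thread per candidate decoder here
        if (k < 4) { weight[k] = w_in[k]; codeword[k] = cw_in[k]; }
        __syncthreads();
        if (k == 0) {                     // keep the minimum-weight candidate
            int best = 0;
            for (int j = 1; j < 4; ++j)
                if (weight[j] < weight[best]) best = j;
            *out = codeword[best];        // the decoded Λ24 point / hash value
        }
    }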
GPU Tweak: Avoid Branching
- Warps (sets of 16 threads) run as SIMD
- Branches cause a warp to be rescheduled and run again
- A uniform boolean test diverges with probability 1 − 2⁻¹⁶ ≈ 99.998%
- The branch-free conversion provides a 68% speedup for our algorithm

Conditional code:
    if (distA < distB) {
        d_ij[4*i]      = distA;
        d_ijk[4*i]     = distB;
        kparities[4*i] = 0;
    } else {
        d_ij[4*i]      = distB;
        d_ijk[4*i]     = distA;
        kparities[4*i] = 1;
    }

Sequential-access (branch-free) code:
    int d = distA < distB;                    // 1 if A is closer, else 0
    d_ij[4*i]      = distA * d + distB * !d;  // select without branching
    d_ijk[4*i]     = distB * d + distA * !d;
    kparities[4*i] = !d;
Results For SIFT Vectors
Figure: Comparison of average search time (s) for linear search vs. LSH search over SIFT vectors, for databases of up to 800 images.
Result Extrapolation For Larger DBs
Figure: Extrapolated comparison of average search time for linear search and LSH out to 1,000 images (fits: linear search y = .045x + .005; LSH y = .015x^.3671 + .006), with a semilog plot extending both extrapolations to 100,000 images.
Ratio of Sequential To Parallel Code
Figure: Parallel/total compute time per query over 300 SIFT samples; the ratio is essentially constant, with average 0.9340 (R² = 2.832e−05).
Scaled Speedup Results
Figure: Total query speedup as a function of parallel-section speedup, plotting our theoretical curve (93.4% parallel code) against reference curves for 90%, 95%, and 97% parallel code, plus the limiting theoretical speedup.
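These curves follow Amdahl's law. With parallel fraction p and parallel-section speedup s,

    S(s) = 1 / ((1 − p) + p / s),   so with p = 0.934:  S(∞) = 1 / 0.066 ≈ 15.2

which is why the total speedup flattens out even as the parallel speedup keeps growing.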
Conclusions
- The GPU-parallel implementation of lattice hashing accelerates the NN algorithm
- The required sequential parts of the LSH algorithm will always be a bottleneck
- Speedup scales within the processor-count range of medium embedded applications (UAVs, cellphones, surveillance systems)
- Parts of the LSH algorithm still to be tweaked for the GPU:
  - Random projection calculation (bulk or per query)
  - Hash list sorting and intersection (per query)
  - Linear search of candidate NNs (for exact NN)