A Hybrid Implementation of Hamming Weight


Enric Morancho (UPC)
Computer Architecture Department
Universitat Politècnica de Catalunya, BarcelonaTech, Barcelona, Spain
22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing
Torino, Italy, Feb. 12th-14th, 2014

Outline
1. Introduction
2. Algorithms for computing Hamming weight
3. Evaluation of existing implementations
4. Proposed hybrid implementation
5. Conclusion and future work


Introduction: What is Hamming weight?
- The Hamming weight of a bitstring is the number of bits set to one in the bitstring.
- Hamming weight is also known as population count, sideways addition, or bit counting.
- Applications: cryptography, chemical informatics, information theory.
- Bitstring lengths range up to several thousand bits.

Introduction: Algorithms for computing Hamming weight
- Several algorithms have been proposed: naïve, memoization, parallel reduction, merged parallel reduction, bitslicing, ...
- Some algorithms admit both scalar and vector implementations.
- However, existing implementations expose either scalar parallelism or vector parallelism, not both.
- This work proposes a hybrid scalar-vector implementation:
  - It exposes both kinds of parallelism simultaneously.
  - It is useful on platforms that can exploit both kinds of parallelism simultaneously.


Existing algorithms: Naïve
- Iterates through the bits of the bitstring and accumulates each bit value.
- Can be specialized to deal with sparse/dense bitstrings.
- Performs poorly because it exploits no parallelism.

    uint8_t hw_naive(uint32_t w) {
        uint8_t i, cnt = 0;
        /* Add the lowest bit, then shift the next bit into its place. */
        for (i = 0; i < 32; i++, w = w >> 1)
            cnt += w & 0x1;
        return cnt;
    }
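The slide does not show the sparse specialization it mentions. One common way to obtain it, given here only as a sketch and not necessarily the variant the author has in mind, is to clear the lowest set bit on every iteration, so the loop runs once per set bit instead of once per bit position. The name hw_sparse is hypothetical.

```c
#include <stdint.h>

/* Sparse-oriented variant (hypothetical name, not from the slides):
 * the loop body executes once per set bit, so it is cheap when few
 * bits are set. */
uint8_t hw_sparse(uint32_t w)
{
    uint8_t cnt = 0;
    while (w != 0) {
        w &= w - 1;   /* clears the lowest set bit of w */
        cnt++;
    }
    return cnt;
}
```

A dense-oriented counterpart can apply the same loop to the complemented word and subtract the result from 32.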

Existing algorithms: Memoization
- Steps:
  - Define a subword size (e.g. 8 bits).
  - Precompute the Hamming weight of all possible subwords.
  - Look up the precomputation table for each subword of the bitstring and accumulate the results.
- Admits both scalar and vector implementations.
- Exposes more parallelism than the naïve implementation.

    uint8_t T8[256] = {0, 1, 1, 2, ..., 7, 8};

    uint8_t hw_memoization8(uint32_t w) {
        /* One table lookup per byte of the 32-bit word. */
        return T8[w & 0xFF] + T8[(w >> 8) & 0xFF]
             + T8[(w >> 16) & 0xFF] + T8[w >> 24];
    }
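The slide elides the contents of the lookup table. As a sketch of the precomputation step, the table can be filled at start-up with the recurrence hw(i) = (i & 1) + hw(i >> 1). The helper name hw_init_table8 is an assumption made for this example; the slides initialize T8 statically.

```c
#include <stdint.h>

static uint8_t T8[256];

/* Hypothetical helper: fills T8 so that T8[i] is the Hamming weight of i.
 * Uses hw(i) = (lowest bit of i) + hw(i with the lowest bit shifted out). */
void hw_init_table8(void)
{
    T8[0] = 0;
    for (int i = 1; i < 256; i++)
        T8[i] = (uint8_t)((i & 1) + T8[i >> 1]);
}

/* Same lookup as on the slide: one table access per byte of the word. */
uint8_t hw_memoization8(uint32_t w)
{
    return (uint8_t)(T8[w & 0xFF] + T8[(w >> 8) & 0xFF]
                   + T8[(w >> 16) & 0xFF] + T8[w >> 24]);
}
```

hw_init_table8 must run once before the first call to hw_memoization8.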

Existing algorithms: Parallel reduction at bit level
- Tree reduction of the input word in log2(bits per word) levels.
- Admits both scalar and vector implementations.
[Figure: an input word followed by three parallel-reduction levels]

    uint32_t hw_parallel(uint32_t w) {
        w = (w & 0x55555555) + ((w >>  1) & 0x55555555); /* Level 1 */
        w = (w & 0x33333333) + ((w >>  2) & 0x33333333); /* Level 2 */
        w = (w & 0x0F0F0F0F) + ((w >>  4) & 0x0F0F0F0F); /* Level 3 */
        w = (w & 0x00FF00FF) + ((w >>  8) & 0x00FF00FF); /* Level 4 */
        w = (w & 0x0000FFFF) + ((w >> 16) & 0x0000FFFF); /* Level 5 */
        return w;
    }
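The evaluation later in the talk applies parallel reduction to 64-bit words. A 64-bit version is not shown on the slides; the sketch below assumes it follows the same pattern, with the masks widened and a sixth level added (log2(64) = 6). The name hw_parallel64 is hypothetical.

```c
#include <stdint.h>

/* 64-bit parallel reduction (hypothetical name): level k adds adjacent
 * fields of 2^(k-1) bits, so six levels reduce 64 one-bit fields to a
 * single 64-bit count. */
uint64_t hw_parallel64(uint64_t w)
{
    w = (w & 0x5555555555555555ULL) + ((w >>  1) & 0x5555555555555555ULL);
    w = (w & 0x3333333333333333ULL) + ((w >>  2) & 0x3333333333333333ULL);
    w = (w & 0x0F0F0F0F0F0F0F0FULL) + ((w >>  4) & 0x0F0F0F0F0F0F0F0FULL);
    w = (w & 0x00FF00FF00FF00FFULL) + ((w >>  8) & 0x00FF00FF00FF00FFULL);
    w = (w & 0x0000FFFF0000FFFFULL) + ((w >> 16) & 0x0000FFFF0000FFFFULL);
    w = (w & 0x00000000FFFFFFFFULL) + ((w >> 32) & 0x00000000FFFFFFFFULL);
    return w;
}
```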

Existing algorithms: Merged parallel reduction (or tree merging)
- Deals with bitstrings larger than a word.
- Merges the intermediate results of several parallel reductions and keeps processing just the combined result.
- The degree of merging is limited by the widths of the accumulators.
- Admits both scalar and vector implementations.
- Example: merged parallel reduction of 3 words (wa, wb, wc)

    wa = (wa & 0x55555555) + ((wa >> 1) & 0x55555555); /* Level 1 */
    wb = (wb & 0x55555555) + ((wb >> 1) & 0x55555555);
    wa = wa + ( wc       & 0x55555555);
    wb = wb + ((wc >> 1) & 0x55555555);
    wa = (wa & 0x33333333) + ((wa >> 2) & 0x33333333); /* Level 2 */
    wb = (wb & 0x33333333) + ((wb >> 2) & 0x33333333);
    wa = wa + wb;
    wa = (wa & 0x0F0F0F0F) + ((wa >> 4) & 0x0F0F0F0F); /* Level 3 */
    ...
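The slide's example stops after level 3. The sketch below completes it into a function, assuming the usual alternating masks (0x55555555 at level 1, 0x33333333 at level 2) and finishing with the same levels 4 and 5 as plain parallel reduction. The name hw_merged3 is hypothetical; the comments track the largest value a field can hold, which is what limits the degree of merging.

```c
#include <stdint.h>

/* Merged parallel reduction of three 32-bit words (hypothetical name). */
uint32_t hw_merged3(uint32_t wa, uint32_t wb, uint32_t wc)
{
    /* Level 1: 2-bit fields hold 0..2 */
    wa = (wa & 0x55555555u) + ((wa >> 1) & 0x55555555u);
    wb = (wb & 0x55555555u) + ((wb >> 1) & 0x55555555u);
    /* Merge wc: its even bits go to wa, its odd bits to wb (fields 0..3) */
    wa = wa + ( wc       & 0x55555555u);
    wb = wb + ((wc >> 1) & 0x55555555u);
    /* Level 2: 4-bit fields hold 0..6 */
    wa = (wa & 0x33333333u) + ((wa >> 2) & 0x33333333u);
    wb = (wb & 0x33333333u) + ((wb >> 2) & 0x33333333u);
    /* Merge wb into wa: 4-bit fields hold 0..12, still below 16 */
    wa = wa + wb;
    /* Levels 3 to 5 on the single combined word */
    wa = (wa & 0x0F0F0F0Fu) + ((wa >>  4) & 0x0F0F0F0Fu);
    wa = (wa & 0x00FF00FFu) + ((wa >>  8) & 0x00FF00FFu);
    wa = (wa & 0x0000FFFFu) + ((wa >> 16) & 0x0000FFFFu);
    return wa;
}
```

Compared with three independent reductions, levels 3 to 5 run once instead of three times.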


Existing algorithms: Bitslicing
- Transforms a (2^n - 1)-word bitstring into n words while preserving the (weighted) Hamming weight of the original bitstring.
- The implementation relies on the parallel emulation of bits_per_word bit adders by using bit-wise logical instructions.
- Admits both scalar and vector implementations.

\[ \sum_{i=0}^{2^n-2} hw(w_i) = \sum_{j=0}^{n-1} 2^j \, hw(s_j) \]
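For n = 3, the transformation above maps 7 words to 3. A minimal sketch, assuming the bit adders are emulated with carry-save adders (a standard choice; the slides do not show the adder network) and assuming a GCC/Clang compiler for the __builtin_popcountll call used in the final step. The names csa, hw_slice7 and hw_7words are hypothetical.

```c
#include <stdint.h>

/* Carry-save adder applied to every bit position at once:
 * a + b + c == 2*hi + lo, bit by bit. */
static void csa(uint64_t a, uint64_t b, uint64_t c,
                uint64_t *hi, uint64_t *lo)
{
    uint64_t u = a ^ b;
    *hi = (a & b) | (u & c);
    *lo = u ^ c;
}

/* Reduces 7 words to 3 slices s[0..2]; bit position p of (s[2] s[1] s[0])
 * is the 3-bit count of ones at position p across the 7 input words. */
void hw_slice7(const uint64_t w[7], uint64_t s[3])
{
    uint64_t h0, l0, h1, l1, h2;
    csa(w[0], w[1], w[2], &h0, &l0);
    csa(w[3], w[4], w[5], &h1, &l1);
    csa(l0, l1, w[6], &h2, &s[0]);   /* weight-1 slice */
    csa(h0, h1, h2, &s[2], &s[1]);   /* weight-4 and weight-2 slices */
}

/* Applies the slide's identity: sum of hw(w_i) = sum of 2^j * hw(s_j). */
unsigned hw_7words(const uint64_t w[7])
{
    uint64_t s[3];
    hw_slice7(w, s);
    return (unsigned)__builtin_popcountll(s[0])
         + 2u * (unsigned)__builtin_popcountll(s[1])
         + 4u * (unsigned)__builtin_popcountll(s[2]);
}
```

Only three word-level Hamming weights are computed instead of seven; the rest of the work is bit-wise logic.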

Existing algorithms: Processor support
- Some processors offer a machine instruction to compute the Hamming weight of a machine word.
  - For instance: Mark II (1954), IBM Stretch (1961), CDC 6600 (1964), Cray 1 (1976), Sun SPARCv9 (1995), Alpha 21264A (1999), IBM Power5 (2004) and ARM Cortex-A8 (2005).
- Since 2007, x86 processors supporting SSE4.2 offer the popcnt instruction.
  - It computes the Hamming weight of a scalar 32-bit or 64-bit register.
[Table: popcnt latency (cycles) and dispatch rate (inst./cycle); columns: AMD 15h, Intel Nehalem, Sandy Bridge/Haswell; operand widths: 32-bit, 64-bit, 32/64-bit]
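From C, the popcnt instruction is usually reached through a compiler builtin. The sketch below assumes GCC or Clang: __builtin_popcountll compiles to popcnt when the target supports it (for example with -mpopcnt or -msse4.2) and to a software routine otherwise. The name hw_popcnt is hypothetical, and this is not the benchmark code from the talk.

```c
#include <stddef.h>
#include <stdint.h>

/* Hamming weight of a bitstring stored as consecutive 64-bit words,
 * one population count per word (hypothetical name). */
uint64_t hw_popcnt(const uint64_t *words, size_t nwords)
{
    uint64_t total = 0;
    for (size_t i = 0; i < nwords; i++)
        total += (uint64_t)__builtin_popcountll(words[i]);
    return total;
}
```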


Evaluation of existing implementations: Evaluation environment
- Our benchmark computes the Hamming weight of several randomly initialized bitstrings.
- Bitstring words are located in consecutive memory locations.
- We evaluate two scenarios: uncached and cached.

Evaluation of existing implementations: Evaluation environment

                           Intel Core i5-650       Intel Xeon E5-2630L
Microarchitecture          Nehalem                 Sandy Bridge
Frequency (max turbo)      3.2 (3.46) GHz          2 (2.5) GHz
Cores                      2                       6
Reorder buffer entries     128 µ-ops               168 µ-ops
Scheduler entries          36 µ-ops                54 µ-ops
Peak dispatch rate         6 µ-ops/cycle (both)
DL1 size and assoc.        32 KB, 8-way, 64-byte lines (both)
DL1 bandwidth              128 bits/cycle          256 bits/cycle
DL1 in-flight loads
DL1 simultaneous misses    10 (both)
L2                         256 KB, 8-way, 64-byte lines (both)
L3                         4 MB, 16-way, 64 B      15 MB, 20-way, 64 B

Evaluation of existing implementations: Evaluated implementations

Single-word wide implementations
  Naïve      hw_naive implementation
  Mem-8      Memoization, 2^8-entry lookup table
  Mem-16     Memoization, 2^16-entry lookup table
  Par.Red.   Parallel reduction at bit level over 64-bit words
  SSE4.2     Uses the 64-bit scalar instruction popcnt

Multi-word wide implementations
  Merged     Scalar merged par.red. on bit words at level 3
  Merged-V   Vector merged par.red. on bit words at level 3 (SSE2)
  Slice      Scalar bit slicing on 7 64-bit words
  Slice-V    Vector bit slicing on bit words (SSE2)
  Mem-4      Vector memoization, 2^4-entry lookup table (SSSE3)

Evaluation of existing implementations: Results on Nehalem platform, single-word wide, cached [figure]

Evaluation of existing implementations: Results on Nehalem platform, cached [figure]

Evaluation of existing implementations: Results on Sandy Bridge platform, cached [figure]

Evaluation of existing implementations: Results
- SSE4.2 performs best.
- Multi-word wide implementations outperform single-word wide implementations (except SSE4.2).
- Vector implementations outperform scalar implementations of the same algorithm.

Evaluation of existing implementations: Conclusions
- Although the scalar SSE4.2 implementation performs best...
  - The dispatch rate of the popcnt instruction is just 1 inst./cycle; that is, the peak performance of SSE4.2 is 8 bytes/cycle.
  - But DL1 bandwidth is 16 bytes/cycle (Nehalem) and 32 bytes/cycle (Sandy Bridge).
  - The SSE4.2 implementation is fully scalar and cannot exploit the unused dispatch ports to dispatch vector instructions.
- We wonder whether the SSE4.2 implementation can be outperformed by a hybrid implementation that uses both vector and scalar instructions.
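The bandwidth argument above can be written out using only the figures on the slide. The symbols P_popcnt (peak popcnt throughput) and B_DL1 (DL1 bandwidth) are notation introduced for this example, and the block assumes the amsmath package.

```latex
\[
P_{\mathrm{popcnt}}
  = 1\ \frac{\text{inst.}}{\text{cycle}} \times 8\ \frac{\text{bytes}}{\text{inst.}}
  = 8\ \frac{\text{bytes}}{\text{cycle}},
\qquad
\frac{P_{\mathrm{popcnt}}}{B_{\mathrm{DL1}}} =
\begin{cases}
  8/16 = 50\% & \text{(Nehalem)} \\
  8/32 = 25\% & \text{(Sandy Bridge)}
\end{cases}
\]
```

So even at its peak, the scalar popcnt loop can consume at most half of the available DL1 bandwidth on Nehalem and a quarter on Sandy Bridge.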


Proposed hybrid implementation: Design
- Main idea: combine the SSE4.2 (scalar) and Mem-4 (vector) implementations into a hybrid implementation.
  - Distribute the bitstring words between the scalar and the vector functional units.
- Steps:
  - Iterate through the bitstring; each loop iteration processes a fixed-size chunk.
  - Statically distribute the chunk bytes between the scalar and vector functional units.
- Design-space dimensions:
  - Number of chunk bytes processed by the scalar units (S).
  - Number of chunk bytes processed by the vector units (V).
- Design-space exploration:
  - Configurations (S,V) with chunk length up to 80 bytes: (16,16), (32,16), (16,32), (48,16), (32,32), (16,48), (64,16), (48,32), (32,48) and (16,64).
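The design above can be sketched for the (32,32) configuration as follows. This is an illustration under stated assumptions, not the author's code: it assumes GCC or Clang on x86 (for the target attribute and __builtin_popcountll), a CPU with SSSE3 and popcnt, and a bitstring length that is a multiple of the 64-byte chunk. The name hw_hybrid_32_32 is hypothetical. The vector share uses a 2^4-entry lookup through the SSSE3 byte shuffle, as Mem-4 is described in the talk.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

enum { S_BYTES = 32, V_BYTES = 32, CHUNK_BYTES = S_BYTES + V_BYTES };

/* Hybrid (S,V) = (32,32) sketch; nbytes must be a multiple of CHUNK_BYTES. */
__attribute__((target("ssse3,popcnt")))
uint64_t hw_hybrid_32_32(const uint8_t *buf, size_t nbytes)
{
    /* 16-entry table: Hamming weight of every 4-bit value. */
    const __m128i table  = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                         1, 2, 2, 3, 2, 3, 3, 4);
    const __m128i nibble = _mm_set1_epi8(0x0F);
    const __m128i zero   = _mm_setzero_si128();
    __m128i  vtotal = zero;   /* two 64-bit partial sums (vector share) */
    uint64_t stotal = 0;      /* scalar share */

    for (size_t off = 0; off < nbytes; off += CHUNK_BYTES) {
        /* Scalar share: first S_BYTES of the chunk, one popcnt per word. */
        for (size_t i = 0; i < S_BYTES; i += 8) {
            uint64_t w;
            memcpy(&w, buf + off + i, sizeof w);
            stotal += (uint64_t)__builtin_popcountll(w);
        }
        /* Vector share: remaining V_BYTES, two nibble lookups per byte. */
        for (size_t i = 0; i < V_BYTES; i += 16) {
            __m128i v  = _mm_loadu_si128(
                             (const __m128i *)(buf + off + S_BYTES + i));
            __m128i lo = _mm_and_si128(v, nibble);
            __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), nibble);
            __m128i cnt = _mm_add_epi8(_mm_shuffle_epi8(table, lo),
                                       _mm_shuffle_epi8(table, hi));
            /* Sum the 16 byte counts into the two 64-bit lanes. */
            vtotal = _mm_add_epi64(vtotal, _mm_sad_epu8(cnt, zero));
        }
    }
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, vtotal);
    return stotal + lanes[0] + lanes[1];
}
```

The two inner loops have fixed trip counts and no data dependence on each other, so the compiler can unroll them and the processor can issue the popcnt and vector instructions on different dispatch ports within the same cycles. Other (S,V) configurations only change the two constants.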

Design-space exploration: Nehalem platform, uncached [figure]

Design-space exploration: Nehalem platform, cached [figure]

Design-space exploration: Sandy Bridge platform, uncached [figure]

Design-space exploration: Sandy Bridge platform, cached [figure]

Design-space exploration: Conclusions
- Some hybrid configurations outperform SSE4.2.
- The performance potential is bigger on Sandy Bridge than on Nehalem.
- The best hybrid configuration depends on the bitstring length.
- However, we pick only one configuration for each platform: (32,32) for Nehalem and (32,48) for Sandy Bridge.

Results: Sandy Bridge platform, uncached [figure]

Results: Sandy Bridge platform, cached [figure]

Results: Sandy Bridge platform
[Table: speedup of the (32,48) hybrid configuration with respect to SSE4.2, by bitstring length (up to DL1, up to L2, up to L3, >L3), for the uncached and cached scenarios]


Conclusions
- Processors can exploit both scalar and vector parallelism, but applications expose only one kind of parallelism.
  - Some processor resources are not fully exploited.
- Applications that admit both scalar and vector implementations may benefit from a hybrid implementation that exposes both kinds of parallelism simultaneously.
- Case study: Hamming weight
  - The (32,48) hybrid configuration outperforms the best implementation of Hamming weight we are aware of by up to 1.22x on the Sandy Bridge platform.

Future work
- Evaluating this technique on newer platforms (e.g. Haswell).
  - AVX2: vector integer instructions, 256-bit vector registers.
- Applying this technique to other problems.

