Exercise Session 6: Data Processing on Modern Hardware, Fall Semester 2012 (Cagri Balkesen)


Exercise Session 6
Data Processing on Modern Hardware, Fall Semester 2012
Cagri Balkesen, cagri.balkesen@inf.ethz.ch
Department of Computer Science, ETH Zurich, Switzerland
25 October 2012

Flynn's Taxonomy

Computer architecture classification according to Flynn [Fly72]:

  SISD: Single Instruction, Single Data Stream
  SIMD: Single Instruction, Multiple Data Streams
  MISD: Multiple Instruction, Single Data Stream
  MIMD: Multiple Instruction, Multiple Data Streams

Figure (SIMD): one control unit sends the same instructions to data processing units #1 to #4; each unit produces its own result.


Early SIMD Machines

CDC STAR-100
  Released 1974
  Vector supercomputer supporting memory-to-memory vector operations

Cray X-MP/28 (CAB)
  Introduction 1982
  Word length: 64 bit
  Memory: 8M words
  Register-to-register vector operations
  8 vector registers with up to 64 words each

Image: ETH Cray X-MP/28 (CAB)

Variable Vector Length on a Cray X-MP

Example: vector integer addition of the first 53 elements of two vector registers (V1, V2, V3: vector registers):

  V3[0:52] ← V1[0:52] + V2[0:52]

  Line  Instr.         Description
  1     A1 ← 53        Set address register A1 to 53
  2     VL ← A1        Set vector length to 53
  3     V3 ← V1+V2     Perform addition

Latency of add: VL + 8 = 61 cycles (1 cycle = 9.5 ns)

Figure 5.1. The vector section of the Cray X-MP (basic operation of the vector section). Source: [RR89].
Figure notes: The Vector Pop/Parity unit shares its input path with the Reciprocal Approximation unit. The Second Vector Logical unit shares both its input and output paths with the Floating Point Multiply unit.

Vector Units in Modern CPUs

PowerPC
  AltiVec (Motorola/Freescale), VMX (IBM), Velocity Engine (Apple)
  PowerPC G4, G5 and Cell BE
  128-bit vector registers
  Dedicated vector unit

UltraSPARC
  VIS: Visual Instruction Set
  Uses 64-bit FPU registers

AMD
  3DNow! (since AMD K6-2)
  Integer + single precision floating point

Intel
  MMX
    Since Pentium MMX
    8 64-bit registers (aliased to the FPU stack)
    Before Pentium: no vector unit
  SSE to SSE4
    Introduced in Pentium III
    Dedicated vector unit, combined with MMX
  AVX
    256-bit registers
    3-operand, non-destructive instructions

Intel SSE

SSE
  Since Pentium III
  8 128-bit vector registers xmm0, ..., xmm7
  Single precision FP
  Dedicated vector unit, but shares resources with the FPU

SSE2
  Since Pentium IV
  Double precision FP
  Extends MMX registers to 128 bit
  Full integer support on XMM registers (without using MMX registers)

SSE3
  Since Pentium IV (Prescott)
  New horizontal operations

SSSE3
  Since Core 2 (Merom)
  New permutation instructions

SSE4
  Since Core 2 (Penryn)
  Dot product

x86-64
  Adds additional vector registers xmm8, ..., xmm15

AVX
  256 bits
  ymm8, ..., ymm15

SSE2 Vector Registers

128-bit vector register (16 bytes)

Integer data types
  16 byte elements:         b15 b14 b13 b12 b11 b10 b9 b8 b7 b6 b5 b4 b3 b2 b1 b0
  8 word elements:          w7 w6 w5 w4 w3 w2 w1 w0
  4 double word elements:   dw3 dw2 dw1 dw0
  2 quad word elements:     qw1 qw0

Floating point data types
  4 float elements:         f3 f2 f1 f0
  2 double elements:        d1 d0

Assignment: Warm-up Exercise

Download the skeleton C code from the course website.

compareandcount.c does the following:
  Fills an array with 100 million numbers in [0, 99]
  Counts how many values are > 42

Implement a SIMD-accelerated version using SSE intrinsics.
Measure the speed-up.

Caveats:
  The result of a SIMD comparison is not 1 if true, but all 1s (0xFFFFFFFF); it is 0 if false.
  In order to count, either shift or exploit that 0xFFFFFFFF = -1.

Assignment: Column Value (De)compression

The file compression.c implements the following compression schemes:
  32-to-8 bit
  32-to-9 bit
  32-to-7 bit
Use C macros to switch between versions.
Serial decompression is implemented.
Execution time is measured and decompressed values are validated.

Implement the following functions using SSE intrinsics:
  SIMD_decompress8to32(...)
  SIMD_decompress9to32(...)
  SIMD_decompress7to32(...)

Decompression Step 1: Copy Values

Step 1: Bring data into proper 32-bit words.

  Input (packed values):   v13 v12 v11 v10 v9 v8 v7 v6 v5 v4 v3 v2 v1 v0
  Shuffle mask:            FF FF 4 3 | FF FF 3 2 | FF FF 2 1 | FF FF 1 0
  Output (32-bit words):   v3 v2 v1 v0

Use shuffle instructions to move bytes within SIMD registers:

    __m128i out = _mm_shuffle_epi8(in, shufmask);

Decompression Step 2: Establish Same Bit Alignment

Step 2: Make all four words identically bit-aligned.

  Word                 v3       v2       v1       v0
  Bit offset before    3 bits   2 bits   1 bit    0 bits
  Shift by             0 bits   1 bit    2 bits   3 bits
  Bit offset after     3 bits   3 bits   3 bits   3 bits

SIMD shift instructions do not support variable shift amounts!

Decompression Step 3: Shift and Mask

Step 3: Word-align data and mask out invalid bits.

Figure: the four words v3 v2 v1 v0 before the shift, after the shift, and after masking.

    __m128i shifted = _mm_srli_epi32(in, 3);
    __m128i result = _mm_and_si128(shifted, maskval);

Useful Collection of SSE Intrinsics

  Intrinsic function              Description
  _mm_loadu_si128(src)            Load data from memory into register
  _mm_storeu_si128(dest, reg)     Store data back in memory
  _mm_set_epi32(v0,v1,v2,v3)      Load four 32-bit integers into register
  _mm_set_epi8(v0,...,v15)        Load sixteen 8-bit integers into register
  _mm_cmpgt_epi32(reg1, reg2)     Greater-than compare of the four 32-bit values in the registers
  _mm_add_epi32(reg1, reg2)       Addition of the four 32-bit values in the registers
  _mm_and_si128(reg, mask)        Bitwise AND of two registers (masking)
  _mm_mullo_epi32(reg1, reg2)     Multiplication of the four 32-bit values in the registers
  _mm_extract_epi32(reg, pos)     Extract 32-bit integer at position pos from register
  _mm_shuffle_epi32(reg, mask)    Shuffle 32-bit integers according to the shuffle mask
  _mm_srli_si128(reg, bytecnt)    Shift entire register right (bytewise)
  _mm_slli_si128(reg, bytecnt)    Shift entire register left (bytewise)
  _mm_srli_epi32(reg, bitcnt)     Shift 32-bit integers in register to the right (bitwise)
  _mm_slli_epi32(reg, bitcnt)     Shift 32-bit integers in register to the left (bitwise)

Refer to the intrinsics documentation for details.

Auto-Vectorization in gcc

Recent versions of gcc can auto-vectorize C code.

Use command line options:
  -ftree-vectorize               turn on auto-vectorization (default for -O3)
  -ftree-vectorizer-verbose=x    set reporting verbosity level of the vectorizer
  -msse                          generate SSE code
  -msse2                         generate SSE2 code
  -msse3                         generate SSE3 code

gcc does not vectorize if
  the code contains branches,
  unconstrained pointers are used (aliasing),
  the loop is uncountable.

Auto-Vectorization of Volcano Column-Store

Implementation of the arithmetic operator in engine.c:

    int next_arithop(int32_t *tuples, arithop_t *op)
    {
        ...
        switch (op->operation) {
        case ADD:
            for (n = 0; n < minnum; n++)
                tuples[n] = op->left_input[n] + op->right_input[n];
            break;
        case SUB:
            for (n = 0; n < minnum; n++)
                tuples[n] = op->left_input[n] - op->right_input[n];
            break;
        ...

Compilation:

    $ gcc -m64 -O3 -msse2 -ftree-vectorize -ftree-vectorizer-verbose=3 -c engine.c
    ...
    engine.c:104: note: not vectorized: unhandled data-ref
    engine.c:109: note: not vectorized: unhandled data-ref
    ...
    engine.c:95: note: vectorized 0 loops in function.

Auto-vectorization failed.
Cause: the inputs are accessed through the op pointer; possible aliasing?
Fix: use restrict pointers.

Vectorized Volcano Column-Store using SSE

Using restricted pointers in loops:

    int next_arithop(int32_t *tuples, arithop_t *op)
    {
        ...
        int32_t* restrict l = op->left_input;
        int32_t* restrict r = op->right_input;
        ...
        switch (op->operation) {
        case ADD:
            for (n = 0; n < minnum; n++)
                tuples[n] = l[n] + r[n];
            break;
        case SUB:
            for (n = 0; n < minnum; n++)
                tuples[n] = l[n] - r[n];
            break;
        ...

Compilation:

    $ gcc -m64 -O3 -msse2 -ftree-vectorize -ftree-vectorizer-verbose=3 -c engine.c
    ...
    engine.c:113: note: Alignment of access forced using peeling.
    engine.c:113: note: LOOP VECTORIZED
    engine.c:121: note: Alignment of access forced using peeling.
    engine.c:121: note: LOOP VECTORIZED
    ...
    engine.c:102: note: vectorized 5 loops in function.

Successful vectorization.

Speedup: SSE Vectorization of Volcano Column-Store

Query: SELECT sum(orderkey+linenumber*shipdate) FROM lineitems
Data set: 6 million rows
CPU: Core 2 Quad
Runtime in ms was compared for the non-vectorized build and for gcc auto-vectorization.
Speedup = 1.10

The load is memory-bound, not CPU-bound.

Hand in your Results

  Subject = DPMH:assignment4 {your netzname}
  Body = description of the CPU you tested on, e.g., Xeon QuadCore L5520 (4x 2267 MHz)
  Attach plots + raw data
  Attach source code (optional)

References

[Fly72] Michael J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, 21(9), September 1972.
[RR89] Kay A. Robbins and Steven Robbins. The Cray X-MP/Model 24, chapter 5. Springer LNCS, 1989.
[WPB+09] Thomas Willhalm, Nicolae Popovici, Yazan Boshmaf, Hasso Plattner, Alexander Zeier, and Jan Schaffner. SIMD-scan: Ultra fast in-memory table scan using on-chip vector processing units. PVLDB, 2(1), 2009.
[ZR02] Jingren Zhou and Kenneth A. Ross. Implementing database operations using SIMD instructions. In SIGMOD '02, Madison, Wisconsin, USA, 2002.
