An MPSoC for Energy-Efficient Database Query Processing

Slide 1
An MPSoC for Energy-Efficient Database Query Processing
TensilicaDay 2016
Sebastian Haas, Emil Matúš, Gerhard Fettweis
Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. Dr. h.c. G. Fettweis

Slide 2: Introduction
Challenges: 5G Mobile Networks, Mobile Edge Computing, Tactile Internet, Internet of Things
Database
System Requirements: Throughput, Latency, Energy
MPSoC: Application-Specific, Customizable, Energy-Efficient

Slide 3: Database Systems
SQL (Structured Query Language):
    BEGIN
      FOR X IN Loop
        SELECT * WHERE data=x
      END Loop
    END;
Query Processing:
- Big Data: Query Throughput
- Direct interconnection to user and storage: Query Latency
Database Accelerator (DBA)

Slide 4: Base Processor
Tensilica Xtensa LX5 RISC Processor
Local RAM (XLMI): 2x 32 kB data, 1x 32 kB instruction
1 Load-Store unit (LSU)
32 bit data
32 bit instruction

Slide 5: Extended Processor
Tensilica Xtensa LX5 RISC Processor
Database-Specific Instruction Set
Local RAM (XLMI): 2x 32 kB data, 1x 32 kB instruction
2 Load-Store units (LSU)
2x 128 bit data
64 bit FLIX
64 bit instruction
Data Prefetcher

Slide 6: DBA Applications
Merge Sort
Intersection
Hashing: <32 Bit Key>, Bit Selection (32 Bit to n Bit), Shuffle Network, Result (n Bit)
Bitmap Index Compression: *0 1*1 5*0 1*0 2*1 4*0 1*1 1*0 2*0 1*1
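Intersection is listed among the DBA applications without further detail. As a rough illustration of the primitive only, the following plain C sketch intersects two sorted, duplicate-free arrays with a two-pointer merge. The function name `intersect_sorted` and the example values are assumptions for illustration; this is not the TIE-accelerated implementation from the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Intersect two ascending, duplicate-free arrays a[0..na) and b[0..nb).
 * Common elements are written to out (which needs room for min(na, nb)
 * values); the number of elements written is returned.
 * Hypothetical helper, not part of the original slides. */
size_t intersect_sorted(const uint32_t *a, size_t na,
                        const uint32_t *b, size_t nb,
                        uint32_t *out)
{
    size_t i = 0, j = 0, k = 0;

    assert(out != NULL);
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            i++;                 /* a is behind: advance a */
        } else if (a[i] > b[j]) {
            j++;                 /* b is behind: advance b */
        } else {
            out[k++] = a[i];     /* match: emit once, advance both */
            i++;
            j++;
        }
    }
    return k;
}
```

Calling it with `{2, 4, 9, 10, 15}` and `{2, 11, 14, 15}` writes `{2, 15}` and returns 2. The data-dependent branches in this scalar loop are the kind of control flow that a dedicated instruction set can avoid.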

Slide 7: Hashing: TIE Development

Pure C code:

    unsigned int hash, shval, shval_neg;
    unsigned int mask = 0xFFFFFFFF;
    //init pointer, variables
    init_states(key, hashvalue, hashfunc);
    for(i=0; i<keysize; i++){
        //load key, bit selection
        hash = key[i] & hashfunc;
        //extract bits
        for(j=30; j>=0; j--){
            if(!(hashfunc & (0x1<<j))){
                //partial shift right: close the gap at unselected bit j
                shval = hash & (mask<<j);
                shval_neg = hash & ~(mask<<j);
                hash = (shval>>1) | shval_neg;
            }
        }
        //store hash value
        hashvalue[i] = hash;
    }

C code with TIE instructions:

    LD_0(); LD_1();
    //load keys, extract bits, store hash values
    for(i=0; i<(keysize/16); i++){
        LD_0(); LD_1(); HOP();   // 1 cycle
        LD_0(); LD_1(); HOP();   // 1 cycle
        ST_0(); ST_1();          // 1 cycle
    }
    HOP(); ST_0(); ST_1();
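The pure C loop on this slide selects the key bits marked in `hashfunc` and packs them toward the least significant bit, which is the same effect as a parallel bit extract. The sketch below wraps that loop in a standalone function so it can be tried on a host machine, and adds a straightforward bit-by-bit reference for comparison. The names `extract_bits` and `extract_bits_ref` are assumptions, and the OR that combines the two partial values is assumed to be the intended operator.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the bits of key selected by hashfunc toward the LSB,
 * following the shift-and-merge loop from the slide.
 * Hypothetical wrapper name; the OR in the merge step is an assumption. */
uint32_t extract_bits(uint32_t key, uint32_t hashfunc)
{
    const uint32_t mask = 0xFFFFFFFFu;
    uint32_t hash = key & hashfunc;          /* bit selection */

    for (int j = 30; j >= 0; j--) {
        if (!(hashfunc & (UINT32_C(1) << j))) {
            uint32_t upper = hash & (mask << j);   /* bits j and above */
            uint32_t lower = hash & ~(mask << j);  /* bits below j */
            hash = (upper >> 1) | lower;           /* close the gap at bit j */
        }
    }
    return hash;
}

/* Reference: walk the mask from LSB to MSB and append each selected bit. */
uint32_t extract_bits_ref(uint32_t key, uint32_t hashfunc)
{
    uint32_t out = 0;
    int pos = 0;

    for (int b = 0; b < 32; b++) {
        if (hashfunc & (UINT32_C(1) << b)) {
            if (key & (UINT32_C(1) << b))
                out |= UINT32_C(1) << pos;
            pos++;
        }
    }
    return out;
}
```

For example, `extract_bits(0xB, 0xA)` selects bits 3 and 1 of the key `1011` and yields `0x3`. The inner loop costs up to 31 iterations per key in software, which is the overhead the HOP instruction is meant to remove.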

Slide 8: Hashing: Instruction Flow
[Dataflow diagram]
Load: LD_0 reads Key_0 to Key_3 from Local Data Memory 0 through Load-Store Unit 0; LD_1 reads Key_4 to Key_7 from Local Data Memory 1 through Load-Store Unit 1
Execution: HOP applies the Hash Func in eight parallel HASH Op. units
Store: ST writes Result_0 to Result_7

Slide 9: Hashing: Pipeline Snippet
[Pipeline diagram: LD_0, LD_1, HOP, ST_0, and ST_1 overlapped across cycles n to n+8]
Latency: 6 cycles

Slide 10: Hashing: Results
Final processor:
+1 Load-Store unit (2x)
+ TIE instructions (500x)
Data bus: 32 to 128 bit (2x)
Throughput: $T = n_{\mathrm{key}} / t$
n_key: number of keys
t: time to perform the operation
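The throughput definition on this slide (number of keys divided by the time to perform the operation) can be evaluated from a cycle count when the clock frequency is known. The helper below is a minimal sketch; the function name and the numbers in the usage note are assumptions for illustration, not measurements from the talk.

```c
#include <assert.h>

/* Throughput T = n_key / t, where the time t is expressed as
 * cycles / f_hz. Returns keys per second.
 * Hypothetical helper; all parameters are supplied by the caller. */
double throughput_keys_per_s(double n_key, double cycles, double f_hz)
{
    assert(cycles > 0.0 && f_hz > 0.0);
    return n_key * f_hz / cycles;
}
```

As a purely hypothetical example, a loop that retires 16 keys every 3 cycles at 500 MHz would give `throughput_keys_per_s(16.0, 3.0, 500e6)`, roughly 2.67e9 keys per second. The result is a steady-state figure that ignores pipeline fill and memory transfer time.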

Slide 11: Tomahawk Concept
SW Application: control flow from Start to End through sections S1, S2, S3 (with an "if" branch) and tasks T1 to T5
Control-Plane Subsystem: Application Processor (with Cache), Core Manager
Global Memory
Data-Plane Subsystem: PE1 ... PEn, each with Local Memory
[Diagram: tasks T1 to T5 mapped onto PE1 to PE3 over time]
Dynamic out-of-order task dispatching
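The slide names dynamic out-of-order task dispatching but does not show how tasks are chosen. One common way to model the idea is a ready-list scheduler: in each round, any task whose dependencies have finished may be sent to a free processing element, regardless of program order. The sketch below is a simplified host-side model under that assumption. The `task_t` type, the bitmask encoding, and the lock-step rounds are hypothetical and do not describe the actual CoreManager.

```c
#include <assert.h>

#define MAX_TASKS 16

/* Hypothetical task descriptor: bit d of deps is set when
 * task d must finish before this task may start. */
typedef struct {
    unsigned deps;
} task_t;

/* Dispatch up to num_pes ready tasks per round. Task indices are
 * written to order[] in dispatch order. Returns the number of rounds,
 * or -1 if no task can make progress (dependency cycle).
 * Simplification: all tasks of a round finish before the next round. */
int dispatch(const task_t *tasks, int n, int num_pes, int *order)
{
    unsigned done = 0;       /* bitmask of finished tasks */
    int dispatched = 0;
    int rounds = 0;

    assert(n > 0 && n <= MAX_TASKS && num_pes > 0);
    while (dispatched < n) {
        unsigned running = 0;
        int used = 0;

        for (int t = 0; t < n && used < num_pes; t++) {
            int is_done = (int)((done >> t) & 1u);
            int ready = (tasks[t].deps & ~done) == 0;
            if (!is_done && ready) {
                running |= 1u << t;
                order[dispatched++] = t;
                used++;
            }
        }
        if (used == 0)
            return -1;       /* nothing ready: cyclic dependencies */
        done |= running;
        rounds++;
    }
    return rounds;
}
```

With four tasks where task 1 depends on task 0, task 3 depends on tasks 1 and 2, and two processing elements, the dispatch order is 0, 2, 1, 3: task 2 overtakes task 1 because it is ready earlier.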

Slide 12: Tomahawk3 MPSoC
28 nm CMOS SLP Globalfoundries
Die Size: 18 mm²
Tape-Out: Oct
Heterogeneous MPSoC (2 core types)
Network-on-Chip: High-Speed Serial Data Link with 2 Routers
Power Management: DVFS, AVS
2x LPDDR2 Memory Interface: 2x 64 MB SDRAM
Processing Elements: Tensilica Xtensa LX5, Extended Instruction Set for Database Applications
CoreManager: Tensilica Xtensa LX5, Optimized for query processing

Slide 13: Single Core Performance

Slide 14: System Level
[Sequence diagram]
Participants: CoreManager, Processing Element (DMAC, Core), Global Memory
1. Init Data Trf.: READ
2. Start Data Transfer, Data Transfer (READ), End Data Transfer
3. Start Core, Processing, Core finished
4. Init Data Trf.: WRITE
5. Start Data Transfer, Data Transfer (WRITE), End Data Transfer
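The system-level flow on this slide moves input data from global memory to the processing element, runs the core on local data, and moves the result back. The sketch below models that pattern on a host machine, with `memcpy` standing in for the DMA transfers. The buffer size, the `kernel_fn` type, and both function names are assumptions for illustration and do not correspond to any Tomahawk3 API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local buffer size, chosen only for this sketch. */
#define LOCAL_MEM_BYTES 1024

/* Hypothetical processing kernel that works on local memory only. */
typedef void (*kernel_fn)(const uint8_t *in, uint8_t *out, size_t n);

/* Stage one block through local buffers:
 * READ transfer, processing, WRITE transfer. */
void run_block(const uint8_t *global_in, uint8_t *global_out,
               size_t n, kernel_fn kernel)
{
    uint8_t local_in[LOCAL_MEM_BYTES];
    uint8_t local_out[LOCAL_MEM_BYTES];

    assert(n <= LOCAL_MEM_BYTES);
    memcpy(local_in, global_in, n);     /* stands in for the DMA READ */
    kernel(local_in, local_out, n);     /* core touches local data only */
    memcpy(global_out, local_out, n);   /* stands in for the DMA WRITE */
}

/* Example kernel: bitwise inversion of every byte. */
void invert_bytes(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)~in[i];
}
```

Calling `run_block(in, out, n, invert_bytes)` processes one block. In this model the three phases run strictly one after another, with no overlap of transfer and processing.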

Slide 15: Scan Benchmark
Scan operation scans a data set with respect to a reference element (Filtering, Equality Search)
Result of each comparison is one bit
TIE instructions available for 4, 8, 16, and 32-bit input values
Example: input set with 8-bit elements, result set with single bits
Scan reference element:
Advantages: high instruction-level parallelism, high data-level parallelism
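The scan described on this slide compares every element with a reference element and emits one result bit per comparison. A scalar C sketch of the equality case for 8-bit elements follows. The function name `scan_eq_u8` and the LSB-first bit packing are assumptions; the slides do not specify the result layout, and this is not the TIE implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Equality scan over n 8-bit elements.
 * Bit (i % 8) of out[i / 8] is set when data[i] == ref.
 * out must provide (n + 7) / 8 bytes.
 * Hypothetical helper; LSB-first packing is an assumption. */
void scan_eq_u8(const uint8_t *data, size_t n, uint8_t ref, uint8_t *out)
{
    assert(out != NULL);
    memset(out, 0, (n + 7) / 8);
    for (size_t i = 0; i < n; i++) {
        if (data[i] == ref)
            out[i / 8] |= (uint8_t)(1u << (i % 8));
    }
}
```

For the nine input values `{5, 3, 5, 7, 5, 5, 0, 5, 5}` and reference 5, the result bytes are `0xB5` and `0x01`. Every comparison is independent of the others, which is why the operation lends itself to wide data paths.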

Slide 16: Scan Benchmark: Results
[Plot: Performance. Throughput [Gbit/s] (log scale) vs. Number of Cores; RISC and ASIP at 250 MHz and 500 MHz; annotations: 100x, 4x]
[Plot: Power Consumption. Power [mW] (log scale) vs. Number of Cores; RISC and ASIP at 250 MHz and 500 MHz]
Speedup due to:
- Core Extensions/TIE: 100x
- 1 core to 4 cores: 4x
Power increase due to:
- Core Extensions/TIE: 10 mW
- 1 core to 4 cores: 190 mW

Slide 17: Scan Benchmark: Results
[Plot: Energy Efficiency. Energy Efficiency [pJ/bit] vs. Number of Cores; RISC and ASIP at 250 MHz and 500 MHz; annotation: 4x]
Energy decrease due to:
- Core Extensions/TIE: 100x
- 1 core to 4 cores: negligible
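Energy per bit, the metric plotted on this slide, is power divided by throughput. With power in mW and throughput in Gbit/s, the quotient comes out directly in pJ/bit, because 1 mW / (1 Gbit/s) = 10^-3 J/s / 10^9 bit/s = 10^-12 J/bit. The helper name and the numbers in the usage note are assumptions for illustration, not values from the talk.

```c
#include <assert.h>

/* Energy per bit in pJ/bit from power in mW and throughput in Gbit/s.
 * 1 mW / (1 Gbit/s) equals 1 pJ/bit, so no scaling factor is needed.
 * Hypothetical helper. */
double energy_pj_per_bit(double power_mw, double throughput_gbit_s)
{
    assert(power_mw >= 0.0 && throughput_gbit_s > 0.0);
    return power_mw / throughput_gbit_s;
}
```

For instance, a hypothetical core drawing 200 mW at 50 Gbit/s would spend 4 pJ/bit. The same identity holds one step up: W divided by Gbit/s gives nJ/bit.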

Slide 18: Benchmark Comparisons
[Table]
Application scenarios and platforms:
- Scan: Tomahawk3, 2x Intel Xeon E [1], 2x Intel Xeon E5430 [2]
- WAH Indexing: Tomahawk3, Intel i7-2600k [3], NVIDIA GTX-670 [3]
Rows: Processing Cores, Clock Freq. [GHz], Data Deposit, Total Throughput [Gbit/s], Total Power [W], DRAM Power [W], Power/Throughput [nJ/bit]
Data Deposit: Loc. Mem, DRAM, Cache, DRAM, DRAM, Loc. Mem, DRAM, DRAM, DRAM
References:
[1] F. Fusco, et al. Indexing Million of Packets per Second using GPUs. In Proceedings of the 2013 Conference on Internet Measurement Conference (IMC'13).
[2] I. Psaroudakis, T. Kissinger, D. Porobic, T. Ilsche, E. Liarou, P. Tözün, A. Ailamaki, and W. Lehner. Dynamic fine-grained scheduling for energy-efficient main-memory queries. In Proceedings of the Tenth International Workshop on Data Management on New Hardware (DaMoN'14).
[3] D. Tsirogiannis, S. Harizopoulos, and M. A. Shah. Analyzing the energy efficiency of a database server. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD'10).

Slide 19: Tomahawk3 Demonstrator
Acknowledgements: We would like to thank Cadence and Tensilica for providing software tools and IP, as well as the Chair for Highly-Parallel VLSI-Systems and Neuromorphic Circuits for backend design and PCB development of the Tomahawk3 chip.

Slide 20: Thank you!
References:
[1] O. Arnold, S. Haas, G. Fettweis, B. Schlegel, T. Kissinger, W. Lehner: An Application-Specific Instruction Set for Accelerating Set-Oriented Database Primitives. SIGMOD.
[2] O. Arnold, S. Haas, G. Fettweis, B. Schlegel, T. Kissinger, T. Karnagel, W. Lehner: HASHI: An Application-Specific Instruction Set Extension for Hashing. ADMS.
[3] B. Nöthen et al.: A 105GOPS 36mm2 Heterogeneous SDR MPSoC with Energy-Aware Dynamic Scheduling and Iterative Detection-Decoding for 4G in 65nm CMOS. ISSCC 2014.
[4] O. Arnold, E. Matus, B. Nöthen, M. Winter, T. Limberg and G. Fettweis: Tomahawk - Parallelism and Heterogeneity in Communications Signal Processing MPSoCs. TECS 2013.

Overview on Hardware Optimizations for Database Engines Annett Ungethüm, Dirk Habich, Tomas Karnagel, Sebastian Haas, Eric Mier, Gerhard Fettweis, Wolfgang Lehner BTW 2017, Stuttgart, Germany, 2017-03-09