Domain-specific Architectures for Emerging Data-Centric Workloads


1 Domain-specific Architectures for Emerging Data-Centric Workloads Kevin Lim November 8, 2012 Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

2 Hyperscale computing driving need for efficiency. Web-scale infrastructures with 10,000s of servers. At current trends, large-scale systems are expected to consume >100 MW in 2018. Scale-out software has been developed to meet data needs and leverage this infrastructure: NoSQL (HBase, Cassandra, Couchbase, MongoDB, ...), distributed application frameworks (Hadoop, Hive, Pregel, ...), and cloud (Amazon Web Services, Azure, ...). For many important workloads, data must be delivered at very low latency: "Everything runs from memory in Web 2.0" (Evan Weaver); in-memory databases and web search indices; in-memory distributed caching. [Figure: software stack (NoSQL, distributed application framework, cloud); datacenter photos captioned Prineville, OR and The Dalles, OR]

3 SoC-based servers: Reaching a tipping point. Aggressive SoCs are largely absent in general-purpose CPUs. Integration trend: fairly slow over time (caches, memory controller, northbridge, GPUs). Today: mounting cost and energy pressure in datacenters is accelerating the need for SoCs. Energy: fewer pin crossings, more efficient (heterogeneous) blocks. Cost: fewer sockets, lower BoM, etc.

4 SoCs enable high-density computing. Example: HP's Project Moonshot Redstone platform with the Calxeda EnergyCore (quad-core SoC): 4 SoCs/board x 18 boards = 72 SoCs in 1U, 2880 servers per rack. Just the beginning of ultra-dense servers: ensembles of multiple SoCs per server node; dense tray-level aggregation with shared everything (I/O, management); optimized in cost, space, and power for scale-out.
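The density figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes that each SoC counts as one server; the number of 1U-equivalents per rack is derived from the slide's figures and is not stated in the talk.

```python
# Figures taken from the slide.
SOCS_PER_BOARD = 4
BOARDS_PER_1U = 18
SERVERS_PER_RACK = 2880  # assumption: one SoC is counted as one server

socs_per_1u = SOCS_PER_BOARD * BOARDS_PER_1U

# Derived value (not on the slide): 1U-equivalents implied per rack.
units_per_rack = SERVERS_PER_RACK // socs_per_1u

print(f"{socs_per_1u} SoCs per 1U, {units_per_rack} x 1U-equivalents per rack")
```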

5 But data-centric challenges still remain. Drive towards low latency: time is money! Data explosion: data is growing faster than technology. General-purpose CPUs are not the answer: they are ill-matched to server workloads and spend most of their time waiting for data rather than computing. Opportunity to specialize for data-centric workloads.

6 How do data-centric workloads access data? Exemplar: in-memory databases. Databases create and use an index: a data structure for fast data lookup, most often a balanced tree or a hash table, and frequently accessed. [Figure: hash table and tree]

7 SoC specialization: Database Indexing Widget. Index lookups on general-purpose CPUs are pointer-intensive and time-intensive: low IPC (as low as 0.25 on an OoO core) and poor energy efficiency. Database Indexing Widget: dedicated hardware for database index lookups. Full-service offload: the core sleeps when the widget runs. Up to 65% less energy per query.

8 Outline: SoC Motivation; Big Data Challenges; Indexing in Databases; Indexing Widget; Results; Opportunities and Challenges.

9 Modern databases and indexing. Two types of contemporary in-memory databases: column-store analytical processing (with DSS) and scale-out transaction processing (with OLTP). [Figure: Customer / Date / Product table layouts for each database type] Two fundamental indexing operations: hash table probe and tree traversal.

10 How much time is spent indexing? Measurement on a Xeon 5670 CPU with hardware counters. [Figure: share of total execution time spent indexing, per query; OLTP: Payment, Order Status; DSS: Query 2, Query 9, Query 11, Query 17; each bar labeled with its index type, hash table or tree] Indexing can account for up to 73% of execution time.

11 Example: Indexing with Tree Traversals. SQL: SELECT A_Product, A_Customer FROM A WHERE A_age = 25. Index on A_age. [Figure: tree index searched for key 25, with nodes 10, 8, 15, 12, 25; the matching node holds a tuple pointer into table A (Customer, Age, Date, Product), which yields the result]

12 Example: Indexing with Tree Traversals. SQL: SELECT A_Product, A_Customer FROM A WHERE A_age = 25. Index on A_age. Each index traversal: 10K-15K dynamic instructions, lots of pointer chasing, 50-60% memory references.
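To make the pointer chasing concrete, here is a minimal sketch of an index lookup for the query above. It uses a plain (unbalanced) binary search tree rather than the balanced tree a real database would use, and the table rows, the `Node` class, and the `insert` and `lookup` helpers are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int                       # indexed value (A_age)
    tuple_ptr: int                 # row position, standing in for a tuple pointer
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, tuple_ptr):
    """Insert a key into the tree and return the (possibly new) root."""
    if root is None:
        return Node(key, tuple_ptr)
    if key < root.key:
        root.left = insert(root.left, key, tuple_ptr)
    else:
        root.right = insert(root.right, key, tuple_ptr)
    return root


def lookup(root, key):
    """Return (tuple_ptr or None, number of nodes visited)."""
    hops = 0
    node = root
    while node is not None:
        hops += 1                  # each hop is a dependent load: pointer chasing
        if key == node.key:
            return node.tuple_ptr, hops
        node = node.left if key < node.key else node.right
    return None, hops


# Hypothetical table A: (customer, age, date, product).
TABLE_A = [
    ("Ann", 10, "2012-01-05", "disk"),
    ("Bob", 8, "2012-02-11", "fan"),
    ("Cid", 15, "2012-03-02", "cable"),
    ("Dee", 12, "2012-04-19", "rack"),
    ("Eve", 25, "2012-05-23", "ssd"),
]

root = None
for row_id, row in enumerate(TABLE_A):
    root = insert(root, row[1], row_id)

# SELECT A_Product, A_Customer FROM A WHERE A_age = 25
ptr, hops = lookup(root, 25)
customer, _age, _date, product = TABLE_A[ptr]
print(product, customer, hops)
```

Every step of `lookup` depends on the node loaded in the previous step, which is why such traversals are dominated by memory latency rather than by computation.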

13 Outline: SoC Motivation; Big Data Challenges; Indexing in Databases; Indexing Widget; Results; Opportunities and Challenges.

14 Indexing Widget Overview. Dedicated offload engine for index lookups: activated on demand by the core; full-service index lookup; the core sleeps when the widget runs. Widget features: Efficient (specialized control and functional units); Low-latency (caches frequently accessed index data); Tightly integrated (uses the core's L1-D and TLB).

15 Widget Details. [Block diagram: configuration registers written from the core (index address, key, search type, result table address, data type); controller (FSM) with states init, tree, hash, eval, write, end; computational logic; buffer (SRAM). Steps: ❶ Configure, ❷ Run, ❸ Return]

16 Widget Details, step ❶ Configure. The core fills the configuration registers and starts the widget, and otherwise falls back to Hashprobe():

    if (haswidget) {
        widget.index  = &a;
        widget.key    = &b;
        widget.type   = equal;
        widget.result = &r;
        widget.data   = int;
        widget.run();
    } else {
        Hashprobe();
    }

[Block diagram: configuration registers written from the core (index address, key, search type, result table address, data type); controller (FSM) with states init, tree, hash, eval, write, end; computational logic; buffer (SRAM)]
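The configure-and-run pattern with a fallback path can be mimicked in software. The following is only an illustrative model, assuming a Python dict as the index and an "equal" search type; `WidgetModel`, `hash_probe`, and `index_lookup` are hypothetical names and do not describe the actual hardware interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WidgetModel:
    """Software stand-in for the widget's configuration registers."""
    index: Optional[dict] = None    # index address
    key: Optional[int] = None       # search key
    search_type: str = "equal"      # search type
    result: Optional[list] = None   # result table address

    def run(self):
        # Models the run and return steps: probe the index, then
        # write any match into the result table.
        if self.search_type != "equal":
            raise NotImplementedError("only equality search is modeled")
        if self.key in self.index:
            self.result.append(self.index[self.key])


def hash_probe(index, key, result):
    """Fallback path executed by the core when no widget is present."""
    if key in index:
        result.append(index[key])


def index_lookup(index, key, widget=None):
    result = []
    if widget is not None:          # corresponds to "if (haswidget)"
        widget.index = index
        widget.key = key
        widget.search_type = "equal"
        widget.result = result
        widget.run()
    else:
        hash_probe(index, key, result)
    return result


index = {25: "row-4", 12: "row-3"}
print(index_lookup(index, 25, WidgetModel()))
print(index_lookup(index, 25))
```

Both paths return the same result table, which is the property that lets software use the widget opportunistically.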

17 Widget Details, step ❷ Run. [Block diagram: configuration registers written from the core (index address, key, search type, result table address, data type); controller (FSM) with states init, tree, hash, eval, write, end; computational logic; buffer (SRAM); data path to/from L1]

18 Widget Details, step ❸ Return. [Block diagram: configuration registers written from the core (index address, key, search type, result table address, data type); controller (FSM) with states init, tree, hash, eval, write, end; computational logic; buffer (SRAM); data path to/from L1 carrying stores of the form "Store &Result Table, Key"]

19 Widget Details. Indexing Widget: 1) private L1 design; 2) shared L2 design.

20 Methodology. First-order analytical model. Execution traces: Pin. Execution profiling: VTune, OProfile. Benchmark applications: OLTP: TPC-C on VoltDB; DSS: TPC-H on MonetDB. Model parameters: L1 / L2 / LLC / off-chip latency: 4 / 6 / 29 / 180 cycles; widget buffer: fully associative cache. Energy estimations: McPAT.
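The talk does not spell out its analytical model, so the following is only a guess at the simplest form such a model could take: a weighted average of the latencies listed above. The service fractions and the number of memory references are made-up inputs, and the function names are hypothetical.

```python
# Latencies in cycles, taken from the model parameters on the slide.
LATENCY_CYCLES = {"L1": 4, "L2": 6, "LLC": 29, "off_chip": 180}


def avg_access_cycles(service_fraction):
    """Average cycles per memory reference.

    service_fraction maps a level name to the fraction of references
    served at that level; the fractions must sum to 1.
    """
    unknown = set(service_fraction) - set(LATENCY_CYCLES)
    if unknown:
        raise ValueError(f"unknown levels: {sorted(unknown)}")
    if abs(sum(service_fraction.values()) - 1.0) > 1e-9:
        raise ValueError("service fractions must sum to 1")
    return sum(frac * LATENCY_CYCLES[level]
               for level, frac in service_fraction.items())


def traversal_cycles(memory_refs, service_fraction):
    """Memory cycles for one lookup, assuming fully serialized references."""
    return memory_refs * avg_access_cycles(service_fraction)


# Assumed example: half of the references hit in L1, half go off-chip.
example = {"L1": 0.5, "off_chip": 0.5}
print(avg_access_cycles(example))
print(traversal_cycles(100, example))
```

Treating the references as fully serialized is a simplification that suits pointer chasing, where each load address depends on the previous load.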

21 Performance with Indexing Widget. [Figure: speedup (y-axis, 0 to 7) for small tree, large tree, and hash table indexes; x-axis: 512B, 1KB, 2KB, 4KB, 8KB] Need to accelerate both instructions and memory.

22 Energy Efficiency with Indexing Widget. [Figure: reduction in energy (%) versus application coverage (%) for Qry 9, Qry 11, Order S., Qry 2, Payment, and Qry 17; series: reduction over conventional OoO and reduction over ARM-like* OoO] Up to 65% reduction in energy.

23 Summary. Data explosion and business trends call for specialization: SoC servers are a building block for improved efficiency, but even greater rethinking of architectures is needed. Exemplar data-centric workload: in-memory database. It spends significant time in indexing, which is mostly pointer chasing, and general-purpose CPUs are poorly suited to it. Augment the CPU with an indexing widget: a dedicated offload engine (the core sleeps when the widget runs) that improves efficiency, with up to 65% less energy and 6x faster query execution.

24 Opportunities and Challenges. Many other classes of data-centric workloads: unstructured text search, key-value caches, automatic document summarization, distributed array processing. But the market cannot support unlimited specialization. Are there key specialization elements that are common? Can we design reusable, purpose-built architectures? How do we provide portable software interfaces?

26 Invasive Tightly-Coupled Processor Arrays Frank Hannig Hardware/Software Co-Design Department of Computer Science University of Erlangen-Nuremberg, Germany

27 Motivation. Power wall: only a constant power budget is available per device. On the other hand, there is a steady demand for more computing power. Consequence: energy per operation must decrease. These days, energy efficiency (OPS/Watt) is much more important than pure computing power (OPS). This trend goes across all levels, from portable devices such as mobile phones to high-performance computing systems. OPS = operations per second. Frank Hannig, Int. Workshop on Domain-Specific Multicore Computing, Nov. 8, 2012, San Jose, CA.

28 Motivation (cont'd). How can we achieve better energy efficiency? Heterogeneous architectures; multiple cores; domain-specific components such as customized instruction set extensions and accelerators (GPUs, FPGAs, dedicated hardware). Customization and heterogeneity are the key to success for future performance gains.

29 Overview: Tightly-coupled processor arrays; Invasive computing and invasive tightly-coupled processor arrays; Power management; Tools and application mapping; Conclusions.

30 Tightly-coupled processor arrays (TCPAs). Class of massively parallel on-chip processor architectures. Highly customizable at synthesis time: many parameters and configuration options, e.g., number of processors, width of data path. Flexibility at run time: programmable, reconfigurable interconnect. Used as accelerators in MPSoC or tiled architectures for compute-intensive tasks from domains such as digital signal processing, audio and video; image processing, object recognition; linear algebra (matrix / vector computations); cryptography; and other streaming applications.

31 TCPAs (cont'd). Processing elements (PEs): VLIW architecture; weakly programmable; small instruction set and memory; small register file; no direct main memory access. Interconnect: switched connections; single-cycle latency; switching possibilities can be defined at synthesis time, which enables the configuration of different interconnect topologies at run time.

32 TCPAs (cont'd). Local memory structure: reconfigurable buffers at the borders of the array; FIFOs at PE inputs; feedback shift registers for cyclic data reuse within PEs (these enable efficient handling of loop-carried data dependencies). Zero-overhead static control flow: multiway branch unit in each PE, combined with globally generated control predicates.

33 Resource management and mapping. Traditional approach: small number of PEs; a single application runs exclusively on the array; static mapping onto resources. Reality and future: beware of Moore's law. Growth is exponential: the number of PEs doubles every 18 to 24 months.

34 Challenges. Questions: How can several applications with run-time-variant constraints be mapped simultaneously onto such architectures? How can hundreds to thousands of compute resources be managed? How can energy be saved in phases when less computational power is needed? Can the architecture adapt to temporary faults (e.g., overheated regions) or permanent faults?

35 Invasive computing. In general, as defined by Teich: a distributed resource-aware computing methodology that gives each application the ability to explore and claim (invade) available computing resources, and copy its configuration code to such captured resources (infect), and then to execute the given program in parallel. After finishing a phase of execution, the application may free its previously occupied resources by performing a release operation (retreat). Invasion phases on TCPAs: 1. Invade: the application explores the availability of resources in its neighborhood by sending invade messages. 2. Infect: the captured (invaded) PEs are configured with the application. 3. Retreat: each processing element frees its captured neighbors by sending retreat messages.
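The three phases can be illustrated with a small, purely software model of a processing-element array. This is a sketch under simplifying assumptions (a single eastward walk along one row, no message passing, no timing); the `PEArray` class and its method signatures are hypothetical and not part of any invasive computing API.

```python
class PEArray:
    """Toy model of an array of processing elements (PEs)."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.owner = {}     # (row, col) -> application that captured the PE
        self.program = {}   # (row, col) -> configuration loaded by infect

    def invade(self, app, start, wanted):
        """Claim up to `wanted` free PEs, walking east from `start`."""
        row, col = start
        claimed = []
        while col < self.cols and len(claimed) < wanted:
            if (row, col) in self.owner:
                break       # next neighbor is already captured: stop the walk
            self.owner[(row, col)] = app
            claimed.append((row, col))
            col += 1
        return claimed

    def infect(self, app, claimed, config):
        """Load the application's configuration onto its captured PEs."""
        for pe in claimed:
            if self.owner.get(pe) != app:
                raise PermissionError(f"{pe} is not owned by {app}")
            self.program[pe] = config

    def retreat(self, app, claimed):
        """Free the PEs previously captured by `app`."""
        for pe in claimed:
            if self.owner.get(pe) == app:
                del self.owner[pe]
                self.program.pop(pe, None)


array = PEArray(rows=2, cols=4)
a_pes = array.invade("A", start=(0, 0), wanted=3)
array.infect("A", a_pes, config="fir_filter")
b_pes = array.invade("B", start=(0, 1), wanted=2)   # blocked by A
array.retreat("A", a_pes)
b_retry = array.invade("B", start=(0, 1), wanted=2)  # succeeds after retreat
print(a_pes, b_pes, b_retry)
```

An application that gets fewer PEs than it asked for, as B does on its first attempt, has to adapt or retry, which is what makes the approach resource-aware.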

36 Invasive TCPAs. Each PE is equipped with an invasion controller (ictrl) to enable fast and decentralized resource management. Design choices: different resource exploration methods (invasion strategies): a linearly connected region of PEs or a rectangular connected region of PEs; different designs for the invasion controller: FSM-based or programmable.

37 Invasion strategies. Linear invasion, different policies. Straight policy: capturing of PEs in a straight-walk fashion with maximal length of walks. Random policy: selection of an available neighbor in a random-walk fashion. Meander policy: capturing of PEs in a meander-walk fashion.

38 Invasion strategies (cont'd). Rectangular invasion: linear invasion in one dimension, combined with parallel invasion in the second dimension. Results: Linear invasion takes only 2 cycles per PE; due to parallel execution, rectangular invasion is even faster. The meander policy for linear invasion is preferable since it fragments the array least. Compared with centralized software-based resource management, resource exploration and reservation can be achieved 50 times faster. Moderate hardware cost of the invasion controller (reported in LUTs in Virtex-6 FPGA technology).
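A meander walk can be pictured as a row-by-row sweep that reverses direction on every row, so that each captured PE is a direct neighbor of the previous one. The sketch below generates that order on an idealized, completely free array; the function names are hypothetical, and the cycle estimate simply applies the 2-cycles-per-PE figure quoted above.

```python
def meander_walk(rows, cols, count=None):
    """Order in which a meander walk visits PEs, starting at (0, 0)."""
    order = []
    for r in range(rows):
        # Even rows run west-to-east, odd rows east-to-west.
        columns = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in columns)
    return order[:count]


def is_connected_walk(walk):
    """True if every PE is a direct (4-neighborhood) neighbor of its predecessor."""
    return all(abs(r1 - r0) + abs(c1 - c0) == 1
               for (r0, c0), (r1, c1) in zip(walk, walk[1:]))


def linear_invasion_cycles(num_pes, cycles_per_pe=2):
    """Estimated invasion time using the per-PE figure from the slide."""
    return num_pes * cycles_per_pe


walk = meander_walk(rows=3, cols=4)
print(walk[:5])
print(is_connected_walk(walk), linear_invasion_cycles(len(walk)))
```

Because the walk fills one row before moving to the next, the captured region stays compact, which matches the intuition behind the meander policy leaving the array less fragmented.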

39 Power management in TCPAs. Invasive computing as an enabler for power management. How to save static power? Power on invaded PEs; power off idle PEs. Hierarchical power management strategy with separate power domains for the invasion controller (first level of power management) and the processing element (second level of power management).

40 Hierarchical power management. [Figure: invade phase on an array of invasion controllers (Inv. Ctrl), each with a PMU; controllers are marked powered ON or powered OFF, with t(p_ON) and t(p_OFF) delays annotated along the Invade and Claim steps] The retreat phase works similarly. PMU: Power Management Unit.

41 Implementation options. Multiple invasion controllers are grouped into a single power domain to reduce area and invasion time. [Figure: eight invasion controllers (Inv. Ctrl) with eight PMUs]

42 Implementation options (cont'd). [Figure: eight invasion controllers (Inv. Ctrl) with four PMUs]

43 Implementation options (cont'd). [Figure: eight invasion controllers (Inv. Ctrl) with two PMUs]

44 Power management results. Depending on array utilization, a reduction of leakage power between 30% and 90% can be achieved. Trade-off: power, area, and invasion latency per PE for different power domain groupings. [Figure: bubble chart; y-axis: area (million NAND2-equivalent gates); x-axis: average invasion latency per PE (cycles); bubble size and color: power; data points for several groupings (including 2 and 8 Inv. Ctrl/domain) and for a design without power gating] Experimental setup: 10x12 TCPA; TSMC 65nm LP; Synopsys tool chain; PVT corner fast_1.32v_125c (max leakage corner); post-synthesis results.

45 Tool suite. Integrated development environment. Graphical design entry, i.e., parameterization of hundreds of design options; alternatively, a powerful ADL (architecture description language) can be used. Push-button generation of: a fast cycle-accurate simulator, which can be wrapped into SystemC for co-simulation; synthesizable HDL code; configuration streams. Structural assembly code entry and intuitive (graphical) interconnect setup. Back annotation and visualization of simulation results offer rich debugging possibilities.

46 Application mapping. High-level programming environment: LoopInvader. Input: the DSL (domain-specific language) PAULA, a functional language covering the domain of multi-dimensional affine recurrence equations defined over polyhedral iteration spaces, with multi-dimensional reductions. [Flow diagram, in order: Algorithm (DSL PAULA); High-Level Transformations (Localization, Loop Perfectization, Output Normal Form, Loop Unrolling, Partitioning, Expression Splitting, Affine Transformations, ...); Allocation; Space-Time Mapping; Scheduling; Resource Binding; Code Generation (VLIW code for each PE, configuration of interconnect, code of controller); TCPA configuration; simulation or TCPA, with an architecture model as a further input]

47 Application mapping (cont'd). DSL PAULA example: discrete 2-D Gauss window filter.

    ...
    par (x >= 0 and x < 1280 and y >= 0 and y < 1024) {
      w[0,0]=1; w[0,1]=2; w[0,2]=1;
      w[1,0]=2; w[1,1]=4; w[1,2]=2;
      w[2,0]=1; w[2,1]=2; w[2,2]=1;
    }
    h[x,y] = SUM[i>=0 and i<=2 and j>=0 and j<=2] (pic_in[x+i,y+j] * w[i,j]);
    pic_out[x,y] = h[x,y] >> 4; // divided by 16

$$\mathit{pic}_{\mathrm{out}}(x, y) = \frac{1}{16} \sum_{i=0}^{2} \sum_{j=0}^{2} \mathit{pic}_{\mathrm{in}}(x+i,\, y+j) \cdot w(i, j)$$
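For readers unfamiliar with PAULA, the same computation can be written as a sequential reference implementation. This is only a sketch for checking results on small inputs, not how a TCPA executes the filter; `gauss_filter` is a hypothetical name, the image is a list of columns indexed as `pic_in[x][y]`, and the output is two pixels smaller in each dimension because the window needs `x+2` and `y+2` to exist.

```python
# 3x3 Gauss window weights, as assigned to w[i,j] in the PAULA example.
W = [
    [1, 2, 1],
    [2, 4, 2],
    [1, 2, 1],
]


def gauss_filter(pic_in):
    """Apply the 3x3 Gauss window; pic_in is indexed as pic_in[x][y]."""
    nx = len(pic_in) - 2
    ny = len(pic_in[0]) - 2
    pic_out = [[0] * ny for _ in range(nx)]
    for x in range(nx):
        for y in range(ny):
            # h[x,y]: weighted sum over the 3x3 window anchored at (x, y).
            h = sum(pic_in[x + i][y + j] * W[i][j]
                    for i in range(3) for j in range(3))
            pic_out[x][y] = h >> 4   # the weights sum to 16, so shift by 4
    return pic_out


flat = [[16] * 4 for _ in range(4)]            # constant image
impulse = [[0, 0, 0], [0, 16, 0], [0, 0, 0]]   # single bright pixel
print(gauss_filter(flat))
print(gauss_filter(impulse))
```

A constant image passes through unchanged because the weights sum to 16, which is a quick sanity check for the normalization by the right shift.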

48 Application mapping (cont'd). Loop partitioning is used for parallelization and mapping. [Figure: dependence graph mapped onto the architecture]

49 Application mapping (cont'd). Current research: symbolic loop parallelization and mapping (symbolic partitioning, symbolic scheduling, symbolic control generation). At run time, only the symbols in the configuration data are replaced according to the available resources. Algorithms can be adapted quickly at run time without recompilation. Self-adaptation enables fast reaction to QoS parameters, system load, failures, etc.

50 Conclusions. Invasive Tightly-Coupled Processor Arrays: a domain-specific remedy for the arising challenges in multi-processor architectures. Hardware support for ultra-fast resource reservation. Invasive computing as an enabler for power management. Careful co-design of domain-specific architectures, domain-specific languages, and mapping tools is needed in order to achieve energy-efficient solutions while simultaneously ensuring high productivity (time-to-market) as well as portability and adaptability.

51 Further information. Invasive Tightly-Coupled Processor Arrays. Contact: Frank Hannig, Hardware/Software Co-Design, Department of CS, University of Erlangen-Nuremberg, Cauerstr. 11, Erlangen, Germany. Acknowledgements: contributors (in alphabetical order): Srinivas Boppu, Frank Hannig, Dmitrij Kissler, Alexej Kupriyanov, Vahid Lari, Shravan Muddasani, Alex Tanase, Jürgen Teich. This work is supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre Invasive Computing (SFB/TR 89).

52 References
[1] D. Kissler, F. Hannig, A. Kupriyanov, and J. Teich, "A highly parameterizable parallel processor array architecture," in Proc. of the IEEE Int. Conf. on Field Programmable Technology (FPT), Bangkok, Thailand, Dec.
[2] J. Teich, "Invasive Algorithms and Architectures," it - Information Technology, 50 (2008).
[3] J. Teich, J. Henkel, A. Herkersdorf, D. Schmitt-Landsiedel, W. Schröder-Preikschat, and G. Snelting, "Invasive Computing: An Overview," in Multiprocessor System-on-Chip: Hardware Design and Tool Integration, Springer, Berlin, 2011.
[4] V. Lari, A. Narovlyanskyy, F. Hannig, and J. Teich, "Decentralized Dynamic Resource Management Support for Massively Parallel Processor Arrays," in Proc. of the 22nd IEEE Int. Conf. on Application-specific Systems, Architectures, and Processors (ASAP), Santa Monica, CA, USA, Sep.
[5] D. Kissler, D. Gran, Z. Salcic, F. Hannig, and J. Teich, "Scalable Many-Domain Power Gating in Coarse-grained Reconfigurable Processor Arrays," IEEE Embedded Systems Letters, 3(2):58-61.
[6] V. Lari, S. Muddasani, S. Boppu, F. Hannig, M. Schmid, and J. Teich, "Hierarchical Power Management for Adaptive Tightly-Coupled Processor Arrays," to appear in ACM Transactions on Design Automation of Electronic Systems (TODAES).


Dark Silicon Accelerators for Database Indexing Onur Kocberber, Kevin Lim, Babak Falsafi, Partha Ranganathan, Stavros Harizopoulos Dark Silicon and Big Data Challenges Data explosion Data growing faster