Domain-specific Architectures for Emerging Data-Centric Workloads


1 Domain-specific Architectures for Emerging Data-Centric Workloads Kevin Lim November 8, 2012 Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

2 Hyperscale computing driving need for efficiency
Web-scale infrastructures with 10,000s of servers
At current trends, large-scale systems are expected to consume >100 MW in 2018
Scale-out software developed to meet data needs and leverage infrastructure
  NoSQL: HBase, Cassandra, Couchbase, MongoDB, ...
  Distributed application framework: Hadoop, Hive, Pregel, ...
  Cloud: Amazon Web Services, Azure, ...
For many important workloads, data must be delivered at very low latency
  Everything runs from memory in Web 2.0 (Evan Weaver)
  In-memory databases, web search indices
  In-memory distributed caching
[Figure: software stack (NoSQL, distributed application framework, cloud); datacenter photos: Prineville, OR and The Dalles, OR]

3 SoC-based servers: Reaching a tipping point
Aggressive SoCs are largely absent in general-purpose CPUs
  Integration trend: fairly slow over time
  Caches, memory controller, northbridge, GPUs
Today: mounting cost and energy pressure in datacenters is accelerating the need for SoCs
  Energy: fewer pin crossings, more efficient (heterogeneous) blocks
  Cost: fewer sockets, BoM, etc.

4 SoCs enable high-density computing
Example: HP's Project Moonshot Redstone platform
  Calxeda EnergyCore (quad-core SoC)
  4 SoCs/board x 18 boards = 72 SoCs in 1U, 2880 servers per rack
Just the beginning of ultra-dense servers
  Multiple SoCs per server node: ensembles
  Dense tray-level aggregation, shared everything (I/O, management)
  Optimized in cost, space and power for scale-out
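
The density figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes that each SoC counts as one server and derives how many 72-SoC units the quoted rack figure implies; the constant names are illustrative and not taken from the talk.

```python
# Figures quoted on the slide.
SOCS_PER_BOARD = 4
BOARDS_PER_1U = 18
SERVERS_PER_RACK = 2880

# Assumption: one SoC is counted as one server.
socs_per_1u = SOCS_PER_BOARD * BOARDS_PER_1U

# Number of 72-SoC units implied by the rack-level figure.
implied_units_per_rack = SERVERS_PER_RACK // socs_per_1u

print(socs_per_1u, implied_units_per_rack)
```

Under that assumption, the rack figure corresponds to a whole number of 72-SoC units.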

5 But data-centric challenges still remain
Drive towards low latency
  Time is money!
Data explosion
  Data growing faster than technology
General-purpose CPUs are not the answer
  Ill-matched to server workloads
  Most of the time is spent waiting for data rather than computing
Opportunity to specialize for data-centric workloads

6 How do data-centric workloads access data?
Exemplar: In-memory databases
Databases create and use an index
  Data structures for fast data lookup
  Most often a balanced tree or hash table
  Frequently accessed
[Figure: hash table and tree index structures]

7 SoC specialization: Database Indexing Widget
Index lookups on general-purpose CPUs:
  Pointer-intensive
  Time-intensive
  Low IPC (as low as 0.25 on an OoO core)
  Poor energy efficiency
Database Indexing Widget
  Dedicated hardware for database index lookups
  Full-service offload: core sleeps when widget runs
  Up to 65% less energy per query

8 Outline
SoC Motivation
Big Data Challenges
Indexing in Databases
Indexing Widget
Results
Opportunities and Challenges

9 Modern databases and indexing
Two types of contemporary in-memory databases:
  Column-store analytical processing, with DSS
  Scale-out transaction processing, with OLTP
  [Figure: example tables with columns Customer, Date, Product]
Two fundamental indexing operations
  Hash table probe
  Tree traversal
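
As an illustration of the first of the two indexing operations, here is a minimal sketch of a hash table probe over a chained hash index. The class name, bucket count and the "entries inspected" counter are assumptions made for this example; they do not describe any particular database engine.

```python
class ChainedHashIndex:
    """Toy hash index: each bucket is a chain of (key, row_id) entries."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Python hashes small non-negative ints to themselves,
        # so integer keys map to predictable buckets.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, row_id):
        self._bucket(key).append((key, row_id))

    def probe(self, key):
        """Return (matching row ids, number of chain entries inspected)."""
        matches, inspected = [], 0
        for stored_key, row_id in self._bucket(key):
            inspected += 1
            if stored_key == key:
                matches.append(row_id)
        return matches, inspected


index = ChainedHashIndex(num_buckets=8)
for row_id, age in enumerate([25, 33, 41, 25]):  # 25, 33 and 41 share a bucket
    index.insert(age, row_id)

print(index.probe(25))
```

Every chain entry visited during a probe is a dependent memory access, which is why long chains make probes expensive on a general-purpose core.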

10 How much time is spent indexing?
Measurement on Xeon 5670 CPU with HW counters
[Figure: bar chart of total execution time per query, labeled by index type (hash table or tree); OLTP: Payment, Order Status; DSS: Query 2, Query 9, Query 11, Query 17]
Indexing can account for up to 73% of execution time

11 Example: Indexing with Tree Traversals
SQL: SELECT A_Product, A_Customer FROM A WHERE A_age = 25
[Figure: index on A_age (Key, Tuple Ptr) pointing into table A (Customer, Age, Date, Product) to produce the Result]

12 Example: Indexing with Tree Traversals
SQL: SELECT A_Product, A_Customer FROM A WHERE A_age = 25
[Figure: index on A_age (Key)]
Each index traversal:
  10K-15K dynamic instructions: lots of pointer chasing
  50-60% memory references
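
The query on this slide can be sketched in software as a tree traversal that follows one pointer per level and then dereferences tuple pointers into the table. The example below uses a plain (unbalanced) binary search tree for brevity, whereas real indexes are usually balanced; the table contents and all function names are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int                       # indexed A_age value
    tuple_ptrs: list               # row positions in table A with this age
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, row_id):
    """Insert a (key, row_id) pair and return the (possibly new) root."""
    if root is None:
        return Node(key, [row_id])
    if key == root.key:
        root.tuple_ptrs.append(row_id)
    elif key < root.key:
        root.left = insert(root.left, key, row_id)
    else:
        root.right = insert(root.right, key, row_id)
    return root


def lookup(root, key):
    """Return (tuple pointers, nodes visited); each visit is a pointer chase."""
    visited, node = 0, root
    while node is not None:
        visited += 1
        if key == node.key:
            return node.tuple_ptrs, visited
        node = node.left if key < node.key else node.right
    return [], visited


# Hypothetical table A: (customer, age, date, product)
TABLE_A = [
    ("alice", 31, "2012-01-05", "laptop"),
    ("bob", 25, "2012-02-11", "phone"),
    ("carol", 40, "2012-03-02", "tablet"),
    ("dave", 25, "2012-03-09", "camera"),
]

index = None
for row_id, row in enumerate(TABLE_A):
    index = insert(index, row[1], row_id)


def select_product_customer(age):
    """SELECT A_Product, A_Customer FROM A WHERE A_age = age."""
    ptrs, visited = lookup(index, age)
    return [(TABLE_A[p][3], TABLE_A[p][0]) for p in ptrs], visited


print(select_product_customer(25))
```

Each loop iteration depends on the pointer loaded in the previous one, so the traversal cannot be overlapped; this is the pointer-chasing behavior the slide refers to.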

13 Outline
SoC Motivation
Big Data Challenges
Indexing in Databases
Indexing Widget
Results
Opportunities and Challenges

14 Indexing Widget Overview
Dedicated offload engine for index lookups
  Activated on demand by the core
  Full-service index lookup
  Core sleeps when widget runs
Widget features
  Efficient: specialized control and functional units
  Low-latency: caches frequently accessed index data
  Tightly integrated: uses the core's L1-D and TLB

15 Widget Details
[Figure: widget block diagram. From core: configuration registers (Index Addr., Key, Search Type, Result Table Addr., Data type); controller (FSM) with states init, tree, hash, eval, write, end; computational logic; buffer (SRAM). Steps: ❶ Configure, ❷ Run, ❸ Return]

16 Widget Details
[Figure: widget block diagram as on the previous slide, with step ❶ Configure highlighted]
  if (haswidget) {
    widget.index = &a;
    widget.key = &b;
    widget.type = equal;
    widget.result = &r;
    widget.data = int;
    widget.run();
  } else {
    Hashprobe();
  }

17 Widget Details
[Figure: widget block diagram with step ❷ Run highlighted; the controller and buffer exchange data to/from L1]

18 Widget Details
[Figure: widget block diagram with step ❸ Return highlighted; the widget stores (&Result Table, Key) entries to/from L1]
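
The configure, run and return steps of the Widget Details slides can be mimicked in software. The following is only a behavioral sketch under stated assumptions: the dataclass fields mirror the configuration registers named on the slides, the trace strings mirror the FSM state names, and the software fallback stands in for the slide's Hashprobe(). The class and function names are hypothetical, and nothing here models timing, the SRAM buffer or the L1 interface.

```python
from dataclasses import dataclass
from enum import Enum


class SearchType(Enum):
    EQUAL = "equal"


@dataclass
class WidgetConfig:
    """Software stand-ins for the configuration registers on the slides."""
    index: list                # "Index Addr.": list of buckets of (key, row_id)
    key: int                   # "Key"
    search_type: SearchType    # "Search Type"
    result: list               # "Result Table Addr.": output buffer
    data_type: type = int      # "Data type"


class IndexingWidgetModel:
    """Behavioral model of a hash-probe offload; records visited FSM states."""

    def __init__(self):
        self.trace = []

    def run(self, cfg):
        if not isinstance(cfg.key, cfg.data_type):
            raise TypeError("key does not match the configured data type")
        self.trace = ["init", "hash"]
        bucket = cfg.index[hash(cfg.key) % len(cfg.index)]
        for stored_key, row_id in bucket:
            self.trace.append("eval")
            if cfg.search_type is SearchType.EQUAL and stored_key == cfg.key:
                self.trace.append("write")
                cfg.result.append(row_id)
        self.trace.append("end")
        return cfg.result


def hash_probe(index, key):
    """Software fallback used when no widget is available."""
    bucket = index[hash(key) % len(index)]
    return [row_id for stored_key, row_id in bucket if stored_key == key]


def lookup(index, key, widget=None):
    """Offload to the widget if present, otherwise probe in software."""
    if widget is not None:
        return widget.run(WidgetConfig(index, key, SearchType.EQUAL, []))
    return hash_probe(index, key)


def build_index(pairs, num_buckets=4):
    index = [[] for _ in range(num_buckets)]
    for key, row_id in pairs:
        index[hash(key) % num_buckets].append((key, row_id))
    return index


index = build_index([(25, 0), (29, 1), (25, 2)])  # 25 and 29 share a bucket
widget = IndexingWidgetModel()
print(lookup(index, 25, widget), widget.trace)
```

Keeping the fallback path behind the same `lookup` function reflects the slide's `if (haswidget) ... else ...` pattern: callers do not need to know whether the offload engine exists.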

19 Widget Details
Indexing widget design options:
  1) Private L1 design
  2) Shared L2 design

20 Methodology
First-order analytical model
  Execution traces: Pin
  Execution profiling: VTune, OProfile
Benchmark applications
  OLTP: TPC-C on VoltDB
  DSS: TPC-H on MonetDB
Model parameters
  L1 / L2 / LLC / off-chip latency: 4 / 6 / 29 / 180 cycles
  Widget buffer: fully associative cache
Energy estimations
  McPAT
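
To make the idea of a first-order analytical model concrete, the sketch below computes an expected latency per memory reference from the latencies listed on this slide. It is not the authors' model: the service fractions are invented inputs, each latency is assumed to be the total cost of a reference served at that level, and the traversal estimate assumes fully serialized (dependent) accesses.

```python
# Latencies from the slide, in cycles.
LATENCY_CYCLES = {"L1": 4, "L2": 6, "LLC": 29, "off_chip": 180}


def average_access_cycles(service_fractions):
    """Expected cycles per reference, given the fraction served at each level."""
    total = sum(service_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("service fractions must sum to 1")
    return sum(LATENCY_CYCLES[level] * fraction
               for level, fraction in service_fractions.items())


def traversal_cycles(nodes_visited, service_fractions):
    """Memory cycles for a pointer-chasing traversal with no overlap."""
    return nodes_visited * average_access_cycles(service_fractions)


# Hypothetical mix: half of the node accesses hit in L1, half go off-chip.
mix = {"L1": 0.5, "off_chip": 0.5}
print(average_access_cycles(mix), traversal_cycles(10, mix))
```

Even with this crude model, a small off-chip fraction dominates the total, which is consistent with the talk's point that index lookups spend most of their time waiting for data.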

21 Performance with Indexing Widget
[Figure: speedup for Small Tree, Large Tree and Hash Table indexes; x-axis values 512B, 1KB, 2KB, 4KB, 8KB]
Need to accelerate both instructions and memory

22 Energy Efficiency with Indexing Widget
[Figure: reduction in energy (%) versus application coverage (%) for Qry 9, Qry 11, Order S., Qry 2, Payment and Qry 17; series: reduction over conventional OoO, reduction over ARM-like* OoO]
Up to 65% reduction in energy

23 Summary
Data explosion and business trends call for specialization
  SoC servers are a building block for improved efficiency
  Even greater rethinking of architectures needed
Exemplar data-centric workload: in-memory database
  Spends significant time in indexing
  Mostly pointer chasing: general-purpose CPUs are poorly suited
Augment CPU with indexing widget
  Dedicated offload engine: core sleeps when widget runs
  Improves efficiency: 65% less energy, 6x faster query execution

24 Opportunities and Challenges
Many other classes of data-centric workloads
  Unstructured text search
  Key-value caches
  Automatic document summarization
  Distributed array processing
But the market cannot support unlimited specialization
  Are there key specialization elements that are common?
  Can we design reusable, purpose-built architectures?
  How do we provide portable software interfaces?

25 Thank you

26 Invasive Tightly-Coupled Processor Arrays Frank Hannig Hardware/Software Co-Design Department of Computer Science University of Erlangen-Nuremberg, Germany

27 Motivation
Power wall: only a constant power budget is available per device
On the other hand, steady demand for more computing power
Consequence: energy per operation must decrease
These days, energy efficiency (OPS/Watt) is much more important than pure computing power (OPS)
This trend goes across all levels, from portable devices such as mobile phones to high-performance computing systems
OPS = Operations Per Second
Frank Hannig, Int. Workshop on Domain-Specific Multicore Computing, Nov. 8, 2012, San Jose, CA

28 Motivation cont'd
How can we achieve better energy efficiency?
  Heterogeneous architectures
  Multiple cores
  Domain-specific components, such as
    Customized instruction set extensions
    Accelerators (GPUs, FPGAs, dedicated hardware)
Customization and heterogeneity are the key to success for future performance gains

29 Overview
Tightly-coupled processor arrays
Invasive computing and invasive tightly-coupled processor arrays
Power management
Tools and application mapping
Conclusions

30 Tightly-coupled processor arrays (TCPAs)
Class of massively parallel on-chip processor architectures
Highly customizable at synthesis time
  Many parameters and configuration options
  E.g., number of processors, width of data path
Flexibility at run time
  Programmable, reconfigurable interconnect
Used as accelerators in MPSoC or tiled architectures for compute-intensive tasks from domains such as
  Digital signal processing, audio and video
  Image processing, object recognition
  Linear algebra (matrix / vector computations)
  Cryptography
  ... and other streaming applications

31 TCPAs cont'd
Processing elements (PEs)
  VLIW architecture
  Weakly programmable
  Small instruction set and memory
  Small register file
  No direct main memory access
Interconnect
  Switched connections
  Single-cycle latency
  Switching possibilities can be defined at synthesis time, which enables the configuration of different interconnect topologies at run time

32 TCPAs cont'd
Local memory structure
  Reconfigurable buffers at the borders of the array
  FIFOs at PE inputs
  Feedback shift registers for cyclic data reuse within PEs (enable efficient handling of loop-carried data dependencies)
Zero-overhead static control flow
  Multiway branch unit in each PE
  Combined with globally generated control predicates

33 Resource management and mapping
Traditional approach
  Small number of PEs
  Single application is running exclusively on the array
  Static mapping onto resources
Reality and future
  Beware of Moore's law
  Exponential growth
  Number of PEs doubles every 18 to 24 months

34 Challenges
Questions:
  How to map several applications with run-time-variant constraints simultaneously onto such architectures?
  How to manage hundreds to thousands of compute resources?
  How to save energy in phases when less computational power is needed?
  Can the architecture adapt to temporary (e.g., overheated regions) or permanent faults?

35 Invasive computing
In general, as defined by Teich:
  A distributed resource-aware computing methodology that gives each application the ability to explore and claim (invade) available computing resources, and copy its configuration code to such captured resources (infect), and then to execute the given program in parallel. After finishing a phase of execution, the application may free its previously occupied resources by performing a release operation (retreat).
Invasion phases on TCPAs
  1. Invade: the application explores the availability of resources in its neighborhood by sending invade messages
  2. Infect: the captured (invaded) PEs are configured with the application
  3. Retreat: each processing element frees its captured neighbors by sending retreat messages
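
The three phases can be illustrated with a small software emulation. This is a centralized toy model written for this transcript, not the decentralized, message-based hardware mechanism described in the talk: the `ProcessorArray` class, its method signatures and the breadth-first exploration order are all assumptions.

```python
from collections import deque


class ProcessorArray:
    """Toy model of a 2-D array of processing elements (PEs)."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.owner = {}    # (row, col) -> application that claimed the PE
        self.config = {}   # (row, col) -> program loaded onto the PE

    def _neighbors(self, pe):
        r, c = pe
        for nr, nc in ((r, c + 1), (r + 1, c), (r, c - 1), (r - 1, c)):
            if 0 <= nr < self.rows and 0 <= nc < self.cols:
                yield (nr, nc)

    def invade(self, app, seed, wanted):
        """Claim up to `wanted` free PEs reachable from `seed` via free PEs."""
        claimed, frontier, seen = [], deque([seed]), {seed}
        while frontier and len(claimed) < wanted:
            pe = frontier.popleft()
            if pe in self.owner:
                continue           # occupied: do not explore through it
            self.owner[pe] = app
            claimed.append(pe)
            for n in self._neighbors(pe):
                if n not in seen:
                    seen.add(n)
                    frontier.append(n)
        return claimed

    def infect(self, app, claimed, program):
        """Load the application's program onto the PEs it has claimed."""
        for pe in claimed:
            if self.owner.get(pe) != app:
                raise ValueError(f"{pe} is not owned by {app}")
            self.config[pe] = program

    def retreat(self, app):
        """Release every PE held by the application."""
        for pe in [p for p, a in self.owner.items() if a == app]:
            del self.owner[pe]
            self.config.pop(pe, None)


array = ProcessorArray(rows=2, cols=3)
pes_a = array.invade("A", seed=(0, 0), wanted=4)
array.infect("A", pes_a, program="fir_filter")
pes_b = array.invade("B", seed=(1, 1), wanted=5)   # gets only what is left
array.retreat("A")
print(pes_a, pes_b)
```

Note that an invade request may return fewer PEs than requested; the application then has to adapt to what it actually obtained, which is the resource-aware aspect of the methodology.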

36 Invasive TCPAs
Each PE is equipped with an invasion controller (ictrl) to enable fast and decentralized resource management
Design choices
  Different resource exploration methods (invasion strategies):
    Linearly connected region of PEs
    Rectangular connected region of PEs
  Different designs for the invasion controller:
    FSM-based
    Programmable

37 Invasion strategies
Linear invasion, different policies
  Straight policy: capturing of PEs in a straight-walk fashion with maximal length of walks
  Random policy: selection of an available neighbor in a random-walk fashion
  Meander policy: capturing of PEs in a meander-walk fashion
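
The three linear invasion policies can be sketched as walks over a grid of processing elements (PEs). The functions below are one possible reading of the slide, written for illustration: the straight walk continues in one direction until blocked, the random walk picks any free neighbor, and the meander walk snakes row by row. Grid coordinates, function names and the seeded random generator are assumptions.

```python
import random

DIRECTIONS = ((0, 1), (1, 0), (0, -1), (-1, 0))  # east, south, west, north


def is_free(pe, rows, cols, taken):
    r, c = pe
    return 0 <= r < rows and 0 <= c < cols and pe not in taken


def straight_walk(start, direction, wanted, rows, cols, occupied=()):
    """Capture PEs along one direction until blocked or satisfied."""
    taken, path, pe = set(occupied), [], start
    while len(path) < wanted and is_free(pe, rows, cols, taken):
        path.append(pe)
        taken.add(pe)
        pe = (pe[0] + direction[0], pe[1] + direction[1])
    return path


def random_walk(start, wanted, rows, cols, occupied=(), seed=0):
    """Capture PEs by repeatedly stepping to a randomly chosen free neighbor."""
    rng = random.Random(seed)
    taken, path, pe = set(occupied), [], start
    while len(path) < wanted and is_free(pe, rows, cols, taken):
        path.append(pe)
        taken.add(pe)
        options = [(pe[0] + dr, pe[1] + dc) for dr, dc in DIRECTIONS]
        options = [n for n in options if is_free(n, rows, cols, taken)]
        if not options:
            break
        pe = rng.choice(options)
    return path


def meander_walk(start, wanted, rows, cols, occupied=()):
    """Capture PEs row by row, reversing direction at the end of each row."""
    taken, path = set(occupied), []
    (r, c), step = start, 1
    while len(path) < wanted and is_free((r, c), rows, cols, taken):
        path.append((r, c))
        taken.add((r, c))
        if is_free((r, c + step), rows, cols, taken):
            c += step
        else:
            r, step = r + 1, -step
    return path


print(straight_walk((0, 0), (0, 1), 5, 3, 3))
print(meander_walk((0, 0), 5, 3, 3))
```

On a 3x3 array the straight walk stops at the array border after three PEs, while the meander walk wraps into the next row and reaches the requested five, keeping the claimed region compact.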

38 Invasion strategies cont'd
Rectangular invasion
  Linear invasion in one dimension
  Combined with parallel invasion in the second dimension
Results
  Linear invasion takes only 2 cycles per PE; due to parallel execution, rectangular invasion is even faster
  The meander policy for linear invasion is preferable since it fragments the array least
  Compared with centralized software-based resource management, resource exploration and reservation can be achieved 50 times faster
  Moderate hardware cost of invasion controller ( LUTs in Virtex6 FPGA technology)

39 Power management in TCPAs
Invasive computing as an enabler for power management
How to save static power?
  Power on invaded PEs
  Power off idle PEs
Hierarchical power management strategy
  Separate power domains for:
    Invasion controller (first level of power management)
    Processing element (second level of power management)

40 Hierarchical power management
[Figure: invade phase over a row of PEs, each with a PMU and an invasion controller (Inv. Ctrl); labels: t(p_ON) delay, t(p_OFF) delay, Powered ON, Powered OFF, Invade, Claim]
Retreat phase works similarly
PMU: Power Management Unit

41 Implementation options
Multiple invasion controllers are grouped into a single power domain to reduce area and invasion time
[Figure: 8 PMUs serving 8 invasion controllers]

42 Implementation options cont'd
[Figure: 4 PMUs serving 8 invasion controllers]

43 Implementation options cont'd
[Figure: 2 PMUs serving 8 invasion controllers]

44 Power management results
Depending on array utilization, a reduction of leakage power between 30% and 90% can be achieved
Trade-off: power, area, and invasion latency per PE for different power domain groupings
[Figure: bubble chart of area (million NAND2 eq. gates) versus avg. invasion latency / PE (# cycles); bubble size and color indicate power; series: different numbers of Inv. Ctrl per domain (including 2 and 8) and a design without power gating]
Experimental setup:
  10x12 TCPA
  TSMC-65nm-LP
  Synopsys tool chain
  PVT corner: fast_1.32v_125c [max leakage corner]
  Post-synthesis results

45 Tool suite
Integrated development environment
Graphical design entry, i.e., parameterization of hundreds of design options; alternatively, a powerful ADL (architecture description language) can be used
Push-button generation of
  Fast cycle-accurate simulator, can be wrapped into SystemC for co-simulation
  Synthesizable HDL code
  Configuration streams
Structural assembly code entry and intuitive (graphical) interconnect setup
Back annotation and visualization of simulation results offer rich debugging possibilities

46 Application mapping
High-level programming environment LoopInvader
Input: DSL (domain-specific language) PAULA
  Functional language
  Domain of multi-dimensional affine recurrence equations defined over polyhedral iteration spaces
  Multi-dimensional reductions
[Figure: compilation flow with the stages Algorithm (DSL PAULA); High-Level Transformations (Localization, Loop Perfectization, Output Normal Form, Loop Unrolling, Partitioning, Expression Splitting, Affine Transformations, ...); Allocation; Space-Time Mapping (Scheduling, Resource Binding); Code Generation (VLIW code for each PE, configuration of interconnect, code of controller); TCPA Configuration; Simulation (architecture model) or TCPA]

47 Application mapping cont'd
DSL PAULA example, discrete 2-D Gauss window filter
  ...
  par (x >= 0 and x < 1280 and y >= 0 and y < 1024) {
    w[0,0]=1; w[0,1]=2; w[0,2]=1;
    w[1,0]=2; w[1,1]=4; w[1,2]=2;
    w[2,0]=1; w[2,1]=2; w[2,2]=1;
  }
  h[x,y] = SUM[i>=0 and i<=2 and j>=0 and j<=2] (pic_in[x+i,y+j] * w[i,j]);
  pic_out[x,y] = h[x,y] >> 4; // divided by 16
Corresponding equation:
  $\mathrm{pic\_out}[x,y] = \left\lfloor \frac{1}{16} \sum_{i=0}^{2} \sum_{j=0}^{2} \mathrm{pic\_in}[x+i,\,y+j] \cdot w[i,j] \right\rfloor$
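
For readers unfamiliar with PAULA, the same 3x3 Gauss window filter can be written as an ordinary sequential loop nest. The Python sketch below mirrors the recurrence from the slide (weighted sum, then a right shift by 4); the function name, the row-major list-of-lists image layout and the choice to produce an output two pixels smaller in each dimension (no border handling) are assumptions of this example.

```python
# 3x3 Gauss window weights from the slide; they sum to 16.
W = [
    [1, 2, 1],
    [2, 4, 2],
    [1, 2, 1],
]


def gauss3x3(pic_in):
    """Apply the filter to a row-major image given as pic_in[y][x]."""
    height = len(pic_in) - 2
    width = len(pic_in[0]) - 2
    pic_out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # h[x,y] = SUM over i, j in 0..2 of pic_in[x+i, y+j] * w[i,j]
            h = sum(pic_in[y + j][x + i] * W[i][j]
                    for i in range(3) for j in range(3))
            pic_out[y][x] = h >> 4   # divided by 16
    return pic_out


flat = [[16] * 4 for _ in range(4)]
impulse = [[0, 0, 0], [0, 16, 0], [0, 0, 0]]
print(gauss3x3(flat), gauss3x3(impulse))
```

A constant image passes through unchanged because the weights sum to 16, and a single bright pixel is scaled by the center weight 4/16.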

48 Application mapping cont'd
Loop partitioning is used for parallelization and mapping
[Figure: dependence graph mapped onto the architecture]
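
One simple way to picture loop partitioning is to cut a 2-D iteration space into rectangular tiles and assign one tile to each processing element. The sketch below shows only this idea; it is not the partitioning scheme used by the tool flow in the talk, and the function name, the equal-size tiling and the 4x4 array size are assumptions.

```python
def partition(width, height, pe_cols, pe_rows):
    """Map each PE (row, col) to the x and y iteration ranges it executes."""
    tile_w = -(-width // pe_cols)    # ceiling division
    tile_h = -(-height // pe_rows)
    tiles = {}
    for pr in range(pe_rows):
        for pc in range(pe_cols):
            x0, y0 = pc * tile_w, pr * tile_h
            x1, y1 = min(x0 + tile_w, width), min(y0 + tile_h, height)
            tiles[(pr, pc)] = (range(x0, x1), range(y0, y1))
    return tiles


# The 1280x1024 iteration space of the Gauss filter example on a 4x4 array.
tiles = partition(1280, 1024, pe_cols=4, pe_rows=4)
xs, ys = tiles[(0, 0)]
print(len(xs), len(ys), len(tiles))
```

Each PE then runs the loop body sequentially over its own tile while all PEs work in parallel; tiles at the array border are clipped when the iteration space does not divide evenly.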

49 Application mapping cont'd
Current research: symbolic loop parallelization and mapping
  Symbolic partitioning
  Symbolic scheduling
  Symbolic control generation
At run time, only the symbols in the configuration data are replaced according to the available resources
Algorithms can be adapted quickly at run time without recompilation
Self-adaptation enables fast reaction to QoS parameters, system load, failures, etc.

50 Conclusions
Invasive Tightly-Coupled Processor Arrays: a domain-specific remedy for the arising challenges in multi-processor architectures
Hardware support for ultra-fast resource reservation
Invasive computing as an enabler for power management
Careful co-design of domain-specific architectures, domain-specific languages, and mapping tools is needed in order to achieve energy-efficient solutions, while simultaneously ensuring high productivity (time-to-market) as well as portability and adaptability

51 Further information
Invasive Tightly-Coupled Processor Arrays
Contact
  Frank Hannig
  Hardware/Software Co-Design, Department of CS
  University of Erlangen-Nuremberg
  Cauerstr. 11, Erlangen, Germany
  Phone:
Acknowledgements
  Contributors (in alphabetical order): Srinivas Boppu, Frank Hannig, Dmitrij Kissler, Alexej Kupriyanov, Vahid Lari, Shravan Muddasani, Alex Tanase, Jürgen Teich
  This work is supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre Invasive Computing (SFB/TR 89)

52 References
[1] D. Kissler, F. Hannig, A. Kupriyanov, and J. Teich, A highly parameterizable parallel processor array architecture, in Proc. of the IEEE Int. Conf. on Field Programmable Technology (FPT), pp , Bangkok, Thailand, Dec
[2] J. Teich, Invasive Algorithms and Architectures, it - Information Technology, 50 (2008), pp ,
[3] J. Teich, J. Henkel, A. Herkersdorf, D. Schmitt-Landsiedel, W. Schröder-Preikschat, and G. Snelting, Invasive Computing: An Overview, in Multiprocessor System-on-Chip: Hardware Design and Tool Integration, Springer, Berlin, 2011, pp
[4] V. Lari, A. Narovlyanskyy, F. Hannig, and J. Teich, Decentralized Dynamic Resource Management Support for Massively Parallel Processor Arrays, in Proc. of the 22nd IEEE Int. Conf. on Application-specific Systems, Architectures, and Processors (ASAP), pp , Santa Monica, CA, USA, Sep
[5] D. Kissler, D. Gran, Z. Salcic, F. Hannig, and J. Teich, Scalable Many-Domain Power Gating in Coarse-grained Reconfigurable Processor Arrays, IEEE Embedded Systems Letters, 3(2):58-61,
[6] V. Lari, S. Muddasani, S. Boppu, F. Hannig, M. Schmid, and J. Teich, Hierarchical Power Management for Adaptive Tightly-Coupled Processor Arrays, to appear in ACM Transactions on Design Automation of Electronic Systems (TODAES),


More information

Was ist dran an einer spezialisierten Data Warehousing platform?

Was ist dran an einer spezialisierten Data Warehousing platform? Was ist dran an einer spezialisierten Data Warehousing platform? Hermann Bär Oracle USA Redwood Shores, CA Schlüsselworte Data warehousing, Exadata, specialized hardware proprietary hardware Introduction

More information

TECHNICAL OVERVIEW ACCELERATED COMPUTING AND THE DEMOCRATIZATION OF SUPERCOMPUTING

TECHNICAL OVERVIEW ACCELERATED COMPUTING AND THE DEMOCRATIZATION OF SUPERCOMPUTING TECHNICAL OVERVIEW ACCELERATED COMPUTING AND THE DEMOCRATIZATION OF SUPERCOMPUTING Table of Contents: The Accelerated Data Center Optimizing Data Center Productivity Same Throughput with Fewer Server Nodes

More information

FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs

FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs 1/29 FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs Nachiket Kapre + Tushar Krishna nachiket@uwaterloo.ca, tushar@ece.gatech.edu 2/29 Claim FPGA overlay NoCs

More information

Architecture-Conscious Database Systems

Architecture-Conscious Database Systems Architecture-Conscious Database Systems 2009 VLDB Summer School Shanghai Peter Boncz (CWI) Sources Thank You! l l l l Database Architectures for New Hardware VLDB 2004 tutorial, Anastassia Ailamaki Query

More information

2. TOPOLOGICAL PATTERN ANALYSIS

2. TOPOLOGICAL PATTERN ANALYSIS Methodology for analyzing and quantifying design style changes and complexity using topological patterns Jason P. Cain a, Ya-Chieh Lai b, Frank Gennari b, Jason Sweis b a Advanced Micro Devices, 7171 Southwest

More information

The Challenges of System Design. Raising Performance and Reducing Power Consumption

The Challenges of System Design. Raising Performance and Reducing Power Consumption The Challenges of System Design Raising Performance and Reducing Power Consumption 1 Agenda The key challenges Visibility for software optimisation Efficiency for improved PPA 2 Product Challenge - Software

More information

A Novel Deadlock Avoidance Algorithm and Its Hardware Implementation

A Novel Deadlock Avoidance Algorithm and Its Hardware Implementation A ovel Deadlock Avoidance Algorithm and Its Hardware Implementation + Jaehwan Lee and *Vincent* J. Mooney III Hardware/Software RTOS Group Center for Research on Embedded Systems and Technology (CREST)

More information

! Readings! ! Room-level, on-chip! vs.!

! Readings! ! Room-level, on-chip! vs.! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads

More information

Scaling Datacenter Accelerators With Compute-Reuse Architectures

Scaling Datacenter Accelerators With Compute-Reuse Architectures Scaling Datacenter Accelerators With Compute-Reuse Architectures Adi Fuchs and David Wentzlaff ISCA 2018 Session 5A June 5, 2018 Los Angeles, CA Scaling Datacenter Accelerators With Compute-Reuse Architectures

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

The Future of High Performance Computing

The Future of High Performance Computing The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer

More information

NoC Simulation in Heterogeneous Architectures for PGAS Programming Model

NoC Simulation in Heterogeneous Architectures for PGAS Programming Model NoC Simulation in Heterogeneous Architectures for PGAS Programming Model Sascha Roloff, Andreas Weichslgartner, Frank Hannig, Jürgen Teich University of Erlangen-Nuremberg, Germany Jan Heißwolf Karlsruhe

More information

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014 Profiling and Debugging OpenCL Applications with ARM Development Tools October 2014 1 Agenda 1. Introduction to GPU Compute 2. ARM Development Solutions 3. Mali GPU Architecture 4. Using ARM DS-5 Streamline

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS INSTRUCTOR: Dr. MUHAMMAD SHAABAN PRESENTED BY: MOHIT SATHAWANE AKSHAY YEMBARWAR WHAT IS MULTICORE SYSTEMS? Multi-core processor architecture means placing

More information

Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card

Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card The Rise of MongoDB Summary One of today s growing database

More information

SPARK: A Parallelizing High-Level Synthesis Framework

SPARK: A Parallelizing High-Level Synthesis Framework SPARK: A Parallelizing High-Level Synthesis Framework Sumit Gupta Rajesh Gupta, Nikil Dutt, Alex Nicolau Center for Embedded Computer Systems University of California, Irvine and San Diego http://www.cecs.uci.edu/~spark

More information

Near Memory Key/Value Lookup Acceleration MemSys 2017

Near Memory Key/Value Lookup Acceleration MemSys 2017 Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy

More information

ESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder

ESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder ESE532: System-on-a-Chip Architecture Day 8: September 26, 2018 Spatial Computations Today Graph Cycles (from Day 7) Accelerator Pipelines FPGAs Zynq Computational Capacity 1 2 Message Custom accelerators

More information

Efficient Hardware Acceleration on SoC- FPGA using OpenCL

Efficient Hardware Acceleration on SoC- FPGA using OpenCL Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA

More information

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management International Journal of Computer Theory and Engineering, Vol., No., December 01 Effective Memory Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management Sultan Daud Khan, Member,

More information

Chapter 8: Memory-Management Strategies

Chapter 8: Memory-Management Strategies Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

HyPer-sonic Combined Transaction AND Query Processing

HyPer-sonic Combined Transaction AND Query Processing HyPer-sonic Combined Transaction AND Query Processing Thomas Neumann Technische Universität München October 26, 2011 Motivation - OLTP vs. OLAP OLTP and OLAP have very different requirements OLTP high

More information

Reconfigurable Computing. Design and Implementation. Chapter 4.1

Reconfigurable Computing. Design and Implementation. Chapter 4.1 Design and Implementation Chapter 4.1 Prof. Dr.-Ing. Jürgen Teich Lehrstuhl für Hardware-Software-Co-Design In System Integration System Integration Rapid Prototyping Reconfigurable devices (RD) are usually

More information

Understanding and Improving the Cost of Scaling Distributed Event Processing

Understanding and Improving the Cost of Scaling Distributed Event Processing Understanding and Improving the Cost of Scaling Distributed Event Processing Shoaib Akram, Manolis Marazakis, and Angelos Bilas shbakram@ics.forth.gr Foundation for Research and Technology Hellas (FORTH)

More information

Advanced RDMA-based Admission Control for Modern Data-Centers

Advanced RDMA-based Admission Control for Modern Data-Centers Advanced RDMA-based Admission Control for Modern Data-Centers Ping Lai Sundeep Narravula Karthikeyan Vaidyanathan Dhabaleswar. K. Panda Computer Science & Engineering Department Ohio State University Outline

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

Chapter 8: Main Memory

Chapter 8: Main Memory Chapter 8: Main Memory Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and 64-bit Architectures Example:

More information

When, Where & Why to Use NoSQL?

When, Where & Why to Use NoSQL? When, Where & Why to Use NoSQL? 1 Big data is becoming a big challenge for enterprises. Many organizations have built environments for transactional data with Relational Database Management Systems (RDBMS),

More information

Computer Architecture. What is it?

Computer Architecture. What is it? Computer Architecture Venkatesh Akella EEC 270 Winter 2005 What is it? EEC270 Computer Architecture Basically a story of unprecedented improvement $1K buys you a machine that was 1-5 million dollars a

More information

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator

Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid

More information

CHAPTER 8: MEMORY MANAGEMENT. By I-Chen Lin Textbook: Operating System Concepts 9th Ed.

CHAPTER 8: MEMORY MANAGEMENT. By I-Chen Lin Textbook: Operating System Concepts 9th Ed. CHAPTER 8: MEMORY MANAGEMENT By I-Chen Lin Textbook: Operating System Concepts 9th Ed. Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the

More information

PRESENTATION TITLE GOES HERE. Understanding Architectural Trade-offs in Object Storage Technologies

PRESENTATION TITLE GOES HERE. Understanding Architectural Trade-offs in Object Storage Technologies Object Storage 201 PRESENTATION TITLE GOES HERE Understanding Architectural Trade-offs in Object Storage Technologies SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA

More information

The Bifrost GPU architecture and the ARM Mali-G71 GPU

The Bifrost GPU architecture and the ARM Mali-G71 GPU The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP ARM licenses Soft IP cores (amongst other things) to our

More information

M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle

M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki An Ultra High Performance Scalable DSP Family for Multimedia Hot Chips 17 August 2005 Stanford, CA Erik Machnicki Media Processing Challenges Increasing performance requirements Need for flexibility &

More information

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

Automatic Scaling Iterative Computations. Aug. 7 th, 2012 Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics

More information

Big Data It s not just for Google Any More

Big Data It s not just for Google Any More Big Data It s not just for Google Any More The Software and Compelling Economics of Big Data Computing EXECUTIVE SUMMARY Big Data holds out the promise of providing businesses with differentiated competitive

More information

Instruction Encoding Synthesis For Architecture Exploration

Instruction Encoding Synthesis For Architecture Exploration Instruction Encoding Synthesis For Architecture Exploration "Compiler Optimizations for Code Density of Variable Length Instructions", "Heuristics for Greedy Transport Triggered Architecture Interconnect

More information

Parallelism. CS6787 Lecture 8 Fall 2017

Parallelism. CS6787 Lecture 8 Fall 2017 Parallelism CS6787 Lecture 8 Fall 2017 So far We ve been talking about algorithms We ve been talking about ways to optimize their parameters But we haven t talked about the underlying hardware How does

More information

Cover TBD. intel Quartus prime Design software

Cover TBD. intel Quartus prime Design software Cover TBD intel Quartus prime Design software Fastest Path to Your Design The Intel Quartus Prime software is revolutionary in performance and productivity for FPGA, CPLD, and SoC designs, providing a

More information

AMD Disaggregates the Server, Defines New Hyperscale Building Block

AMD Disaggregates the Server, Defines New Hyperscale Building Block AMD Disaggregates the Server, Defines New Hyperscale Building Block Fabric Based Architecture Enables Next Generation Data Center Optimization Executive Summary AMD SeaMicro s disaggregated server enables

More information

The Design and Implementation of a Low-Latency On-Chip Network

The Design and Implementation of a Low-Latency On-Chip Network The Design and Implementation of a Low-Latency On-Chip Network Robert Mullins 11 th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 24-27 th, 2006, Yokohama, Japan. Introduction Current

More information

designing a GPU Computing Solution

designing a GPU Computing Solution designing a GPU Computing Solution Patrick Van Reeth EMEA HPC Competency Center - GPU Computing Solutions Saturday, May the 29th, 2010 1 2010 Hewlett-Packard Development Company, L.P. The information contained

More information

Storage Networking Strategy for the Next Five Years

Storage Networking Strategy for the Next Five Years White Paper Storage Networking Strategy for the Next Five Years 2018 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public Information. Page 1 of 8 Top considerations for storage

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequenecy Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

DCBench: a Data Center Benchmark Suite

DCBench: a Data Center Benchmark Suite DCBench: a Data Center Benchmark Suite Zhen Jia ( 贾禛 ) http://prof.ict.ac.cn/zhenjia/ Institute of Computing Technology, Chinese Academy of Sciences workshop in conjunction with CCF October 31,2013,Guilin

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

Decentralized Distributed Storage System for Big Data

Decentralized Distributed Storage System for Big Data Decentralized Distributed Storage System for Big Presenter: Wei Xie -Intensive Scalable Computing Laboratory(DISCL) Computer Science Department Texas Tech University Outline Trends in Big and Cloud Storage

More information

Configurable String Matching Hardware for Speeding up Intrusion Detection

Configurable String Matching Hardware for Speeding up Intrusion Detection Configurable String Matching Hardware for Speeding up Intrusion Detection Monther Aldwairi, Thomas Conte, Paul Franzon Dec 6, 2004 North Carolina State University {mmaldwai, conte, paulf}@ncsu.edu www.ece.ncsu.edu/erl

More information

Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA

Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089

More information

Design Space Exploration and Application Autotuning for Runtime Adaptivity in Multicore Architectures

Design Space Exploration and Application Autotuning for Runtime Adaptivity in Multicore Architectures Design Space Exploration and Application Autotuning for Runtime Adaptivity in Multicore Architectures Cristina Silvano Politecnico di Milano cristina.silvano@polimi.it Outline Research challenges in multicore

More information

Software and Tools for HPE s The Machine Project

Software and Tools for HPE s The Machine Project Labs Software and Tools for HPE s The Machine Project Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric

More information

OCP Engineering Workshop - Telco

OCP Engineering Workshop - Telco OCP Engineering Workshop - Telco Low Latency Mobile Edge Computing Trevor Hiatt Product Management, IDT IDT Company Overview Founded 1980 Workforce Approximately 1,800 employees Headquarters San Jose,

More information

Hybrid Implementation of 3D Kirchhoff Migration

Hybrid Implementation of 3D Kirchhoff Migration Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation

More information

Low-Power Processor Solutions for Always-on Devices

Low-Power Processor Solutions for Always-on Devices Low-Power Processor Solutions for Always-on Devices Pieter van der Wolf MPSoC 2014 July 7 11, 2014 2014 Synopsys, Inc. All rights reserved. 1 Always-on Mobile Devices Mobile devices on the move Mobile

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors

More information

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow Abstract: High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture

More information

ScaleArc for SQL Server

ScaleArc for SQL Server Solution Brief ScaleArc for SQL Server Overview Organizations around the world depend on SQL Server for their revenuegenerating, customer-facing applications, running their most business-critical operations

More information

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh.

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh. Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization Dmitry I. Lyakh liakhdi@ornl.gov This research used resources of the Oak Ridge Leadership Computing Facility at the

More information

Chapter 8: Main Memory. Operating System Concepts 9 th Edition

Chapter 8: Main Memory. Operating System Concepts 9 th Edition Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel

More information

Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches

Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches Nikos Hardavellas Michael Ferdman, Babak Falsafi, Anastasia Ailamaki Carnegie Mellon and EPFL Data Placement in Distributed

More information

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight ESG Lab Review InterSystems Data Platform: A Unified, Efficient Data Platform for Fast Business Insight Date: April 218 Author: Kerry Dolan, Senior IT Validation Analyst Abstract Enterprise Strategy Group

More information

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide

More information

OLAP Introduction and Overview

OLAP Introduction and Overview 1 CHAPTER 1 OLAP Introduction and Overview What Is OLAP? 1 Data Storage and Access 1 Benefits of OLAP 2 What Is a Cube? 2 Understanding the Cube Structure 3 What Is SAS OLAP Server? 3 About Cube Metadata

More information