Domain-specific Architectures for Emerging Data-Centric Workloads
Transcription
1 Domain-specific Architectures for Emerging Data-Centric Workloads Kevin Lim November 8, 2012 Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
2 Hyperscale computing driving need for efficiency. Web-scale infrastructures with 10,000s of servers. At current trends, large-scale systems are expected to consume >100 MW in 2018. Scale-out software developed to meet data needs and leverage infrastructure: NoSQL (HBase, Cassandra, Couchbase, MongoDB); distributed application frameworks (Hadoop, Hive, Pregel); cloud (Amazon Web Services, Azure). For many important workloads, data must be delivered at very low latency: "Everything runs from memory in Web 2.0" (Evan Weaver); in-memory databases, web search indices; in-memory distributed caching. [Figure: software stack (NoSQL, distributed application framework, cloud) and datacenter photos labeled Prineville, OR and The Dalles, OR]
3 SoC-based servers: reaching a tipping point. Aggressive SoCs are largely absent in general-purpose CPUs. Integration trend: fairly slow over time (caches, memory controller, northbridge, GPUs). Today: mounting cost and energy pressure in datacenters is accelerating the need for SoCs. Energy: fewer pin crossings, more efficient (heterogeneous) blocks. Cost: fewer sockets, BoM, etc.
4 SoCs enable high-density computing. Example: HP's Project Moonshot Redstone platform with Calxeda EnergyCore (quad-core SoC): 4 SoCs/board x 18 boards = 72 SoCs in 1U, 2880 servers per rack. Just the beginning of ultra-dense servers: multiple SoCs per server node (ensembles); dense tray-level aggregation, shared everything (I/O, management); optimized in cost, space, and power for scale-out.
5 But data-centric challenges still remain. Drive towards low latency: time is money! Data explosion: data is growing faster than technology. General-purpose CPUs are not the answer: they are ill-matched to server workloads and spend most of the time waiting for data rather than computing. Opportunity to specialize for data-centric workloads.
6 How do data-centric workloads access data? Exemplar: in-memory databases. Databases create and use an index: data structures for fast data lookup, most often a balanced tree or a hash table, and frequently accessed. [Figure: hash table and tree index structures]
7 SoC specialization: Database Indexing Widget. Index lookups on general-purpose CPUs are pointer-intensive and time-intensive, with low IPC (as low as 0.25 on an OoO core) and poor energy efficiency. Database Indexing Widget: dedicated hardware for database index lookups; full-service offload (the core sleeps when the widget runs); up to 65% less energy per query.
8 Outline: SoC Motivation; Big Data Challenges; Indexing in Databases; Indexing Widget; Results; Opportunities and Challenges.
9 Modern databases and indexing. Two types of contemporary in-memory databases: column-store analytical processing (with DSS) and scale-out transaction processing (with OLTP). Two fundamental indexing operations: hash table probe and tree traversal. [Figure: tables with Customer, Date, and Product columns for each database type]
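The hash table probe named on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the chained-bucket layout, the bucket count, the sample rows, and the names `build_hash_index` and `hash_probe` are hypothetical and do not come from the talk.

```python
# Minimal chained hash index: each bucket holds (key, tuple_id) pairs.
# NUM_BUCKETS, build_hash_index and hash_probe are hypothetical names.
NUM_BUCKETS = 8


def build_hash_index(rows, key_column):
    """Build an index mapping a column value to the positions of matching rows."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for tuple_id, row in enumerate(rows):
        key = row[key_column]
        buckets[hash(key) % NUM_BUCKETS].append((key, tuple_id))
    return buckets


def hash_probe(buckets, key):
    """Return the tuple ids whose key matches; only one bucket chain is walked."""
    chain = buckets[hash(key) % NUM_BUCKETS]
    return [tuple_id for stored_key, tuple_id in chain if stored_key == key]


rows = [
    {"customer": "ann", "date": "2012-11-01", "product": "disk"},
    {"customer": "bob", "date": "2012-11-02", "product": "cpu"},
    {"customer": "ann", "date": "2012-11-03", "product": "ram"},
]
index = build_hash_index(rows, "customer")
matches = hash_probe(index, "ann")
```

A probe touches one bucket and then follows its chain entry by entry, which is the dependent-load pattern the later slides describe as pointer chasing.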
10 How much time is spent indexing? Measurement on Xeon 5670 CPU with hardware counters. [Figure: share of total execution time spent indexing for the OLTP queries Payment and Order Status and the DSS queries Query 2, Query 9, Query 11, and Query 17; bars are labeled by index type (hash table or tree)] Indexing can account for up to 73% of execution.
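To see what the indexing share implies for a whole query, Amdahl's law gives a first estimate. The sketch below takes the 73% upper bound from the slide; the indexing speedup factors are illustrative assumptions rather than measured results, and `overall_speedup` is a hypothetical helper name.

```python
def overall_speedup(indexing_fraction, indexing_speedup):
    """Amdahl's law: only the indexing share of the run time gets faster."""
    remaining = (1.0 - indexing_fraction) + indexing_fraction / indexing_speedup
    return 1.0 / remaining


# 0.73 is the slide's upper bound on the indexing share; the speedup
# factors below are illustrative assumptions, not measured results.
for factor in (2.0, 4.0, 8.0):
    gain = overall_speedup(0.73, factor)
    print(f"indexing {factor:.0f}x faster -> query {gain:.2f}x faster")
```

Queries with a smaller indexing share benefit less from the same accelerator, which is one reason the share matters when choosing what to specialize.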
11 Example: Indexing with Tree Traversals. SQL: SELECT A_Product, A_Customer FROM A WHERE A_age = 25. Index on A_age. [Figure: tree index whose entries hold a key and a tuple pointer into a table with Customer, Age, Date, and Product columns, leading to the result]
12 Example: Indexing with Tree Traversals. SQL: SELECT A_Product, A_Customer FROM A WHERE A_age = 25. Index on A_age. Each index traversal takes 10K-15K dynamic instructions with lots of pointer chasing; 50-60% are memory references.
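The query on this slide can be traced against a small tree index in software. The sketch below is a simplification under assumptions: it uses a plain (unbalanced) binary search tree instead of a balanced tree, and the sample rows, `Node`, `insert`, and `lookup` are hypothetical. It counts node visits to make the pointer chasing visible.

```python
class Node:
    """One index entry: a key plus pointers (row positions) into table A."""

    def __init__(self, key, tuple_ptrs):
        self.key = key
        self.tuple_ptrs = tuple_ptrs
        self.left = None
        self.right = None


def insert(root, key, tuple_ptr):
    """Insert into a plain binary search tree (no rebalancing in this sketch)."""
    if root is None:
        return Node(key, [tuple_ptr])
    if key == root.key:
        root.tuple_ptrs.append(tuple_ptr)
    elif key < root.key:
        root.left = insert(root.left, key, tuple_ptr)
    else:
        root.right = insert(root.right, key, tuple_ptr)
    return root


def lookup(root, key):
    """Return (tuple pointers, nodes visited); each hop is a dependent load."""
    hops = 0
    node = root
    while node is not None:
        hops += 1
        if key == node.key:
            return node.tuple_ptrs, hops
        node = node.left if key < node.key else node.right
    return [], hops


table_a = [
    {"A_customer": "ann", "A_age": 31, "A_product": "disk"},
    {"A_customer": "bob", "A_age": 25, "A_product": "cpu"},
    {"A_customer": "cyd", "A_age": 47, "A_product": "ram"},
    {"A_customer": "dee", "A_age": 25, "A_product": "nic"},
]
root = None
for position, row in enumerate(table_a):
    root = insert(root, row["A_age"], position)

# SELECT A_Product, A_Customer FROM A WHERE A_age = 25
ptrs, hops = lookup(root, 25)
result = [(table_a[p]["A_product"], table_a[p]["A_customer"]) for p in ptrs]
```

Each iteration of the loop in `lookup` needs the previous node before it can fetch the next one, so the loads cannot overlap; that serialization is what keeps IPC low on a general-purpose core.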
13 Outline: SoC Motivation; Big Data Challenges; Indexing in Databases; Indexing Widget; Results; Opportunities and Challenges.
14 Indexing Widget Overview. Dedicated offload engine for index lookups: activated on demand by the core; full-service index lookup; the core sleeps when the widget runs. Widget features: efficient (specialized control and functional units); low-latency (caches frequently accessed index data); tightly integrated (uses the core's L1-D and TLB).
15 Widget Details. [Figure: widget block diagram. Configuration registers written from the core: index address, key, search type, result table address, data type. Controller (FSM) with states init, tree, hash, eval, write, end. Computational logic. Buffer (SRAM). Steps: ❶ Configure, ❷ Run, ❸ Return]
16 Widget Details: ❶ Configure. [Figure: same block diagram with the configuration registers highlighted] Code shown on the slide:
    if (haswidget) {
        widget.index = &a;
        widget.key = &b;
        widget.type = equal;
        widget.result = &r;
        widget.data = int;
        widget.run();
    } else {
        Hashprobe();
    }
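The configure-then-run pattern in the slide's pseudocode can be mimicked with a small software model. This is a hypothetical stand-in, not the hardware interface: `IndexingWidget`, its attribute names, `lookup`, and `build_index` are assumptions made for illustration, and only an equality hash probe is modeled.

```python
class IndexingWidget:
    """Software stand-in for the offload engine (hypothetical model).

    The attributes mirror the configuration registers on the slide.
    """

    def __init__(self):
        self.index = None        # Index Addr.: list of hash buckets
        self.key = None          # Key to search for
        self.search_type = None  # Search Type, e.g. "equal"
        self.result = None       # Result Table Addr.: list receiving matches
        self.data_type = None    # Data type of the key

    def run(self):
        # Follows the FSM path init -> hash -> eval -> write -> end.
        if self.search_type != "equal":
            raise ValueError("this sketch models equality probes only")
        key = self.data_type(self.key)
        bucket = self.index[hash(key) % len(self.index)]  # hash
        for stored_key, tuple_id in bucket:                # eval
            if stored_key == key:
                self.result.append(tuple_id)               # write


def hash_probe(index, key, result):
    """Software fallback executed by the core when no widget is present."""
    for stored_key, tuple_id in index[hash(key) % len(index)]:
        if stored_key == key:
            result.append(tuple_id)


def lookup(index, key, widget=None):
    result = []
    if widget is not None:  # corresponds to `if (haswidget)` on the slide
        widget.index = index
        widget.key = key
        widget.search_type = "equal"
        widget.result = result
        widget.data_type = int
        widget.run()
    else:
        hash_probe(index, key, result)
    return result


def build_index(keys, num_buckets=4):
    index = [[] for _ in range(num_buckets)]
    for tuple_id, key in enumerate(keys):
        index[hash(key) % num_buckets].append((key, tuple_id))
    return index


ages = [31, 25, 47, 25]
index = build_index(ages)
with_widget = lookup(index, 25, IndexingWidget())
without_widget = lookup(index, 25)
```

Keeping the fallback path behind the same `lookup` function reflects the slide's structure: the calling code stays the same whether or not the widget exists.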
17 Widget Details: ❷ Run. [Figure: same block diagram; the widget exchanges data with the L1 cache (To/From L1)]
18 Widget Details: ❸ Return. [Figure: same block diagram; the widget stores results (Store &Result Table, Key) through the L1 cache (To/From L1)]
19 Widget Details. Indexing Widget: 1) private L1 design; 2) shared L2 design.
20 Methodology. First-order analytical model. Execution traces: Pin. Execution profiling: VTune, OProfile. Benchmark applications: OLTP (TPC-C on VoltDB) and DSS (TPC-H on MonetDB). Model parameters: L1 / L2 / LLC / off-chip latency of 4 / 6 / 29 / 180 cycles; widget buffer: fully associative cache. Energy estimations: McPAT.
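A first-order model of this kind can be illustrated with the latencies listed on the slide. The sketch below is not the authors' model: it assumes each latency is the total cost of a reference served at that level, and the mix of references per level is invented for the example.

```python
# Latencies from the slide, interpreted here as the total cost (in cycles)
# of a reference that is served at that level. This reading is an assumption.
LATENCY_CYCLES = {"L1": 4, "L2": 6, "LLC": 29, "off-chip": 180}


def average_access_cycles(served_fraction):
    """Expected cycles per memory reference for a given service mix."""
    total = sum(served_fraction.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(served_fraction[level] * LATENCY_CYCLES[level]
               for level in LATENCY_CYCLES)


def traversal_cycles(memory_refs, served_fraction):
    """Pointer chasing serializes the loads, so their latencies add up."""
    return memory_refs * average_access_cycles(served_fraction)


# Hypothetical mix: the share of references served at each level.
example_mix = {"L1": 0.70, "L2": 0.10, "LLC": 0.15, "off-chip": 0.05}
cycles_per_ref = average_access_cycles(example_mix)
```

Even with only 5% of references going off-chip in this made-up mix, the 180-cycle term contributes more than half of the average, which is consistent with the slides' point that memory time, not just instruction count, has to be attacked.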
21 Performance with Indexing Widget. [Figure: speedup for small tree, large tree, and hash table indexes at sizes of 512B, 1KB, 2KB, 4KB, and 8KB] Need to accelerate both instructions and memory.
22 Energy Efficiency with Indexing Widget. [Figure: reduction in energy (%) and application coverage (%) for Qry 9, Qry 11, Order S., Qry 2, Payment, and Qry 17; series: reduction over conventional OoO and reduction over ARM-like* OoO] Up to 65% reduction in energy.
23 Summary. Data explosion and business trends call for specialization: SoC servers are a building block for improved efficiency; even greater rethinking of architectures is needed. Exemplar data-centric workload: in-memory database; it spends significant time in indexing, which is mostly pointer chasing, so general-purpose CPUs are poorly suited. Augment the CPU with an indexing widget: dedicated offload engine (the core sleeps when the widget runs); improves efficiency (65% less energy, 6x faster query execution).
24 Opportunities and Challenges. Many other classes of data-centric workloads: unstructured text search; key-value caches; automatic document summarization; distributed array processing. But the market cannot support unlimited specialization. Are there key specialization elements that are common? Can we design reusable, purpose-built architectures? How do we provide portable software interfaces?
25 Thank you
26 Invasive Tightly-Coupled Processor Arrays Frank Hannig Hardware/Software Co-Design Department of Computer Science University of Erlangen-Nuremberg, Germany
27 Motivation. Power wall: only a constant power budget is available per device. On the other hand, there is a steady demand for more computing power. Consequence: energy per operation must decrease. These days, energy efficiency (OPS/Watt) is much more important than pure computing power (OPS). This trend goes across all levels, from portable devices such as mobile phones to high-performance computing systems. OPS = Operations Per Second. Frank Hannig, Int. Workshop on Domain-Specific Multicore Computing, Nov. 8, 2012, San Jose, CA
28 Motivation cont'd. How can we achieve better energy efficiency? Heterogeneous architectures; multiple cores; domain-specific components such as customized instruction set extensions and accelerators (GPUs, FPGAs, dedicated hardware). Customization and heterogeneity are the key to success for future performance gains.
29 Overview: Tightly-coupled processor arrays; Invasive computing and invasive tightly-coupled processor arrays; Power management; Tools and application mapping; Conclusions.
30 Tightly-coupled processor arrays (TCPAs). Class of massively parallel on-chip processor architectures. Highly customizable at synthesis time: many parameters and configuration options, e.g., number of processors, width of data path. Flexibility at run time: programmable, reconfigurable interconnect. Used as accelerators in MPSoC or tiled architectures for compute-intensive tasks from domains such as digital signal processing, audio and video; image processing, object recognition; linear algebra (matrix/vector computations); cryptography; and other streaming applications.
31 TCPAs cont'd. Processing elements (PEs): VLIW architecture; weakly programmable; small instruction set and memory; small register file; no direct main memory access. Interconnect: switched connections; single-cycle latency; switching possibilities can be defined at synthesis time, which enables the configuration of different interconnect topologies at run time.
32 TCPAs cont'd. Local memory structure: reconfigurable buffers at the borders of the array; FIFOs at the PE inputs; feedback shift registers for cyclic data reuse within PEs (enabling efficient handling of loop-carried data dependencies). Zero-overhead static control flow: multiway branch unit in each PE, combined with globally generated control predicates.
33 Resource management and mapping. Traditional approach: small number of PEs; a single application runs exclusively on the array; static mapping onto resources. Reality and future: beware of Moore's law; exponential growth; the number of PEs doubles every 18 to 24 months.
34 Challenges. Questions: How can several applications with run-time-variant constraints be mapped simultaneously onto such architectures? How can hundreds to thousands of compute resources be managed? How can energy be saved in phases when less computational power is needed? Can the architecture adapt to temporary (e.g., overheated regions) or permanent faults?
35 Invasive computing. In general, as defined by Teich: a distributed resource-aware computing methodology that gives each application the ability to explore and claim (invade) available computing resources, to copy its configuration code to such captured resources (infect), and then to execute the given program in parallel. After finishing a phase of execution, the application may free its previously occupied resources by performing a release operation (retreat). Invasion phases on TCPAs: 1. Invade: the application explores the availability of resources in its neighborhood by sending invade messages. 2. Infect: the captured (invaded) PEs are configured with the application. 3. Retreat: each processing element frees its captured neighbors by sending retreat messages.
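The three phases can be sketched as operations on a small model of the array. This is a centralized software illustration only; in the talk the phases are carried out by messages between neighboring processing elements. `ProcessorArray`, its methods, and the row-wise walk are hypothetical choices made for the example.

```python
class ProcessorArray:
    """Toy model of a 2-D array of processing elements (hypothetical)."""

    def __init__(self, rows, cols):
        self.rows = rows
        self.cols = cols
        self.owner = {}    # (row, col) -> application that claimed the element
        self.program = {}  # (row, col) -> configuration loaded by infect

    def is_free(self, pe):
        row, col = pe
        in_bounds = 0 <= row < self.rows and 0 <= col < self.cols
        return in_bounds and pe not in self.owner

    def invade(self, app, start, wanted):
        """Claim up to `wanted` free elements, walking right from `start`."""
        claimed = []
        row, col = start
        while len(claimed) < wanted and self.is_free((row, col)):
            self.owner[(row, col)] = app
            claimed.append((row, col))
            col += 1
        return claimed

    def infect(self, app, claimed, config):
        """Load the application's configuration onto its claimed elements."""
        for pe in claimed:
            if self.owner.get(pe) != app:
                raise ValueError(f"{pe} is not claimed by {app}")
            self.program[pe] = config

    def retreat(self, app, claimed):
        """Release the elements so that other applications can invade them."""
        for pe in claimed:
            if self.owner.get(pe) == app:
                del self.owner[pe]
                self.program.pop(pe, None)


array = ProcessorArray(rows=2, cols=4)
claim_a = array.invade("A", start=(0, 0), wanted=3)
array.infect("A", claim_a, config="fir_filter")
claim_b = array.invade("B", start=(0, 3), wanted=2)  # only one element is left in row 0
array.retreat("A", claim_a)
```

An application may receive fewer elements than it asked for, as application B does here, so it has to be able to run on whatever claim the invade phase returns.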
36 Invasive TCPAs. Each PE is equipped with an invasion controller (ictrl) to enable fast and decentralized resource management. Design choices: different resource exploration methods (invasion strategies), either a linearly connected region of PEs or a rectangular connected region of PEs; different designs for the invasion controller, FSM-based or programmable.
37 Invasion strategies. Linear invasion, different policies: straight policy (capturing PEs in a straight-walk fashion with maximal length of walks); random policy (selecting an available neighbor in a random-walk fashion); meander policy (capturing PEs in a meander-walk fashion).
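The difference between the straight and meander policies can be shown by listing the order in which processing elements would be captured. The sketch assumes an empty array and a walk that starts by moving right; both function names and these rules are assumptions, and the random policy is left out.

```python
def straight_walk(rows, cols, start, count):
    """Capture elements along one row until the border or `count` is reached."""
    row, col = start
    path = []
    while len(path) < count and 0 <= row < rows and col < cols:
        path.append((row, col))
        col += 1
    return path


def meander_walk(rows, cols, start, count):
    """Snake through the array: walk a row, step to the next row, reverse."""
    row, col = start
    step = 1
    path = []
    while len(path) < count and 0 <= row < rows:
        path.append((row, col))
        if 0 <= col + step < cols:
            col += step
        else:
            row += 1      # turn at the border and walk back
            step = -step
    return path


straight = straight_walk(3, 4, (0, 0), 6)
meander = meander_walk(3, 4, (0, 0), 6)
```

On a 3 x 4 array the straight walk stops at the border with four elements, while the meander walk wraps into the next row and returns all six as one contiguous region.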
38 Invasion strategies cont'd. Rectangular invasion: linear invasion in one dimension combined with parallel invasion in the second dimension. Results: linear invasion takes only 2 cycles per PE; due to parallel execution, rectangular invasion is even faster. The meander policy for linear invasion is preferable since it fragments the array least. Compared with centralized software-based resource management, resource exploration and reservation can be achieved 50 times faster. Moderate hardware cost of the invasion controller (reported in LUTs in Virtex6 FPGA technology).
39 Power management in TCPAs. Invasive computing as an enabler for power management. How to save static power? Power on invaded PEs; power off idle PEs. Hierarchical power management strategy with separate power domains for the invasion controller (first level of power management) and the processing element (second level of power management).
40 Hierarchical power management. [Figure: invade phase across a row of invasion controllers (Inv. Ctrl), each with its own PMU, showing t(P_ON) and t(P_OFF) delays, powered-on and powered-off controllers, and invade and claim messages] The retreat phase works similarly. PMU: Power Management Unit.
41 Implementation options. Multiple invasion controllers are grouped into a single power domain to reduce area and invasion time. [Figure: eight invasion controllers with eight PMUs]
42 Implementation options cont'd. [Figure: eight invasion controllers with four PMUs]
43 Implementation options cont'd. [Figure: eight invasion controllers with two PMUs]
44 Power management results. Depending on array utilization, a reduction of leakage power between 30% and 90% can be achieved. Trade-off among power, area, and invasion latency per PE for different power domain groupings. [Figure: bubble chart of area (million NAND2-equivalent gates) versus average invasion latency per PE (cycles); bubble size and color indicate power; series for different numbers of invasion controllers per domain, including 2 and 8, and for a design without power gating] Experimental setup: 10x12 TCPA; TSMC 65nm LP; Synopsys tool chain; PVT corner fast_1.32v_125c (max leakage corner); post-synthesis results.
45 Tool suite. Integrated development environment. Graphical design entry, i.e., parameterization of hundreds of design options; alternatively, a powerful ADL (architecture description language) can be used. Push-button generation of a fast cycle-accurate simulator (can be wrapped into SystemC for co-simulation), synthesizable HDL code, and configuration streams. Structural assembly code entry and intuitive (graphical) interconnect setup. Back annotation and visualization of simulation results offer rich debugging possibilities.
46 Application mapping. High-level programming environment LoopInvader. Input: DSL (domain-specific language) PAULA, a functional language; domain of multi-dimensional affine recurrence equations defined over polyhedral iteration spaces; multi-dimensional reductions. [Figure: mapping flow from the algorithm (DSL PAULA) through high-level transformations (localization, loop perfectization, output normal form, loop unrolling, partitioning, expression splitting, affine transformations, ...), allocation, space-time mapping (scheduling, resource binding), and code generation (VLIW code for each PE, configuration of interconnect, code of controller) to the TCPA configuration, which is run either in simulation on an architecture model or on the TCPA]
47 Application mapping cont'd. DSL PAULA example, discrete 2-D Gauss window filter:
    ...
    par (x >= 0 and x < 1280 and y >= 0 and y < 1024) {
      w[0,0]=1; w[0,1]=2; w[0,2]=1;
      w[1,0]=2; w[1,1]=4; w[1,2]=2;
      w[2,0]=1; w[2,1]=2; w[2,2]=1;
    }
    h[x,y] = SUM[i>=0 and i<=2 and j>=0 and j<=2] (pic_in[x+i,y+j] * w[i,j]);
    pic_out[x,y] = h[x,y] >> 4; // divided by 16
Equation shown on the slide: $pic_{out}(x, y) = \sum_{i=0}^{2} \sum_{j=0}^{2} pic_{in}(x+i, y+j) \cdot w(i, j)$, with the sum then divided by 16 as in the code.
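For readers unfamiliar with PAULA, the same filter can be written as plain sequential Python. This is only a reference sketch of the arithmetic, not what the LoopInvader flow generates; it assumes the input is two pixels larger than the output in each dimension, since the slide does not say how the border is handled, and `gauss_filter` is a hypothetical name.

```python
# 3x3 Gauss window weights from the PAULA example (they sum to 16).
W = [[1, 2, 1],
     [2, 4, 2],
     [1, 2, 1]]


def gauss_filter(pic_in, width, height):
    """Sequential version of the filter; pic_in is indexed as pic_in[x][y].

    Assumption: pic_in has (width + 2) x (height + 2) entries so that
    pic_in[x + i][y + j] is always in range.
    """
    pic_out = [[0] * height for _ in range(width)]
    for x in range(width):
        for y in range(height):
            h = sum(pic_in[x + i][y + j] * W[i][j]
                    for i in range(3) for j in range(3))
            pic_out[x][y] = h >> 4  # divide by 16
    return pic_out


flat = [[16] * 4 for _ in range(4)]  # constant image, 4 x 4 input
smoothed = gauss_filter(flat, 2, 2)
```

The two nested loops over `x` and `y` carry no dependencies between iterations, which is the parallelism that the `par` construct and the partitioning step in the mapping flow exploit.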
48 Application mapping cont'd. Loop partitioning is used for parallelization and mapping. [Figure: dependence graph mapped onto the architecture]
49 Application mapping cont'd. Current research: symbolic loop parallelization and mapping (symbolic partitioning, symbolic scheduling, symbolic control generation). At run time, only the symbols in the configuration data are replaced according to the available resources. Algorithms can be adapted quickly at run time without recompilation. Self-adaptation enables fast reaction to QoS parameters, system load, failures, etc.
50 Conclusions. Invasive tightly-coupled processor arrays: a domain-specific remedy for the arising challenges in multi-processor architectures. Hardware support for ultra-fast resource reservation. Invasive computing as an enabler for power management. Careful co-design of domain-specific architectures, domain-specific languages, and mapping tools is needed in order to achieve energy-efficient solutions while simultaneously ensuring high productivity (time-to-market) as well as portability and adaptability.
51 Further information: Invasive Tightly-Coupled Processor Arrays. Contact: Frank Hannig, Hardware/Software Co-Design, Department of CS, University of Erlangen-Nuremberg, Cauerstr. 11, Erlangen, Germany. Acknowledgements: contributors (in alphabetical order): Srinivas Boppu, Frank Hannig, Dmitrij Kissler, Alexej Kupriyanov, Vahid Lari, Shravan Muddasani, Alex Tanase, Jürgen Teich. This work is supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre Invasive Computing (SFB/TR 89).
52 References. [1] D. Kissler, F. Hannig, A. Kupriyanov, and J. Teich, "A highly parameterizible parallel processor array architecture," in Proc. of the IEEE Int. Conf. on Field Programmable Technology (FPT), Bangkok, Thailand, Dec. [2] J. Teich, "Invasive Algorithms and Architectures," it - Information Technology, 50 (2008). [3] J. Teich, J. Henkel, A. Herkersdorf, D. Schmitt-Landsiedel, W. Schröder-Preikschat, and G. Snelting, "Invasive Computing: An Overview," in Multiprocessor System-on-Chip: Hardware Design and Tool Integration, Springer, Berlin, 2011. [4] V. Lari, A. Narovlyanskyy, F. Hannig, and J. Teich, "Decentralized Dynamic Resource Management Support for Massively Parallel Processor Arrays," in Proc. of the 22nd IEEE Int. Conf. on Application-specific Systems, Architectures, and Processors (ASAP), Santa Monica, CA, USA, Sep. [5] D. Kissler, D. Gran, Z. Salcic, F. Hannig, and J. Teich, "Scalable Many-Domain Power Gating in Coarse-grained Reconfigurable Processor Arrays," IEEE Embedded Systems Letters, 3(2):58-61. [6] V. Lari, S. Muddasani, S. Boppu, F. Hannig, M. Schmid, and J. Teich, "Hierarchical Power Management for Adaptive Tightly-Coupled Processor Arrays," to appear in ACM Transactions on Design Automation of Electronic Systems (TODAES).
Dark Silicon Accelerators for Database Indexing
Dark Silicon Accelerators for Database Indexing Onur Kocberber, Kevin Lim, Babak Falsafi, Partha Ranganathan, Stavros Harizopoulos Dark Silicon and Big Data Challenges Data explosion Data growing faster
More informationRuntime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays
Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Éricles Sousa 1, Frank Hannig 1, Jürgen Teich 1, Qingqing Chen 2, and Ulf Schlichtmann
More informationMeet the Walkers! Accelerating Index Traversals for In-Memory Databases"
Meet the Walkers! Accelerating Index Traversals for In-Memory Databases Onur Kocberber Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, Parthasarathy Ranganathan Our World is Data-Driven! Data resides
More informationMassively Parallel Processor Architectures for Resource-aware Computing
Massively Parallel Processor Architectures for Resource-aware Computing Vahid Lari, Alexandru Tanase, Frank Hannig, and Jürgen Teich Hardware/Software Co-Design, Department of Computer Science Friedrich-Alexander
More informationMany-Core Computing Era and New Challenges. Nikos Hardavellas, EECS
Many-Core Computing Era and New Challenges Nikos Hardavellas, EECS Moore s Law Is Alive And Well 90nm 90nm transistor (Intel, 2005) Swine Flu A/H1N1 (CDC) 65nm 2007 45nm 2010 32nm 2013 22nm 2016 16nm 2019
More informationEnergy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS
Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Who am I? Education Master of Technology, NTNU, 2007 PhD, NTNU, 2010. Title: «Managing Shared Resources in Chip Multiprocessor Memory
More informationSystem-on-Chip Architecture for Mobile Applications. Sabyasachi Dey
System-on-Chip Architecture for Mobile Applications Sabyasachi Dey Email: sabyasachi.dey@gmail.com Agenda What is Mobile Application Platform Challenges Key Architecture Focus Areas Conclusion Mobile Revolution
More informationERCBench An Open-Source Benchmark Suite for Embedded and Reconfigurable Computing
ERCBench An Open-Source Benchmark Suite for Embedded and Reconfigurable Computing Daniel Chang Chris Jenkins, Philip Garcia, Syed Gilani, Paula Aguilera, Aishwarya Nagarajan, Michael Anderson, Matthew
More informationAccelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh
Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary
More informationGeneration of Multigrid-based Numerical Solvers for FPGA Accelerators
Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Christian Schmitt, Moritz Schmid, Frank Hannig, Jürgen Teich, Sebastian Kuckuk, Harald Köstler Hardware/Software Co-Design, System
More informationUsing FPGAs as Microservices
Using FPGAs as Microservices David Ojika, Ann Gordon-Ross, Herman Lam, Bhavesh Patel, Gaurav Kaul, Jayson Strayer (University of Florida, DELL EMC, Intel Corporation) The 9 th Workshop on Big Data Benchmarks,
More informationCIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )
Guide: CIS 601 Graduate Seminar Presented By: Dr. Sunnie S. Chung Dhruv Patel (2652790) Kalpesh Sharma (2660576) Introduction Background Parallel Data Warehouse (PDW) Hive MongoDB Client-side Shared SQL
More informationDr. Yassine Hariri CMC Microsystems
Dr. Yassine Hariri Hariri@cmc.ca CMC Microsystems 03-26-2013 Agenda MCES Workshop Agenda and Topics Canada s National Design Network and CMC Microsystems Processor Eras: Background and History Single core
More informationOutline Marquette University
COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations
More informationAcceleration of Optical Flow Computations on Tightly-Coupled Processor Arrays
Acceleration of Optical Flow Computations on Tightly-Coupled Processor Arrays Éricles Rodrigues Sousa 1, Alexandru Tanase 1,VahidLari 1, Frank Hannig 1,Jürgen Teich 1, Johny Paul 2, Walter Stechele 2,
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationToward a Memory-centric Architecture
Toward a Memory-centric Architecture Martin Fink EVP & Chief Technology Officer Western Digital Corporation August 8, 2017 1 SAFE HARBOR DISCLAIMERS Forward-Looking Statements This presentation contains
More informationA Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning
A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning By: Roman Lysecky and Frank Vahid Presented By: Anton Kiriwas Disclaimer This specific
More informationHyPer-sonic Combined Transaction AND Query Processing
HyPer-sonic Combined Transaction AND Query Processing Thomas Neumann Technische Universität München December 2, 2011 Motivation There are different scenarios for database usage: OLTP: Online Transaction
More informationMaximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman
Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael
More informationAn Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection
An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and Takashi Miyamori Toshiba Corporation, Kawasaki, Japan Copyright 2013,
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationDesign methodology for multi processor systems design on regular platforms
Design methodology for multi processor systems design on regular platforms Ph.D in Electronics, Computer Science and Telecommunications Ph.D Student: Davide Rossi Ph.D Tutor: Prof. Roberto Guerrieri Outline
More informationTen Reasons to Optimize a Processor
By Neil Robinson SoC designs today require application-specific logic that meets exacting design requirements, yet is flexible enough to adjust to evolving industry standards. Optimizing your processor
More informationCo-synthesis and Accelerator based Embedded System Design
Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer
More informationHypervisors at Hyperscale
Hypervisors at Hyperscale ARM, Xen, Servers and Evolution of the Data Center Larry Wikelius Co-Founder & VP Software 1 Overview l Market Dynamics l Technology Trends l Roadmaps Where are we today l Use
More informationBUYING SERVER HARDWARE FOR A SCALABLE VIRTUAL INFRASTRUCTURE
E-Guide BUYING SERVER HARDWARE FOR A SCALABLE VIRTUAL INFRASTRUCTURE SearchServer Virtualization P art 1 of this series explores how trends in buying server hardware have been influenced by the scale-up
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Data Centric Systems and Networking Emergence of Big Data Shift of Communication Paradigm From end-to-end to data
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests
ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests Mingxing Tan 1 2, Gai Liu 1, Ritchie Zhao 1, Steve Dai 1, Zhiru Zhang 1 1 Computer Systems Laboratory, Electrical and Computer
More informationLecture 1: Introduction
Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline
More informationWhat Is There to a Specialized Data Warehousing Platform?
What Is There to a Specialized Data Warehousing Platform? Hermann Bär Oracle USA Redwood Shores, CA Keywords: data warehousing, Exadata, specialized hardware, proprietary hardware Introduction
More informationTECHNICAL OVERVIEW ACCELERATED COMPUTING AND THE DEMOCRATIZATION OF SUPERCOMPUTING
TECHNICAL OVERVIEW ACCELERATED COMPUTING AND THE DEMOCRATIZATION OF SUPERCOMPUTING Table of Contents: The Accelerated Data Center Optimizing Data Center Productivity Same Throughput with Fewer Server Nodes
More informationFastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs
FastTrack: Leveraging Heterogeneous FPGA Wires to Design Low-cost High-performance Soft NoCs Nachiket Kapre + Tushar Krishna nachiket@uwaterloo.ca, tushar@ece.gatech.edu Claim FPGA overlay NoCs
More informationArchitecture-Conscious Database Systems
Architecture-Conscious Database Systems 2009 VLDB Summer School Shanghai Peter Boncz (CWI) Sources Thank You! Database Architectures for New Hardware VLDB 2004 tutorial, Anastassia Ailamaki Query
More information2. TOPOLOGICAL PATTERN ANALYSIS
Methodology for analyzing and quantifying design style changes and complexity using topological patterns Jason P. Cain a, Ya-Chieh Lai b, Frank Gennari b, Jason Sweis b a Advanced Micro Devices, 7171 Southwest
More informationThe Challenges of System Design. Raising Performance and Reducing Power Consumption
The Challenges of System Design Raising Performance and Reducing Power Consumption 1 Agenda The key challenges Visibility for software optimisation Efficiency for improved PPA 2 Product Challenge - Software
More informationA Novel Deadlock Avoidance Algorithm and Its Hardware Implementation
A Novel Deadlock Avoidance Algorithm and Its Hardware Implementation + Jaehwan Lee and *Vincent* J. Mooney III Hardware/Software RTOS Group Center for Research on Embedded Systems and Technology (CREST)
More informationReadings: Room-level, on-chip vs.
Suggested Readings: H&P: Chapter 7, especially 7.1-7.8 (over next 2 weeks); Introduction to Parallel Computing: https://computing.llnl.gov/tutorials/parallel_comp/; POSIX Threads
More informationScaling Datacenter Accelerators With Compute-Reuse Architectures
Scaling Datacenter Accelerators With Compute-Reuse Architectures Adi Fuchs and David Wentzlaff ISCA 2018 Session 5A June 5, 2018 Los Angeles, CA Scaling Datacenter Accelerators With Compute-Reuse Architectures
More informationBalancing DRAM Locality and Parallelism in Shared Memory CMP Systems
Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard
More informationThe Future of High Performance Computing
The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer
More informationNoC Simulation in Heterogeneous Architectures for PGAS Programming Model
NoC Simulation in Heterogeneous Architectures for PGAS Programming Model Sascha Roloff, Andreas Weichslgartner, Frank Hannig, Jürgen Teich University of Erlangen-Nuremberg, Germany Jan Heißwolf Karlsruhe
More informationProfiling and Debugging OpenCL Applications with ARM Development Tools. October 2014
Profiling and Debugging OpenCL Applications with ARM Development Tools October 2014 1 Agenda 1. Introduction to GPU Compute 2. ARM Development Solutions 3. Mali GPU Architecture 4. Using ARM DS-5 Streamline
More informationMultithreading: Exploiting Thread-Level Parallelism within a Processor
Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we've seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced
More informationMEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS
MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS INSTRUCTOR: Dr. MUHAMMAD SHAABAN PRESENTED BY: MOHIT SATHAWANE AKSHAY YEMBARWAR WHAT IS MULTICORE SYSTEMS? Multi-core processor architecture means placing
More informationAccelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card
Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card The Rise of MongoDB Summary One of today's growing database
More informationSPARK: A Parallelizing High-Level Synthesis Framework
SPARK: A Parallelizing High-Level Synthesis Framework Sumit Gupta Rajesh Gupta, Nikil Dutt, Alex Nicolau Center for Embedded Computer Systems University of California, Irvine and San Diego http://www.cecs.uci.edu/~spark
More informationNear Memory Key/Value Lookup Acceleration MemSys 2017
Near Memory Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy
More informationESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder
ESE532: System-on-a-Chip Architecture Day 8: September 26, 2018 Spatial Computations Today Graph Cycles (from Day 7) Accelerator Pipelines FPGAs Zynq Computational Capacity 1 2 Message Custom accelerators
More informationEfficient Hardware Acceleration on SoC-FPGA using OpenCL
Efficient Hardware Acceleration on SoC-FPGA using OpenCL Advisor: Dr. Benjamin Carrion Schafer Susmitha Gogineni 30th August 17 Presentation Overview 1. Objective & Motivation 2. Configurable SoC-FPGA
More informationEffective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management
International Journal of Computer Theory and Engineering, Vol., No., December 01 Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management Sultan Daud Khan, Member,
More informationChapter 8: Memory-Management Strategies
Chapter 8: Memory-Management Strategies Chapter 8: Memory Management Strategies Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and
More informationZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS
ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200
More informationAddressing the Memory Wall
Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the
More informationHyPer-sonic Combined Transaction AND Query Processing
HyPer-sonic Combined Transaction AND Query Processing Thomas Neumann Technische Universität München October 26, 2011 Motivation - OLTP vs. OLAP OLTP and OLAP have very different requirements OLTP high
More informationReconfigurable Computing. Design and Implementation. Chapter 4.1
Design and Implementation Chapter 4.1 Prof. Dr.-Ing. Jürgen Teich Lehrstuhl für Hardware-Software-Co-Design In System Integration System Integration Rapid Prototyping Reconfigurable devices (RD) are usually
More informationUnderstanding and Improving the Cost of Scaling Distributed Event Processing
Understanding and Improving the Cost of Scaling Distributed Event Processing Shoaib Akram, Manolis Marazakis, and Angelos Bilas shbakram@ics.forth.gr Foundation for Research and Technology Hellas (FORTH)
More informationAdvanced RDMA-based Admission Control for Modern Data-Centers
Advanced RDMA-based Admission Control for Modern Data-Centers Ping Lai Sundeep Narravula Karthikeyan Vaidyanathan Dhabaleswar. K. Panda Computer Science & Engineering Department Ohio State University Outline
More informationStaged Memory Scheduling
Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12th, 2012 Executive Summary Observation:
More informationChapter 8: Main Memory
Chapter 8: Main Memory Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel 32 and 64-bit Architectures Example:
More informationWhen, Where & Why to Use NoSQL?
When, Where & Why to Use NoSQL? 1 Big data is becoming a big challenge for enterprises. Many organizations have built environments for transactional data with Relational Database Management Systems (RDBMS),
More informationComputer Architecture. What is it?
Computer Architecture Venkatesh Akella EEC 270 Winter 2005 What is it? EEC270 Computer Architecture Basically a story of unprecedented improvement $1K buys you a machine that was 1-5 million dollars a
More informationLarge-Scale Network Simulation Scalability and an FPGA-based Network Simulator
Large-Scale Network Simulation Scalability and an FPGA-based Network Simulator Stanley Bak Abstract Network algorithms are deployed on large networks, and proper algorithm evaluation is necessary to avoid
More informationCHAPTER 8: MEMORY MANAGEMENT. By I-Chen Lin Textbook: Operating System Concepts 9th Ed.
CHAPTER 8: MEMORY MANAGEMENT By I-Chen Lin Textbook: Operating System Concepts 9th Ed. Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the
More informationUnderstanding Architectural Trade-offs in Object Storage Technologies
Object Storage 201 Understanding Architectural Trade-offs in Object Storage Technologies SNIA Legal Notice The material contained in this tutorial is copyrighted by the SNIA
More informationThe Bifrost GPU architecture and the ARM Mali-G71 GPU
The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP ARM licenses Soft IP cores (amongst other things) to our
More informationM7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle
M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.
More informationAn Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki
An Ultra High Performance Scalable DSP Family for Multimedia Hot Chips 17 August 2005 Stanford, CA Erik Machnicki Media Processing Challenges Increasing performance requirements Need for flexibility &
More informationAutomatic Scaling Iterative Computations. Aug. 7 th, 2012
Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7th, 2012 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics
More informationBig Data It's not just for Google Any More
Big Data It's not just for Google Any More The Software and Compelling Economics of Big Data Computing EXECUTIVE SUMMARY Big Data holds out the promise of providing businesses with differentiated competitive
More informationInstruction Encoding Synthesis For Architecture Exploration
Instruction Encoding Synthesis For Architecture Exploration "Compiler Optimizations for Code Density of Variable Length Instructions", "Heuristics for Greedy Transport Triggered Architecture Interconnect
More informationParallelism. CS6787 Lecture 8 Fall 2017
Parallelism CS6787 Lecture 8 Fall 2017 So far We've been talking about algorithms We've been talking about ways to optimize their parameters But we haven't talked about the underlying hardware How does
More informationIntel Quartus Prime Design Software
Intel Quartus Prime Design Software Fastest Path to Your Design The Intel Quartus Prime software is revolutionary in performance and productivity for FPGA, CPLD, and SoC designs, providing a
More informationAMD Disaggregates the Server, Defines New Hyperscale Building Block
AMD Disaggregates the Server, Defines New Hyperscale Building Block Fabric Based Architecture Enables Next Generation Data Center Optimization Executive Summary AMD SeaMicro s disaggregated server enables
More informationThe Design and Implementation of a Low-Latency On-Chip Network
The Design and Implementation of a Low-Latency On-Chip Network Robert Mullins 11 th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 24-27 th, 2006, Yokohama, Japan. Introduction Current
More informationdesigning a GPU Computing Solution
designing a GPU Computing Solution Patrick Van Reeth EMEA HPC Competency Center - GPU Computing Solutions Saturday, May the 29th, 2010 1 2010 Hewlett-Packard Development Company, L.P. The information contained
More informationStorage Networking Strategy for the Next Five Years
White Paper Storage Networking Strategy for the Next Five Years 2018 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public Information. Page 1 of 8 Top considerations for storage
More informationCSE502: Computer Architecture CSE 502: Computer Architecture
CSE 502: Computer Architecture Multi-{Socket,,Thread} Getting More Performance Keep pushing IPC and/or frequency Design complexity (time to market) Cooling (cost) Power delivery (cost) Possible, but too
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationDCBench: a Data Center Benchmark Suite
DCBench: a Data Center Benchmark Suite Zhen Jia (贾禛) http://prof.ict.ac.cn/zhenjia/ Institute of Computing Technology, Chinese Academy of Sciences workshop in conjunction with CCF October 31, 2013, Guilin
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationDecentralized Distributed Storage System for Big Data
Decentralized Distributed Storage System for Big Data Presenter: Wei Xie Data-Intensive Scalable Computing Laboratory (DISCL) Computer Science Department Texas Tech University Outline Trends in Big Data and Cloud Storage
More informationConfigurable String Matching Hardware for Speeding up Intrusion Detection
Configurable String Matching Hardware for Speeding up Intrusion Detection Monther Aldwairi, Thomas Conte, Paul Franzon Dec 6, 2004 North Carolina State University {mmaldwai, conte, paulf}@ncsu.edu www.ece.ncsu.edu/erl
More informationScalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA
Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089
More informationDesign Space Exploration and Application Autotuning for Runtime Adaptivity in Multicore Architectures
Design Space Exploration and Application Autotuning for Runtime Adaptivity in Multicore Architectures Cristina Silvano Politecnico di Milano cristina.silvano@polimi.it Outline Research challenges in multicore
More informationSoftware and Tools for HPE's The Machine Project
Labs Software and Tools for HPE's The Machine Project Scalable Tools Workshop Aug/1 - Aug/4, 2016 Lake Tahoe Milind Chabbi Traditional Computing Paradigm CPU DRAM CPU DRAM CPU-centric computing 2 CPU-Centric
More informationOCP Engineering Workshop - Telco
OCP Engineering Workshop - Telco Low Latency Mobile Edge Computing Trevor Hiatt Product Management, IDT IDT Company Overview Founded 1980 Workforce Approximately 1,800 employees Headquarters San Jose,
More informationHybrid Implementation of 3D Kirchhoff Migration
Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation
More informationLow-Power Processor Solutions for Always-on Devices
Low-Power Processor Solutions for Always-on Devices Pieter van der Wolf MPSoC 2014 July 7 11, 2014 2014 Synopsys, Inc. All rights reserved. 1 Always-on Mobile Devices Mobile devices on the move Mobile
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationFCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow
FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow Abstract: High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture
More informationScaleArc for SQL Server
Solution Brief ScaleArc for SQL Server Overview Organizations around the world depend on SQL Server for their revenuegenerating, customer-facing applications, running their most business-critical operations
More informationPortable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh.
Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization Dmitry I. Lyakh liakhdi@ornl.gov This research used resources of the Oak Ridge Leadership Computing Facility at the
More informationChapter 8: Main Memory. Operating System Concepts 9 th Edition
Chapter 8: Main Memory Silberschatz, Galvin and Gagne 2013 Chapter 8: Memory Management Background Swapping Contiguous Memory Allocation Segmentation Paging Structure of the Page Table Example: The Intel
More informationReactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches
Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches Nikos Hardavellas Michael Ferdman, Babak Falsafi, Anastasia Ailamaki Carnegie Mellon and EPFL Data Placement in Distributed
More informationAbstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight
ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight Date: April 218 Author: Kerry Dolan, Senior IT Validation Analyst Abstract Enterprise Strategy Group
More informationCHAPTER 8 - MEMORY MANAGEMENT STRATEGIES
CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide
More informationOLAP Introduction and Overview
1 CHAPTER 1 OLAP Introduction and Overview What Is OLAP? 1 Data Storage and Access 1 Benefits of OLAP 2 What Is a Cube? 2 Understanding the Cube Structure 3 What Is SAS OLAP Server? 3 About Cube Metadata
More information