An MPSoC for Energy-Efficient Database Query Processing

Size: px
Start display at page:

Download "An MPSoC for Energy-Efficient Database Query Processing"

Transcription

1 Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. Dr. h.c. G. Fettweis An MPSoC for Energy-Efficient Database Query Processing TensilicaDay 2016 Sebastian Haas Emil Matúš Gerhard Fettweis

2 Introduction Challenges 5G Mobile Networks Mobile Edge Computing Tactile Internet Internet of Things Database System Requirements Throughput Latency Energy MPSoC Application-Specific Customizable Energy-Efficient Slide 2

3 Database Systems SQL (Structured Query Language): BEGIN FOR X IN Loop SELECT * WHERE data=x END Loop END; Query Processing Big Data Query Throughput Direct interconnection to user and storage Query Latency Database Accelerator (DBA) Slide 3

4 Base Processor Tensilica Xtensa LX5 RISC Processor Local RAM XLMI 2x 32kB data 1x 32kB instruction 1 Load-Store unit (LSU) 32 bit data 32 bit instruction Slide 4

5 Extended Processor Tensilica Xtensa LX5 RISC Processor Database-Specific Instruction Set Local RAM XLMI 2x 32kB data 1x 32kB instruction 2 Load-Store units (LSU) 2x 128 bit data 64 bit FLIX 64 bit instruction Data Prefetcher Slide 5

6 DBA Applications Merge Sort Intersection Hashing <32 Bit Key> Bit Selection (32 Bit n Bit) Shuffle Network Result (n Bit) Bitmap Index Compression *0 1*1 5*0 1*0 2*1 4*0 1*1 1*0 2*0 1*1 Slide 6

7 Hashing: TIE Development unsigned int hash, shval, shval_neg; unsigned int mask = 0xFFFFFFFF; //init pointer, variables init_states(key, hashvalue, hashfunc); for(i=0; i<keysize; i++){ //load key, bit selection hash = key[i] & hashfunc; //extract bits for(j=30; j>=0; j--){ if(!(hashfunc & (0x1<<j))){ //partial shift right shval = hash & (mask<<j); shval_neg = hash & ~(mask<<j); hash = (shval>>1) shval_neg; } } //store hash value hashvalue[i] = hash; } Pure C code LD_0(); LD_1(); //load keys, extract bits, store hash values for(i=0; i<(keysize/16); i++){ LD_0(); LD_1(); HOP(); LD_0(); LD_1(); HOP(); ST_0(); ST_1(); } HOP(); ST_0(); ST_1(); C code with TIE instructions 1 cycle 1 cycle 1 cycle Slide 7

8 Hashing: Instruction Flow Local Data Memory 0 Dataflow HOP LD_0 Hash Func Load-Store Unit 0 Key_0 Key_1 Key_2 Key_3 HASH Op. HASH Op. HASH Op. HASH Op. HASH Op. HASH Op. HASH Op. HASH Op. ST Result_0 Result_1 Result_2 Result_3 Result_4 Result_5 Result_6 Result_7 Load Execution Dataflow LD_1 Key_4 Key_5 Key_6 Key_7 Load-Store Unit 1 Load Local Data Memory 1 Slide 8

9 Hashing: Pipeline Snippet Cycle n Cycle (n+1) Cycle (n+2) Cycle (n+3) Cycle (n+4) Cycle (n+5) Cycle (n+6) Cycle (n+7) Cycle (n+8) ST_0 HOP ST_1 LD_0 LD_1 HOP ST_0 LD_0 LD_1 HOP LD_0 LD_1 ST_1 HOP ST_0 Latency: 6 cycles LD_0 LD_1 HOP ST_1 LD_0 LD_1 HOP LD_0 LD_1 Slide 9

10 Hashing: Results Final processor +1 Load-Store unit (2x) + TIE instructions (500x) Data bus: bit (2x) Throughput TT = nn kkkkkk tt n key : number of keys t: time to perform the operation Slide 10

11 Tomahawk Concept SW Application Start S1 T1 if S2 S3 T2 T5 Control Flow Control-Plane Subsystem Cache Application Processor Global Memory Core Manager Data-Plane Subsystem Local Memory PE1... Local Memory PEn T3 T4 End Data Flow... S1 S2 S2 S3 T1 T2 T3 T4 T2 T3 T4 T5 T3 T3 T2 PE3 PE2 T1 T2 T4 T4 T5 PE1 Dynamic out-of-order task dispatching Slide 11

12 Tomahawk3 MPSoC 28 nm CMOS SLP Globalfoundries Die Size: 18 mm² Tape-Out: Oct Core heterogeneous MPSoC (2 core types) Network-on-Chip: High-Speed Serial Data Link with 2 Routers Power Management: DVFS, AVS 2x LPDDR2 Memory Interface: 2x 64 MB SDRAM Processing Elements: Tensilica Xtensa LX5 Extended Instruction Set for Database Applications CoreManager: Optimized for query processing Tensilica Xtensa LX5 Slide 12

13 Single Core Performance Slide 13

14 System Level CoreManager Processing Element DMAC Core Global Memory Init Data Trf.: READ Start Data Transfer Data Transfer READ Data Transfer Start Core End Data Transfer Init Data Trf.: WRITE Start Data Transfer Data Transfer End Data Transfer Processing Core finished WRITE Data Transfer Slide 14

15 Scan Benchmark Scan operation scans data set with respect to a reference element (Filtering, Equality Search) Result of comparison is one bit TIE Instructions available for 4, 8, 16, and 32-bit input values Example: Input set with 8-bit elements Result set with single bits Scan reference element: Advantages High instruction level parallelism High data level parallelism Slide 15

16 Scan Benchmark: Results Throughput [Gbit/s] MHz: 500 MHz: Performance RISC RISC ASIP ASIP 100x 4x Power [mw] Power Consumption 250 MHz: 500 MHz: RISC RISC ASIP ASIP Number of Cores Number of Cores Speedup due to: Core Extensions/TIE: 100x 1 core 4 cores: 4x Power increase due to: Core Extensions/TIE: 10mW 1 core 4 cores: 190mW Slide 16

17 Scan Benchmark: Results Energy Efficiency [pj/bit] MHz: 500 MHz: Energy Efficiency RISC RISC ASIP ASIP 4x Energy decrease due to: Core Extensions/TIE: 100x 1 core 4 cores: negligible 500MHz 250 MHz 4x Number of Cores Slide 17

18 Benchmark Comparisons App. Scenario Scan WAH Indexing Tomahawk3 2x Intel Xeon E [1] 2x Intel Xeon E5430 [2] Tomahawk3 Intel i7-2600k [3] NVIDIA GTX-670 [3] Processing Cores Clock Freq. [GHz] Data Deposit Loc. Mem DRAM Cache DRAM DRAM Loc. Mem DRAM DRAM DRAM Total Throughput [Gbit/s] Total Power [W] DRAM Power [W] Power/Throughput [nj/bit] References: [1] F. Fusco, et al., Indexing Million of Packets per Second using GPUs, In Proceedings of the 2013 Conference on Internet Measurement Conference (IMC'13), [2] I. Psaroudakis, T. Kissinger, D. Porobic, T. Ilsche, E. Liarou, P. Tözün, A. Ailamaki, and W. Lehner. Dynamic fine-grained scheduling for energy-efficient main-memory queries. In Proceedings of the Tenth International Workshop on Data Management on New Hardware, DaMoN'14, [3] D. Tsirogiannis, S. Harizopoulos, and M. A. Shah. Analyzing the energy efficiency of a database server. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD'10, Slide 18

19 Tomahawk3 Demonstrator Acknowledgements: We would like to thank Cadence and Tensilica for providing software tools an IP as well as the Chair for Highly-Parallel VLSI-Systems and Neuromorphic Circuits for Backend-Design and PCB development of the Tomahawk3 chip. Slide 19

20 Thank you! References: [1] O. Arnold, S. Haas, G. Fettweis, B. Schlegel, T. Kissinger, W. Lehner: An Application-Specific Instruction Set for Accelerating Set-Oriented Database Primitives, SIGMOD [2] O. Arnold, S. Haas, G. Fettweis, B. Schlegel, T. Kissinger, T. Karnagel, W. Lehner: HASHI: An Application-Specific Instruction Set Extension for Hashing, ADMS [3] B. Nöthen et al.: A 105GOPS 36mm2 Heterogeneous SDR MPSoC with Energy-Aware Dynamic Scheduling and Iterative Detection-Decoding for 4G in 65nm CMOS. ISSCC 2014 [4] O. Arnold, E. Matus, B. Nöthen, M. Winter, T. Limberg and G. Fettweis: Tomahawk - Parallelism and Heterogeneity in Communications Signal Processing MPSoCs. TECS 2013 Slide 20

Overview on Hardware Optimizations for Database Engines

Overview on Hardware Optimizations for Database Engines Overview on Hardware Optimizations for Database Engines Annett Ungethüm, Dirk Habich, Tomas Karnagel, Sebastian Haas, Eric Mier, Gerhard Fettweis, Wolfgang Lehner BTW 2017, Stuttgart, Germany, 2017-03-09

More information

A Database Accelerator for Energy-Efficient Query Processing and Optimization

A Database Accelerator for Energy-Efficient Query Processing and Optimization A Database Accelerator for Energy-Efficient Query Processing and Optimization Sebastian Haas, Oliver Arnold, Stefan Scholze, Sebastian Höppner, Georg Ellguth, Andreas Dixius, Annett Ungethüm, Eric Mier,

More information

HW/SW-Database-CoDesign for Compressed Bitmap Index Processing

HW/SW-Database-CoDesign for Compressed Bitmap Index Processing HW/SW-Database-CoDesign for Compressed Bitmap Index Processing Sebastian Haas, Tomas Karnagel, Oliver Arnold, Erik Laux 1, Benjamin Schlegel 2, Gerhard Fettweis, Wolfgang Lehner Vodafone Chair Mobile Communications

More information

HW/SW Co-Design Lab. Seminar 1 WS 2017/2018. chair. Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. Dr. h.c. G.

HW/SW Co-Design Lab. Seminar 1 WS 2017/2018. chair. Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. Dr. h.c. G. Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. Dr. h.c. G. Fettweis HW/SW Co-Design Lab Seminar 1 WS 2017/2018 TU Dresden, Slide 1 Schedule of the Lab Introduction: Lab task: Work: Become

More information

A Heterogeneous SDR MPSoC in 28 nm CMOS for Low-Latency Wireless Applications

A Heterogeneous SDR MPSoC in 28 nm CMOS for Low-Latency Wireless Applications A Heterogeneous SDR MPSoC in 28 nm CMOS for Low-Latency Wireless Applications Sebastian Haas 1, Tobias Seifert 1, Benedikt Nöthen 1, Stefan Scholze 1, Sebastian Höppner 1, Andreas Dixius 1, Esther Pérez

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Overview on Hardware Optimizations for Database Engines

Overview on Hardware Optimizations for Database Engines B. Mitschang et al. (Hrsg.): Datenbanksysteme für Business, Technologie und Web (BTW 2017), Lecture Notes in Informatics (LNI), Gesellschaft für Informatik, Bonn 2017 383 Overview on Hardware Optimizations

More information

Energy Efficiency in Main-Memory Databases

Energy Efficiency in Main-Memory Databases B. Mitschang et al. (Hrsg.): BTW 2017 Workshopband, Lecture Notes in Informatics (LNI), Gesellschaft für Informatik, Bonn 2017 335 Energy Efficiency in Main-Memory Databases Stefan Noll 1 Abstract: As

More information

Introduction to Microprocessor

Introduction to Microprocessor Introduction to Microprocessor Slide 1 Microprocessor A microprocessor is a multipurpose, programmable, clock-driven, register-based electronic device That reads binary instructions from a storage device

More information

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance

More information

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture

More information

class 9 fast scans 1.0 prof. Stratos Idreos

class 9 fast scans 1.0 prof. Stratos Idreos class 9 fast scans 1.0 prof. Stratos Idreos HTTP://DASLAB.SEAS.HARVARD.EDU/CLASSES/CS165/ 1 pass to merge into 8 sorted pages (2N pages) 1 pass to merge into 4 sorted pages (2N pages) 1 pass to merge into

More information

Intel released new technology call P6P

Intel released new technology call P6P P6 and IA-64 8086 released on 1978 Pentium release on 1993 8086 has upgrade by Pipeline, Super scalar, Clock frequency, Cache and so on But 8086 has limit, Hard to improve efficiency Intel released new

More information

Anastasia Ailamaki. Performance and energy analysis using transactional workloads

Anastasia Ailamaki. Performance and energy analysis using transactional workloads Performance and energy analysis using transactional workloads Anastasia Ailamaki EPFL and RAW Labs SA students: Danica Porobic, Utku Sirin, and Pinar Tozun Online Transaction Processing $2B+ industry Characteristics:

More information

Venezia: a Scalable Multicore Subsystem for Multimedia Applications

Venezia: a Scalable Multicore Subsystem for Multimedia Applications Venezia: a Scalable Multicore Subsystem for Multimedia Applications Takashi Miyamori Toshiba Corporation Outline Background Venezia Hardware Architecture Venezia Software Architecture Evaluation Chip and

More information

KiloCore: A 32 nm 1000-Processor Array

KiloCore: A 32 nm 1000-Processor Array KiloCore: A 32 nm 1000-Processor Array Brent Bohnenstiehl, Aaron Stillmaker, Jon Pimentel, Timothy Andreas, Bin Liu, Anh Tran, Emmanuel Adeagbo, Bevan Baas University of California, Davis VLSI Computation

More information

Gemini: Sanjiv Kapil. A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor. Gemini Architect Sun Microsystems, Inc.

Gemini: Sanjiv Kapil. A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor. Gemini Architect Sun Microsystems, Inc. Gemini: A Power-efficient Chip Multi-Threaded (CMT) UltraSPARC Processor Sanjiv Kapil Gemini Architect Sun Microsystems, Inc. Design Goals Designed for compute-dense, transaction oriented systems (webservers,

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Xtensa. Andrew Mihal 290A Fall 2002

Xtensa. Andrew Mihal 290A Fall 2002 Xtensa Andrew Mihal 290A Fall 2002 1 Outline Introduction Single processor Xtensa system architecture Exporting a programming model for single processor Multiple processor system architecture Exporting

More information

A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on

A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on A Reconfigurable Crossbar Switch with Adaptive Bandwidth Control for Networks-on on-chip Donghyun Kim, Kangmin Lee, Se-joong Lee and Hoi-Jun Yoo Semiconductor System Laboratory, Dept. of EECS, Korea Advanced

More information

Adaptive Query Processing on Prefix Trees Wolfgang Lehner

Adaptive Query Processing on Prefix Trees Wolfgang Lehner Adaptive Query Processing on Prefix Trees Wolfgang Lehner Fachgruppentreffen, 22.11.2012 TU München Prof. Dr.-Ing. Wolfgang Lehner > Challenges for Database Systems Three things are important in the database

More information

Networks for Multi-core Chips A A Contrarian View. Shekhar Borkar Aug 27, 2007 Intel Corp.

Networks for Multi-core Chips A A Contrarian View. Shekhar Borkar Aug 27, 2007 Intel Corp. Networks for Multi-core hips A A ontrarian View Shekhar Borkar Aug 27, 2007 Intel orp. 1 Outline Multi-core system outlook On die network challenges A simple contrarian proposal Benefits Summary 2 A Sample

More information

Computer Architecture. Introduction. Lynn Choi Korea University

Computer Architecture. Introduction. Lynn Choi Korea University Computer Architecture Introduction Lynn Choi Korea University Class Information Lecturer Prof. Lynn Choi, School of Electrical Eng. Phone: 3290-3249, 공학관 411, lchoi@korea.ac.kr, TA: 윤창현 / 신동욱, 3290-3896,

More information

CS 152, Spring 2011 Section 10

CS 152, Spring 2011 Section 10 CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley Agenda Stuff (Quiz 4 Prep) http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/ Intel

More information

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources This Unit: Putting It All Together CIS 501 Computer Architecture Unit 12: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital Circuits

More information

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture? This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital

More information

Fast IP Routing Lookup with Configurable Processor and Compressed Routing Table

Fast IP Routing Lookup with Configurable Processor and Compressed Routing Table Fast IP Routing Lookup with Configurable Processor and Compressed Routing Table H. Michael Ji, and Ranga Srinivasan Tensilica, Inc. 3255-6 Scott Blvd Santa Clara, CA 95054 Abstract--In this paper we examine

More information

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital

More information

C-Based Hardware Design Platform for Dynamically Reconfigurable Processor

C-Based Hardware Design Platform for Dynamically Reconfigurable Processor C-Based Hardware Design Platform for Dynamically Reconfigurable Processor September 22 nd, 2005 IPFlex Inc. Agenda Merits of C-Based hardware design Hardware enabling C-Based hardware design DAPDNA-FW

More information

Multithreaded Architectures and The Sort Benchmark. Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University

Multithreaded Architectures and The Sort Benchmark. Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University Multithreaded Architectures and The Sort Benchmark Phil Garcia Hank Korth Dept. of Computer Science and Engineering Lehigh University About our Sort Benchmark Based on the benchmark proposed in A measure

More information

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Organization Part II Ujjwal Guin, Assistant Professor Department of Electrical and Computer Engineering Auburn University, Auburn,

More information

Rhythm: Harnessing Data Parallel Hardware for Server Workloads

Rhythm: Harnessing Data Parallel Hardware for Server Workloads Rhythm: Harnessing Data Parallel Hardware for Server Workloads Sandeep R. Agrawal $ Valentin Pistol $ Jun Pang $ John Tran # David Tarjan # Alvin R. Lebeck $ $ Duke CS # NVIDIA Explosive Internet Growth

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

Programming Heterogeneous Embedded Systems for IoT

Programming Heterogeneous Embedded Systems for IoT Programming Heterogeneous Embedded Systems for IoT Jeronimo Castrillon Chair for Compiler Construction TU Dresden jeronimo.castrillon@tu-dresden.de Get-together toward a sustainable collaboration in IoT

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

3D TECHNOLOGIES: SOME PERSPECTIVES FOR MEMORY INTERCONNECT AND CONTROLLER

3D TECHNOLOGIES: SOME PERSPECTIVES FOR MEMORY INTERCONNECT AND CONTROLLER 3D TECHNOLOGIES: SOME PERSPECTIVES FOR MEMORY INTERCONNECT AND CONTROLLER CODES+ISSS: Special session on memory controllers Taipei, October 10 th 2011 Denis Dutoit, Fabien Clermidy, Pascal Vivet {denis.dutoit@cea.fr}

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

Multicore SoC is coming. Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems. Source: 2007 ISSCC and IDF.

Multicore SoC is coming. Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems. Source: 2007 ISSCC and IDF. Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems Liang-Gee Chen Distinguished Professor General Director, SOC Center National Taiwan University DSP/IC Design Lab, GIEE, NTU 1

More information

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps

A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps A Case for Core-Assisted Bottleneck Acceleration in GPUs Enabling Flexible Data Compression with Assist Warps Nandita Vijaykumar Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarangnirun,

More information

Power 7. Dan Christiani Kyle Wieschowski

Power 7. Dan Christiani Kyle Wieschowski Power 7 Dan Christiani Kyle Wieschowski History 1980-2000 1980 RISC Prototype 1990 POWER1 (Performance Optimization With Enhanced RISC) (1 um) 1993 IBM launches 66MHz POWER2 (.35 um) 1997 POWER2 Super

More information

Introduction to GPU computing

Introduction to GPU computing Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU

More information

SoC Communication Complexity Problem

SoC Communication Complexity Problem When is the use of a Most Effective and Why MPSoC, June 2007 K. Charles Janac, Chairman, President and CEO SoC Communication Complexity Problem Arbitration problem in an SoC with 30 initiators: Hierarchical

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Leveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD

Leveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD Leveraging OpenSPARC ESA Round Table 2006 on Next Generation Microprocessors for Space Applications G.Furano, L.Messina TEC- OpenSPARC T1 The T1 is a new-from-the-ground-up SPARC microprocessor implementation

More information

Power Technology For a Smarter Future

Power Technology For a Smarter Future 2011 IBM Power Systems Technical University October 10-14 Fontainebleau Miami Beach Miami, FL IBM Power Technology For a Smarter Future Jeffrey Stuecheli Power Processor Development Copyright IBM Corporation

More information

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor David Johnson Systems Technology Division Hewlett-Packard Company Presentation Overview PA-8500 Overview uction Fetch Capabilities

More information

ESE532: System-on-a-Chip Architecture. Today. Programmable SoC. Message. Process. Reminder

ESE532: System-on-a-Chip Architecture. Today. Programmable SoC. Message. Process. Reminder ESE532: System-on-a-Chip Architecture Day 5: September 18, 2017 Dataflow Process Model Today Dataflow Process Model Motivation Issues Abstraction Basic Approach Dataflow variants Motivations/demands for

More information

Configurable Processors for SOC Design. Contents crafted by Technology Evangelist Steve Leibson Tensilica, Inc.

Configurable Processors for SOC Design. Contents crafted by Technology Evangelist Steve Leibson Tensilica, Inc. Configurable s for SOC Design Contents crafted by Technology Evangelist Steve Leibson Tensilica, Inc. Why Listen to This Presentation? Understand how SOC design techniques, now nearly 20 years old, are

More information

Resource Efficiency of Scalable Processor Architectures for SDR-based Applications

Resource Efficiency of Scalable Processor Architectures for SDR-based Applications Resource Efficiency of Scalable Processor Architectures for SDR-based Applications Thorsten Jungeblut 1, Johannes Ax 2, Gregor Sievers 2, Boris Hübener 2, Mario Porrmann 2, Ulrich Rückert 1 1 Cognitive

More information

Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. G. Fettweis The Next Big Waves In The Information Technology Roadmap

Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. G. Fettweis The Next Big Waves In The Information Technology Roadmap Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. G. Fettweis The Next Big Waves In The Information Technology Roadmap Gerhard Fettweis Green Visiting Professor UBC Vodafone Chair Professor

More information

Join Processing for Flash SSDs: Remembering Past Lessons

Join Processing for Flash SSDs: Remembering Past Lessons Join Processing for Flash SSDs: Remembering Past Lessons Jaeyoung Do, Jignesh M. Patel Department of Computer Sciences University of Wisconsin-Madison $/MB GB Flash Solid State Drives (SSDs) Benefits of

More information

Reconfigurable Cell Array for DSP Applications

Reconfigurable Cell Array for DSP Applications Outline econfigurable Cell Array for DSP Applications Chenxin Zhang Department of Electrical and Information Technology Lund University, Sweden econfigurable computing Coarse-grained reconfigurable cell

More information

On mapping to multi/manycores

On mapping to multi/manycores On mapping to multi/manycores Jeronimo Castrillon Chair for Compiler Construction (CCC) TU Dresden, Germany MULTIPROG HiPEAC Conference Stockholm, 24.01.2017 Mapping for dataflow programming models MEM

More information

Overcoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics

Overcoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics Overcoming the Memory System Challenge in Dataflow Processing Darren Jones, Wave Computing Drew Wingard, Sonics Current Technology Limits Deep Learning Performance Deep Learning Dataflow Graph Existing

More information

Numerical Simulation on the GPU

Numerical Simulation on the GPU Numerical Simulation on the GPU Roadmap Part 1: GPU architecture and programming concepts Part 2: An introduction to GPU programming using CUDA Part 3: Numerical simulation techniques (grid and particle

More information

Co-synthesis and Accelerator based Embedded System Design

Co-synthesis and Accelerator based Embedded System Design Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer

More information

A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing

A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing Dean Truong, Wayne Cheng, Tinoosh Mohsenin, Zhiyi Yu, Toney Jacobson, Gouri Landge, Michael Meeuwsen, Christine

More information

Computer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal

Computer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal Computer Architectures for Deep Learning Ethan Dell and Daniyal Iqbal Agenda Introduction to Deep Learning Challenges Architectural Solutions Hardware Architectures CPUs GPUs Accelerators FPGAs SOCs ASICs

More information

Instruction Set Architecture Extensions for a Dynamic Task Scheduling Unit

Instruction Set Architecture Extensions for a Dynamic Task Scheduling Unit Instruction Set Architecture Extensions for a Dynamic Task Scheduling Unit Oliver Arnold, Benedikt Noethen, and Gerhard Fettweis Vodafone Chair Mobile Communications Systems Dresden University of Technology

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Compiling for deeply embedded and heterogeneous signal processing systems

Compiling for deeply embedded and heterogeneous signal processing systems Compiling for deeply embedded and heterogeneous signal processing systems Jeronimo Castrillon Cfaed Chair for Compiler Construction (CCC) 5G Summit, Dresden, Germany September 29, 2016 Multi-Processor/core

More information

Future of Interconnect Fabric A Contrarian View. Shekhar Borkar June 13, 2010 Intel Corp. 1

Future of Interconnect Fabric A Contrarian View. Shekhar Borkar June 13, 2010 Intel Corp. 1 Future of Interconnect Fabric A ontrarian View Shekhar Borkar June 13, 2010 Intel orp. 1 Outline Evolution of interconnect fabric On die network challenges Some simple contrarian proposals Evaluation and

More information

Low-power Architecture. By: Jonathan Herbst Scott Duntley

Low-power Architecture. By: Jonathan Herbst Scott Duntley Low-power Architecture By: Jonathan Herbst Scott Duntley Why low power? Has become necessary with new-age demands: o Increasing design complexity o Demands of and for portable equipment Communication Media

More information

소프트웨어기반고성능침입탐지시스템설계및구현

소프트웨어기반고성능침입탐지시스템설계및구현 소프트웨어기반고성능침입탐지시스템설계및구현 KyoungSoo Park Department of Electrical Engineering, KAIST M. Asim Jamshed *, Jihyung Lee*, Sangwoo Moon*, Insu Yun *, Deokjin Kim, Sungryoul Lee, Yung Yi* Department of Electrical

More information

IWES st Italian Workshop on Embedded Systems Pisa September 2016

IWES st Italian Workshop on Embedded Systems Pisa September 2016 IWES 2016 1st Italian Workshop on Embedded Systems Pisa -- 19 September 2016 Research Group Overview Roberto Giorgi University of Siena, Italy http://www.dii.unisi.it/~giorgi Siena on Earth 2 Engineering

More information

HW/SW-Codesign Lab. Seminar 2 WS 2016/2017. chair. Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. Dr. h.c. G.

HW/SW-Codesign Lab. Seminar 2 WS 2016/2017. chair. Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. Dr. h.c. G. Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. Dr. h.c. G. Fettweis HW/SW-Codesign Lab Seminar WS / TU Dresden, Slide CORE FEATURES TU Dresden HW/SW-Codesign Lab Slide corelx_hwswcd Xtensa

More information

HW/SW Co-Design Lab. Seminar 2 WS 2018/2019. chair. Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. Dr. h.c. G.

HW/SW Co-Design Lab. Seminar 2 WS 2018/2019. chair. Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. Dr. h.c. G. Vodafone Chair Mobile Communications Systems, Prof. Dr.-Ing. Dr. h.c. G. Fettweis HW/SW Co-Design Lab Seminar WS 8/9 TU Dresden, Slide CORE FEATURES Slide corelx_hwswcd Xtensa LX ALU -bit MUL Load/Store

More information

Grassroots ASPLOS. can we still rethink the hardware/software interface in processors? Raphael kena Poss University of Amsterdam, the Netherlands

Grassroots ASPLOS. can we still rethink the hardware/software interface in processors? Raphael kena Poss University of Amsterdam, the Netherlands Grassroots ASPLOS can we still rethink the hardware/software interface in processors? Raphael kena Poss University of Amsterdam, the Netherlands ASPLOS-17 Doctoral Workshop London, March 4th, 2012 1 Current

More information

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley Computer Systems Architecture I CSE 560M Lecture 19 Prof. Patrick Crowley Plan for Today Announcement No lecture next Wednesday (Thanksgiving holiday) Take Home Final Exam Available Dec 7 Due via email

More information

Interfacing a High Speed Crypto Accelerator to an Embedded CPU

Interfacing a High Speed Crypto Accelerator to an Embedded CPU Interfacing a High Speed Crypto Accelerator to an Embedded CPU Alireza Hodjat ahodjat @ee.ucla.edu Electrical Engineering Department University of California, Los Angeles Ingrid Verbauwhede ingrid @ee.ucla.edu

More information

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection

An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection An Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, and Takashi Miyamori Toshiba Corporation, Kawasaki, Japan Copyright 2013,

More information

EE 457 Unit 7b. Main Memory Organization

EE 457 Unit 7b. Main Memory Organization 1 EE 457 Unit 7b Main Memory Organization 2 Motivation Organize main memory to Facilitate byte-addressability while maintaining Efficient fetching of the words in a cache block Low order interleaving (L.O.I)

More information

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary

More information

45-year CPU Evolution: 1 Law -2 Equations

45-year CPU Evolution: 1 Law -2 Equations 4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there

More information

Orchestration: Turning material breakthroughs into application performance

Orchestration: Turning material breakthroughs into application performance Orchestration: Turning material breakthroughs into application performance Jeronimo Castrillon (on behalf of the Orchestration team) TU Dresden Cfaed Chair for Compiler Construction jeronimo.castrillon@tu-dresden.de

More information

Basics DRAM ORGANIZATION. Storage element (capacitor) Data In/Out Buffers. Word Line. Bit Line. Switching element HIGH-SPEED MEMORY SYSTEMS

Basics DRAM ORGANIZATION. Storage element (capacitor) Data In/Out Buffers. Word Line. Bit Line. Switching element HIGH-SPEED MEMORY SYSTEMS Basics DRAM ORGANIZATION DRAM Word Line Bit Line Storage element (capacitor) In/Out Buffers Decoder Sense Amps... Bit Lines... Switching element Decoder... Word Lines... Memory Array Page 1 Basics BUS

More information

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation Mainstream Computer System Components CPU Core 2 GHz - 3.0 GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation One core or multi-core (2-4) per chip Multiple FP, integer

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group

Aim High. Intel Technical Update Teratec 07 Symposium. June 20, Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Aim High Intel Technical Update Teratec 07 Symposium June 20, 2007 Stephen R. Wheat, Ph.D. Director, HPC Digital Enterprise Group Risk Factors Today s s presentations contain forward-looking statements.

More information

Data Storage and Query Answering. Data Storage and Disk Structure (2)

Data Storage and Query Answering. Data Storage and Disk Structure (2) Data Storage and Query Answering Data Storage and Disk Structure (2) Review: The Memory Hierarchy Swapping, Main-memory DBMS s Tertiary Storage: Tape, Network Backup 3,200 MB/s (DDR-SDRAM @200MHz) 6,400

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

Mainstream Computer System Components

Mainstream Computer System Components Mainstream Computer System Components Double Date Rate (DDR) SDRAM One channel = 8 bytes = 64 bits wide Current DDR3 SDRAM Example: PC3-12800 (DDR3-1600) 200 MHz (internal base chip clock) 8-way interleaved

More information

HPC VT Machine-dependent Optimization

HPC VT Machine-dependent Optimization HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler

More information

Ten Reasons to Optimize a Processor

Ten Reasons to Optimize a Processor By Neil Robinson SoC designs today require application-specific logic that meets exacting design requirements, yet is flexible enough to adjust to evolving industry standards. Optimizing your processor

More information

Parallel graph traversal for FPGA

Parallel graph traversal for FPGA LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,

More information

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Éricles Sousa 1, Frank Hannig 1, Jürgen Teich 1, Qingqing Chen 2, and Ulf Schlichtmann

More information

Hardware Acceleration for Database Systems using Content Addressable Memories

Hardware Acceleration for Database Systems using Content Addressable Memories Hardware Acceleration for Database Systems using Content Addressable Memories Nagender Bandi, Sam Schneider, Divyakant Agrawal, Amr El Abbadi University of California, Santa Barbara Overview The Memory

More information

A 1-GHz Configurable Processor Core MeP-h1

A 1-GHz Configurable Processor Core MeP-h1 A 1-GHz Configurable Processor Core MeP-h1 Takashi Miyamori, Takanori Tamai, and Masato Uchiyama SoC Research & Development Center, TOSHIBA Corporation Outline Background Pipeline Structure Bus Interface

More information

Flash Memory Summit 2011

Flash Memory Summit 2011 1 Billion cores Memory Summit 2011 Session 302: Nonvolatile Design Challenges and Methodologies The Processor s role in maximizing performance and reducing energy consumption Neil Robinson Tensilica At

More information

The Connected IT world

The Connected IT world ReCoSoC July2013, Darmstadt The Connected IT world Infrastructural core The Cloud! Sensory swarm Mobile access Source: J. Rabaey 7 trillion wireless devices in 2017, mobile traffic increase 60%/year until

More information

ISSCC 2003 / SESSION 14 / MICROPROCESSORS / PAPER 14.5

ISSCC 2003 / SESSION 14 / MICROPROCESSORS / PAPER 14.5 ISSCC 2003 / SESSION 14 / MICROPROCESSORS / PAPER 14.5 14.5 A 600MHz Single-Chip Multiprocessor with 4.8GB/s Internal Shared Pipelined Bus and 512kB Internal Memory Satoshi Kaneko, Katsunori Sawai, Norio

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Effective System Design with ARM System IP

Effective System Design with ARM System IP Effective System Design with ARM System IP Mentor Technical Forum 2009 Serge Poublan Product Marketing Manager ARM 1 Higher level of integration WiFi Platform OS Graphic 13 days standby Bluetooth MP3 Camera

More information

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. July 14) (June 2013) (June 2015)(Jan 2016)(June 2016) H/W Support : Conditional Execution Also known

More information

Design Space Exploration of Network Processor Architectures

Design Space Exploration of Network Processor Architectures Design Space Exploration of Network Processor Architectures ECE 697J December 3 rd, 2002 ECE 697J 1 Introduction Network processor architectures have many choices Number of processors Size of memory Type

More information

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 3: Caching (1) Welcome!

/INFOMOV/ Optimization & Vectorization. J. Bikker - Sep-Nov Lecture 3: Caching (1) Welcome! /INFOMOV/ Optimization & Vectorization J. Bikker - Sep-Nov 2015 - Lecture 3: Caching (1) Welcome! Today s Agenda: The Problem with Memory Cache Architectures Practical Assignment 1 INFOMOV Lecture 3 Caching

More information