A Hybrid Implementation of Hamming Weight

Enric Morancho
Computer Architecture Department, Universitat Politècnica de Catalunya, BarcelonaTech, Barcelona, Spain
22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, Torino, Italy, Feb. 12th-14th, 2014

Outline
1. Introduction
2. Algorithms for computing hamming weight
3. Evaluation of existing implementations
4. Proposed hybrid implementation
5. Conclusion and future work

Introduction: What is hamming weight?
- The hamming weight of a bitstring is the number of bits set to one in the bitstring.
- Hamming weight is also known as population count, sideways addition or bit counting.
- Applications: cryptography, chemical informatics, information theory.
- Bitstring lengths range up to several thousands of bits.

Introduction: Algorithms for computing hamming weight
- Several algorithms have been proposed: naïve, memoization, parallel reduction, merged parallel reduction, bitslicing, ...
- Some algorithms admit both scalar and vector implementations.
- However, the existing implementations expose either scalar parallelism or vector parallelism, not both.
- This work proposes a hybrid scalar-vector implementation:
  - It exposes both kinds of parallelism simultaneously.
  - It is useful on platforms that can exploit both kinds of parallelism simultaneously.

Existing algorithms: Naïve
- Iterates through the bits of the bitstring and accumulates each bit value.
- Can be specialized to deal with sparse/dense bitstrings.
- Performs poorly because it does not exploit parallelism.

    uint8_t hw_naive(uint32_t w) {
        uint8_t i, cnt = 0;
        /* Add the least significant bit, then shift the next bit into place. */
        for (i = 0; i < 32; i++, w = w >> 1)
            cnt += w & 0x1;
        return (cnt);
    }
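The slide does not say how the sparse specialization is written. One well-known way, shown here as an assumption rather than as the author's code, is to clear the lowest set bit on each iteration so that the loop runs once per set bit instead of once per bit position. The name `hw_sparse` is hypothetical.

```c
#include <stdint.h>

/* Sparse-oriented sketch: the loop body runs once per bit set to one.
   w & (w - 1) clears the lowest set bit of w. */
uint8_t hw_sparse(uint32_t w)
{
    uint8_t cnt = 0;
    while (w != 0) {
        w &= w - 1;
        cnt++;
    }
    return cnt;
}
```

A dense-oriented variant could apply the same loop to `~w` and subtract the result from 32.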

Existing algorithms: Memoization
- Steps:
  - Define a subword size (e.g. 8 bits).
  - Precompute the hamming weight of all possible subwords.
  - Look up the precomputation table for each subword of the bitstring and accumulate the results.
- Admits both scalar and vector implementations.
- Exposes more parallelism than the naïve implementation.

    uint8_t T8[256] = {0, 1, 1, 2, ..., 7, 8};

    uint8_t hw_memoization8(uint32_t w) {
        /* One table lookup per byte of the 32-bit word. */
        return (T8[w & 0xFF] + T8[(w >> 8) & 0xFF] +
                T8[(w >> 16) & 0xFF] + T8[w >> 24]);
    }
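The table `T8` is shown with its middle entries elided. As a minimal sketch of the precomputation step, the table can be filled at start-up with the recurrence hw(i) = hw(i >> 1) + (i & 1); the helper name `init_T8` is an assumption, not part of the original slides.

```c
#include <stdint.h>

static uint8_t T8[256];

/* Fills T8 so that T8[i] is the hamming weight of the byte value i.
   Relies on hw(i) = hw(i >> 1) + (i & 1), with T8[0] = 0. */
void init_T8(void)
{
    T8[0] = 0;
    for (int i = 1; i < 256; i++)
        T8[i] = (uint8_t)(T8[i >> 1] + (i & 1));
}

/* Same lookup scheme as the slide: one table access per byte. */
uint8_t hw_memoization8(uint32_t w)
{
    return (uint8_t)(T8[w & 0xFF] + T8[(w >> 8) & 0xFF] +
                     T8[(w >> 16) & 0xFF] + T8[w >> 24]);
}
```

`init_T8()` must run once before the first call to `hw_memoization8`.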

Existing algorithms: Parallel reduction at bit level
- Tree reduction of the input word in log2(bits_per_word) levels (5 levels for a 32-bit word).
- [Figure: the input word followed by the successive parallel-reduction levels]
- Admits both scalar and vector implementations.

    uint32_t hw_parallel(uint32_t w) {
        w = (w & 0x55555555) + ((w >>  1) & 0x55555555); /* L1 */
        w = (w & 0x33333333) + ((w >>  2) & 0x33333333); /* L2 */
        w = (w & 0x0F0F0F0F) + ((w >>  4) & 0x0F0F0F0F); /* L3 */
        w = (w & 0x00FF00FF) + ((w >>  8) & 0x00FF00FF); /* L4 */
        w = (w & 0x0000FFFF) + ((w >> 16) & 0x0000FFFF); /* L5 */
        return (w);
    }
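The evaluation later in the talk uses parallel reduction over 64-bit words. A minimal sketch of that widening, assuming the same scheme with one extra level (log2(64) = 6 levels in total), could look as follows; the name `hw_parallel64` is hypothetical.

```c
#include <stdint.h>

/* Parallel reduction over a 64-bit word: at each level, adjacent
   fields are added pairwise, doubling the field width. */
uint64_t hw_parallel64(uint64_t w)
{
    w = (w & 0x5555555555555555ULL) + ((w >>  1) & 0x5555555555555555ULL);
    w = (w & 0x3333333333333333ULL) + ((w >>  2) & 0x3333333333333333ULL);
    w = (w & 0x0F0F0F0F0F0F0F0FULL) + ((w >>  4) & 0x0F0F0F0F0F0F0F0FULL);
    w = (w & 0x00FF00FF00FF00FFULL) + ((w >>  8) & 0x00FF00FF00FF00FFULL);
    w = (w & 0x0000FFFF0000FFFFULL) + ((w >> 16) & 0x0000FFFF0000FFFFULL);
    w = (w & 0x00000000FFFFFFFFULL) + ((w >> 32) & 0x00000000FFFFFFFFULL);
    return w;
}
```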

Existing algorithms: Merged parallel reduction (or tree merging)
- Deals with bitstrings larger than a word.
- Merges the intermediate results of several parallel reductions and keeps processing just the combined result.
- The degree of merging is limited by the widths of the accumulators.
- Admits both scalar and vector implementations.
- Example: merged parallel reduction of 3 words (wa, wb, wc)

    wa = (wa & 0x55555555) + ((wa >> 1) & 0x55555555); /* L1 */
    wb = (wb & 0x55555555) + ((wb >> 1) & 0x55555555);
    wa = wa + ( wc       & 0x55555555);
    wb = wb + ((wc >> 1) & 0x55555555);
    wa = (wa & 0x33333333) + ((wa >> 2) & 0x33333333); /* L2 */
    wb = (wb & 0x33333333) + ((wb >> 2) & 0x33333333);
    wa = wa + wb;
    wa = (wa & 0x0F0F0F0F) + ((wa >> 4) & 0x0F0F0F0F); /* L3 */
    ...
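The three-word example stops at level 3. The sketch below completes it into a function, assuming the remaining levels follow the single-word parallel reduction; the function name `hw_merged3` and the range comments are additions for illustration. The comments track the largest value each field can hold, which is what limits the degree of merging.

```c
#include <stdint.h>

/* Merged parallel reduction of three 32-bit words. */
uint32_t hw_merged3(uint32_t wa, uint32_t wb, uint32_t wc)
{
    /* Level 1: each 2-bit field holds a count in 0..2. */
    wa = (wa & 0x55555555) + ((wa >> 1) & 0x55555555);
    wb = (wb & 0x55555555) + ((wb >> 1) & 0x55555555);
    /* Merge the bits of wc: fields now hold 0..3, still within 2 bits. */
    wa += wc & 0x55555555;
    wb += (wc >> 1) & 0x55555555;
    /* Level 2: each 4-bit field holds 0..6. */
    wa = (wa & 0x33333333) + ((wa >> 2) & 0x33333333);
    wb = (wb & 0x33333333) + ((wb >> 2) & 0x33333333);
    /* Merge both partial results: fields hold 0..12, within 4 bits. */
    wa += wb;
    /* Levels 3 to 5: reduce the single combined word. */
    wa = (wa & 0x0F0F0F0F) + ((wa >> 4) & 0x0F0F0F0F);
    wa = (wa & 0x00FF00FF) + ((wa >> 8) & 0x00FF00FF);
    wa = (wa & 0x0000FFFF) + ((wa >> 16) & 0x0000FFFF);
    return wa;
}
```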

Existing algorithms: Bitslicing
- Transforms a (2^n - 1)-word bitstring into n words s_0, ..., s_{n-1} from which the hamming weight of the original bitstring can be recovered.
- The implementation relies on the parallel emulation of bits_per_word bit adders by using bit-wise logical instructions.
- Admits both scalar and vector implementations.

$$\sum_{i=0}^{2^n-2} hw(w_i) = \sum_{j=0}^{n-1} 2^j \, hw(s_j)$$
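As a minimal sketch for n = 3 (7 input words reduced to 3 words), the bit adders can be emulated with carry-save adders built from AND, OR and XOR. The helper names `csa`, `hw64` and `hw_slice7` are hypothetical, and the way the adders are chained is one possible arrangement, not necessarily the one used in the evaluated implementation.

```c
#include <stdint.h>

/* Carry-save adder: adds three bits in every bit position at once.
   For each position, a + b + c = 2 * hi + lo. */
static void csa(uint64_t *hi, uint64_t *lo,
                uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t u = a ^ b;
    *lo = u ^ c;
    *hi = (a & b) | (u & c);
}

/* Portable single-word hamming weight (one iteration per set bit). */
static unsigned hw64(uint64_t w)
{
    unsigned cnt = 0;
    while (w != 0) {
        w &= w - 1;
        cnt++;
    }
    return cnt;
}

/* Bitslicing of 7 words into s0, s1, s2, so that
   hw(w[0]) + ... + hw(w[6]) = hw(s0) + 2*hw(s1) + 4*hw(s2). */
unsigned hw_slice7(const uint64_t w[7])
{
    uint64_t t1h, t1l, t2h, t2l, t3h, s0, s1, s2;
    csa(&t1h, &t1l, w[0], w[1], w[2]);
    csa(&t2h, &t2l, w[3], w[4], w[5]);
    csa(&t3h, &s0, t1l, t2l, w[6]);   /* weight-1 bits */
    csa(&s2, &s1, t1h, t2h, t3h);     /* weight-2 carries */
    return hw64(s0) + 2 * hw64(s1) + 4 * hw64(s2);
}
```

Only three single-word counts are needed for seven input words, at the cost of four carry-save adders.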

Existing algorithms: Processor support
- Some processors offer a machine instruction to compute the hamming weight of a machine word.
- For instance: Mark II (1954), IBM Stretch (1961), CDC 6600 (1964), Cray 1 (1976), Sun SPARCv9 (1995), Alpha 21264A (1999), IBM Power5 (2004) and ARM Cortex-A8 (2005).
- Since 2007, x86 processors supporting SSE4.2 offer the popcnt instruction, which computes the hamming weight of a scalar 32-bit or 64-bit register.
- Table: latency (cycles) and dispatch rate (inst/cycle) of popcnt on AMD 15h (32-bit and 64-bit operands) and on Intel Nehalem and Sandy Bridge/Haswell (32/64-bit operands). [table values missing]
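From C, the instruction is usually reached through a compiler builtin or intrinsic. The sketch below assumes GCC or Clang, whose `__builtin_popcountll` builtin is compiled to `popcnt` when the target enables it (for example with `-mpopcnt` or `-msse4.2`) and to a software fallback otherwise. The function name `hw_popcnt` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hamming weight of a bitstring stored as n consecutive 64-bit words,
   one population count per word (GCC/Clang builtin, an assumption). */
uint64_t hw_popcnt(const uint64_t *words, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)__builtin_popcountll(words[i]);
    return total;
}
```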

Evaluation of existing implementations: Evaluation environment
- Our benchmark consists of computing the hamming weight of several randomly initialized bitstrings.
- Bitstring words are located in consecutive memory locations.
- We evaluate two scenarios: uncached and cached.

Evaluation of existing implementations: Evaluation environment

                          Intel Core i5-650             Intel Xeon E5-2630L
Microarchitecture         Nehalem                       Sandy Bridge
Frequency (max turbo)     3.2 (3.46) GHz                2 (2.5) GHz
Cores                     2                             6
Reorder buffer entries    128 µ-ops                     168 µ-ops
Scheduler entries         36 µ-ops                      54 µ-ops
Peak dispatch rate        6 µ-ops/cycle                 6 µ-ops/cycle
DL1 size and assoc.       32 KB, 8-way, 64-byte lines   32 KB, 8-way, 64-byte lines
DL1 bandwidth             128 bits/cycle                256 bits/cycle
DL1 in-flight loads       [value missing]               [value missing]
DL1 simult. misses        10                            10
L2                        256 KB, 8-way, 64-byte lines  256 KB, 8-way, 64-byte lines
L3                        4 MB, 16-way, 64-byte lines   15 MB, 20-way, 64-byte lines

Evaluation of existing implementations: Evaluated implementations

Single-word wide implementations
  Naïve      hw_naive implementation
  Mem-8      Memoization, 2^8-entry lookup table
  Mem-16     Memoization, 2^16-entry lookup table
  Par.Red.   Parallel reduction at bit level over 64-bit words
  SSE4.2     Uses the 64-bit scalar instruction popcnt

Multi-word wide implementations
  Merged     Scalar merged par. red. on [...]-bit words at level 3
  Merged-V   Vector merged par. red. on [...]-bit words at level 3 (SSE2)
  Slice      Scalar bit slicing on 7 64-bit words
  Slice-V    Vector bit slicing on [...]-bit words (SSE2)
  Mem-4      Vector memoization, 2^4-entry lookup table (SSSE3)

Evaluation of existing implementations: Results on Nehalem platform, single-word wide, cached [figure]

Evaluation of existing implementations: Results on Nehalem platform, cached [figure]

Evaluation of existing implementations: Results on Sandy Bridge platform, cached [figure]

Evaluation of existing implementations: Results
- SSE4.2 performs best.
- Multi-word wide implementations outperform single-word implementations (except SSE4.2).
- Vector implementations outperform scalar implementations of the same algorithm.

Evaluation of existing implementations: Conclusions
- Although the scalar SSE4.2 implementation performs best...
  - The dispatch rate of the popcnt instruction is just 1 inst./cycle, that is, SSE4.2's peak performance is 8 bytes/cycle.
  - But the DL1 bandwidth is 16 bytes/cycle (Nehalem) and 32 bytes/cycle (Sandy Bridge).
  - The SSE4.2 implementation is fully scalar and cannot exploit the unused dispatch ports to dispatch vector instructions.
- We wonder whether the SSE4.2 implementation may be outperformed by a hybrid implementation that makes use of both vector and scalar instructions.
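The gap between the popcnt peak and the DL1 bandwidth can be made explicit with the figures quoted above. The percentages are plain arithmetic on those figures, not measurements from the talk.

```latex
\begin{align*}
\text{SSE4.2 peak} &= 1\ \tfrac{\text{popcnt}}{\text{cycle}} \times 8\ \tfrac{\text{bytes}}{\text{popcnt}} = 8\ \text{bytes/cycle} \\
\text{Nehalem:}\quad & \tfrac{8\ \text{bytes/cycle}}{16\ \text{bytes/cycle}} = 50\%\ \text{of the DL1 bandwidth} \\
\text{Sandy Bridge:}\quad & \tfrac{8\ \text{bytes/cycle}}{32\ \text{bytes/cycle}} = 25\%\ \text{of the DL1 bandwidth}
\end{align*}
```

At best, a popcnt-only loop therefore consumes half (Nehalem) or a quarter (Sandy Bridge) of what the first-level data cache can deliver.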

Proposed hybrid implementation: Design
- Main idea: combine the SSE4.2 (scalar) and Mem-4 (vector) implementations into a hybrid implementation that distributes the bitstring words between the scalar and the vector functional units.
- Steps:
  - Iterate through the bitstring; each loop iteration processes a fixed-size chunk.
  - Statically distribute the chunk bytes between the scalar and vector functional units.
- Design-space dimensions:
  - S: number of chunk bytes processed by the scalar units.
  - V: number of chunk bytes processed by the vector units.
- Design-space exploration: configurations (S,V) with chunk length up to 80 bytes: (16,16), (32,16), (16,32), (48,16), (32,32), (16,48), (64,16), (48,32), (32,48) and (16,64).
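The static (S,V) split can be sketched in portable C as below, here for the (32,32) configuration. This is an illustration of the chunk distribution only, not the author's implementation: the scalar path assumes the GCC/Clang builtin `__builtin_popcountll`, and the vector Mem-4 path is emulated by a scalar nibble lookup so that the sketch compiles anywhere, whereas the evaluated Mem-4 uses SSSE3 vector instructions. All identifiers (`S_BYTES`, `V_BYTES`, `T4`, `hw_hybrid`, ...) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { S_BYTES = 32, V_BYTES = 32, CHUNK = S_BYTES + V_BYTES };

/* 2^4-entry lookup table of the Mem-4 path: hamming weight of a nibble. */
static const uint8_t T4[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                               1, 2, 2, 3, 2, 3, 3, 4};

/* Scalar share of a chunk: one population count per 64-bit word. */
static uint64_t hw_scalar_part(const uint8_t *p)
{
    uint64_t total = 0;
    for (size_t i = 0; i < S_BYTES; i += 8) {
        uint64_t w;
        memcpy(&w, p + i, sizeof w);   /* alignment-safe load */
        total += (uint64_t)__builtin_popcountll(w);
    }
    return total;
}

/* Vector share of a chunk. Scalar stand-in for Mem-4: two nibble
   lookups per byte, where a vector version would look up many
   nibbles per instruction. */
static uint64_t hw_vector_part(const uint8_t *p)
{
    uint64_t total = 0;
    for (size_t i = 0; i < V_BYTES; i++)
        total += (uint64_t)(T4[p[i] & 0x0F] + T4[p[i] >> 4]);
    return total;
}

/* Hamming weight of n bytes; n is assumed to be a multiple of CHUNK
   (trailing bytes that do not fill a chunk are ignored). */
uint64_t hw_hybrid(const uint8_t *bits, size_t n)
{
    uint64_t total = 0;
    for (size_t off = 0; off + CHUNK <= n; off += CHUNK) {
        total += hw_scalar_part(bits + off);            /* first S bytes */
        total += hw_vector_part(bits + off + S_BYTES);  /* next V bytes */
    }
    return total;
}
```

Because the two parts of each iteration are independent, an out-of-order core can overlap them, which is the effect the hybrid design aims for.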

Design-space exploration: Nehalem platform, uncached [figure]

Design-space exploration: Nehalem platform, cached [figure]

Design-space exploration: Sandy Bridge platform, uncached [figure]

Design-space exploration: Sandy Bridge platform, cached [figure]

Design-space exploration: Conclusions
- Some hybrid configurations outperform SSE4.2.
- The performance potential is bigger on Sandy Bridge than on Nehalem.
- The best hybrid configuration depends on the bitstring length.
- However, we pick only one configuration for each platform: (32,32) on Nehalem and (32,48) on Sandy Bridge.

Results: Sandy Bridge platform, uncached [figure]

Results: Sandy Bridge platform, cached [figure]

Results: Sandy Bridge platform
- Table: speedup of the (32,48) hybrid configuration with respect to SSE4.2, for the uncached and cached scenarios, by bitstring length (up to DL1, up to L2, up to L3, >L3). [table values missing]

Conclusions
- Processors can exploit both scalar and vector parallelism, but applications usually expose only one kind of parallelism, so some processor resources are not fully exploited.
- Applications that admit both scalar and vector implementations may benefit from a hybrid implementation that exposes both kinds of parallelism simultaneously.
- Case study: hamming weight. The (32,48) hybrid configuration outperforms the best implementation of hamming weight known to us by up to 1.22X on the Sandy Bridge platform.

Future work
- Evaluating this technique on newer platforms (e.g. Haswell).
  - AVX2: vector integer instructions, 256-bit vector registers.
- Applying this technique to other problems.

35 A Hybrid Implementation of Hamming Weight Enric Morancho Computer Architecture Department Universitat Politècnica de Catalunya, BarcelonaTech Barcelona, Spain 22 nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing Torino, Italy, Feb. 12 nd 14 th, 2014 Enric Morancho (UPC) A Hybrid Implementation of Hamming Weight Feb / 35

A Hybrid Implementation of Hamming Weight

A Hybrid Implementation of Hamming Weight 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing A Hybrid Implementation of Hamming Weight Enric Morancho Departament d Arquitectura de Computadors Universitat

More information

Dan Stafford, Justine Bonnot

Dan Stafford, Justine Bonnot Dan Stafford, Justine Bonnot Background Applications Timeline MMX 3DNow! Streaming SIMD Extension SSE SSE2 SSE3 and SSSE3 SSE4 Advanced Vector Extension AVX AVX2 AVX-512 Compiling with x86 Vector Processing

More information

Two-Level Address Storage and Address Prediction

Two-Level Address Storage and Address Prediction Two-Level Address Storage and Address Prediction Enric Morancho, José María Llabería and Àngel Olivé Computer Architecture Department - Universitat Politècnica de Catalunya (Spain) 1 Abstract. : The amount

More information

45-year CPU Evolution: 1 Law -2 Equations

45-year CPU Evolution: 1 Law -2 Equations 4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there

More information

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture

( ZIH ) Center for Information Services and High Performance Computing. Overvi ew over the x86 Processor Architecture ( ZIH ) Center for Information Services and High Performance Computing Overvi ew over the x86 Processor Architecture Daniel Molka Ulf Markwardt Daniel.Molka@tu-dresden.de ulf.markwardt@tu-dresden.de Outline

More information

The Mont-Blanc approach towards Exascale

The Mont-Blanc approach towards Exascale http://www.montblanc-project.eu The Mont-Blanc approach towards Exascale Alex Ramirez Barcelona Supercomputing Center Disclaimer: Not only I speak for myself... All references to unavailable products are

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Real Processors. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Real Processors Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel

More information

Intel Architecture for Software Developers

Intel Architecture for Software Developers Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

How to Write Fast Numerical Code

How to Write Fast Numerical Code How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Organization Temporal and spatial locality Memory

More information

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology Exploiting GPU Caches in Sparse Matrix Vector Multiplication Yusuke Nagasaka Tokyo Institute of Technology Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Value Compression for Efficient Computation

Value Compression for Efficient Computation Value Compression for Efficient Computation Ramon Canal 1, Antonio González 12 and James E. Smith 3 1 Dept of Computer Architecture, Universitat Politècnica de Catalunya Cr. Jordi Girona, 1-3, 08034 Barcelona,

More information

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain

More information

How to Write Fast Numerical Code Spring 2013 Lecture: Architecture/Microarchitecture and Intel Core

How to Write Fast Numerical Code Spring 2013 Lecture: Architecture/Microarchitecture and Intel Core How to Write Fast Numerical Code Spring 2013 Lecture: Architecture/Microarchitecture and Intel Core Instructor: Markus Püschel TA: Daniele Spampinato & Alen Stojanov Technicalities Research project: Let

More information

Parallel Programming

Parallel Programming Parallel Programming Architectures Pt.1 Prof. Paolo Bientinesi HPAC, RWTH Aachen pauldj@aices.rwth-aachen.de WS16/17 Outline 1 Uniprocessor Architecture Review 2 Data Dependencies Prof. Paolo Bientinesi

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

Submission instructions (read carefully): SS17 / Assignment 4 Instructor: Markus Püschel. ETH Zurich

Submission instructions (read carefully): SS17 / Assignment 4 Instructor: Markus Püschel. ETH Zurich 263-2300-00: How To Write Fast Numerical Code Assignment 4: 120 points Due Date: Th, April 13th, 17:00 http://www.inf.ethz.ch/personal/markusp/teaching/263-2300-eth-spring17/course.html Questions: fastcode@lists.inf.ethz.ch

More information

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012

CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012 CPU Architecture Overview Varun Sampath CIS 565 Spring 2012 Objectives Performance tricks of a modern CPU Pipelining Branch Prediction Superscalar Out-of-Order (OoO) Execution Memory Hierarchy Vector Operations

More information

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Agenda Intel Advisor for vectorization

More information

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)

EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

A Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines

A Key Theme of CIS 371: Parallelism. CIS 371 Computer Organization and Design. Readings. This Unit: (In-Order) Superscalar Pipelines A Key Theme of CIS 371: arallelism CIS 371 Computer Organization and Design Unit 10: Superscalar ipelines reviously: pipeline-level parallelism Work on execute of one instruction in parallel with decode

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Masterpraktikum Scientific Computing

Masterpraktikum Scientific Computing Masterpraktikum Scientific Computing High-Performance Computing Michael Bader Alexander Heinecke Technische Universität München, Germany Outline Logins Levels of Parallelism Single Processor Systems Von-Neumann-Principle

More information

Philippe Thierry Sr Staff Engineer Intel Corp.

Philippe Thierry Sr Staff Engineer Intel Corp. HPC@Intel Philippe Thierry Sr Staff Engineer Intel Corp. IBM, April 8, 2009 1 Agenda CPU update: roadmap, micro-μ and performance Solid State Disk Impact What s next Q & A Tick Tock Model Perenity market

More information

Implementing Lightweight Block Ciphers on x86 Architectures

Implementing Lightweight Block Ciphers on x86 Architectures Implementing Lightweight Block Ciphers on x86 Architectures Ryad Benadjila 1 Jian Guo 2 Victor Lomné 1 Thomas Peyrin 2 1 ANSSI, France 2 NTU, Singapore SAC, August 15, 2013 Talk Overview 1 Introduction

More information

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline

NOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Double-precision General Matrix Multiply (DGEMM)

Double-precision General Matrix Multiply (DGEMM) Double-precision General Matrix Multiply (DGEMM) Parallel Computation (CSE 0), Assignment Andrew Conegliano (A0) Matthias Springer (A00) GID G-- January, 0 0. Assumptions The following assumptions apply

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines

CS450/650 Notes Winter 2013 A Morton. Superscalar Pipelines CS450/650 Notes Winter 2013 A Morton Superscalar Pipelines 1 Scalar Pipeline Limitations (Shen + Lipasti 4.1) 1. Bounded Performance P = 1 T = IC CPI 1 cycletime = IPC frequency IC IPC = instructions per

More information

Kevin O Leary, Intel Technical Consulting Engineer

Kevin O Leary, Intel Technical Consulting Engineer Kevin O Leary, Intel Technical Consulting Engineer Moore s Law Is Going Strong Hardware performance continues to grow exponentially We think we can continue Moore's Law for at least another 10 years."

More information

XT Node Architecture

XT Node Architecture XT Node Architecture Let s Review: Dual Core v. Quad Core Core Dual Core 2.6Ghz clock frequency SSE SIMD FPU (2flops/cycle = 5.2GF peak) Cache Hierarchy L1 Dcache/Icache: 64k/core L2 D/I cache: 1M/core

More information

Improving Cache Performance

Improving Cache Performance Improving Cache Performance Computer Organization Architectures for Embedded Computing Tuesday 28 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition,

More information

Unit 8: Superscalar Pipelines

Unit 8: Superscalar Pipelines A Key Theme: arallelism reviously: pipeline-level parallelism Work on execute of one instruction in parallel with decode of next CIS 501: Computer Architecture Unit 8: Superscalar ipelines Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'ennsylvania'

More information

Cisco Ultra Packet Core High Performance AND Features. Aeneas Dodd-Noble, Principal Engineer Daniel Walton, Director of Engineering October 18, 2018

Cisco Ultra Packet Core High Performance AND Features. Aeneas Dodd-Noble, Principal Engineer Daniel Walton, Director of Engineering October 18, 2018 Cisco Ultra Packet Core High Performance AND Features Aeneas Dodd-Noble, Principal Engineer Daniel Walton, Director of Engineering October 18, 2018 The World s Top Networks Rely On Cisco Ultra 90+ 300M

More information

Processor (IV) - advanced ILP. Hwansoo Han

Processor (IV) - advanced ILP. Hwansoo Han Processor (IV) - advanced ILP Hwansoo Han Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline Less work per stage shorter clock cycle

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Lecture 14: Multithreading

Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw

More information
