Parallelized Progressive Network Coding with Hardware Acceleration

1 Parallelized Progressive Network Coding with Hardware Acceleration. Hassan Shojania, Baochun Li, Department of Electrical and Computer Engineering, University of Toronto.

2-5 Network coding: information is coded at potentially every node. [Figure: a seven-node example network in which source blocks a and b are forwarded along separate edges and an intermediate node transmits the coded block a+b instead of a or b alone.]

6 Randomized network coding [Ho et al. 2003]. [Figure: the original data "h e l l o" is divided into segments, passed through Encode, and recovered by Decode.] The scheme operates in GF(2^8); an encoded block is a random linear combination of the segments, e.g. 166·h + 191·e + 216·l + 109·l + 237·o. The coding coefficients are randomly generated.
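
The encoding step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: gf256_mul is a placeholder for any GF(2^8) multiplication routine, and rand() stands in for whatever random generator is actually used.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    uint8_t gf256_mul(uint8_t a, uint8_t b);   /* GF(2^8) multiply, defined elsewhere */

    /* Produce one coded block from n original segments of k bytes each.
     * coef[] receives the n random coefficients that must travel with the block. */
    void encode_block(const uint8_t *segments, int n, int k,
                      uint8_t *coef, uint8_t *out)
    {
        memset(out, 0, (size_t)k);
        for (int i = 0; i < n; i++) {
            coef[i] = (uint8_t)(rand() & 0xff);        /* random coefficient in GF(2^8) */
            for (int j = 0; j < k; j++)                /* out += coef[i] * segment i    */
                out[j] ^= gf256_mul(coef[i], segments[i * k + j]);
        }
    }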

7-10 To code or not to code on a network node?
- In content distribution applications [Gkantsidis et al. INFOCOM 2005]: simplify the protocols, improve resilience to network dynamics.
- In wireless networks [Katti et al. SIGCOMM 2006]: naturally leverage wireless broadcast channels, alleviate congested locality.
- Do the advantages come for free? High coding complexity [Wang and Li, IWQoS 2006], plus the communication overhead of carrying coefficients.

11 Our contributions

12-18 Coming (almost) free to a desktop near you:
- The first implementation of randomized network coding with a focus on high performance.
- Goal: alleviate the number one challenge of network coding, its complexity, using a combination of three tricks (to be discussed).
- Future use of network coding is then fully justified: trade processor power for coding benefits.

19 A 20x performance boost over a baseline implementation before this work. Example after optimization: 348 Mbps (64 blocks, 32 KB each) on a Power Mac G5 Quad (circa October 2005).

20 Progressive decoding as blocks arrive: b = C^(-1) x^T (recover the original blocks b from the received coded blocks x and the coefficient matrix C).

21-30 Progressive decoding: Gauss-Jordan elimination. Decoding time overlaps with the time needed to receive the blocks. [Figure: the augmented matrix [coding coefficients | data] is reduced step by step as each coded block arrives; once all blocks are in, the coefficient part becomes the identity and the data part holds the original segments "h e l l o".]
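
A sketch of the per-block step of such a progressive Gauss-Jordan decoder is given below. It is a reconstruction from the figure's description rather than the authors' code; gf256_mul, gf256_div and the Decoder layout are hypothetical names introduced here. Each incoming row [coefficients | data] is reduced against the rows received so far, normalized, and back-substituted, so the stored matrix stays in reduced form and decoding finishes as soon as the rank reaches n.

    #include <stdint.h>
    #include <string.h>

    uint8_t gf256_mul(uint8_t a, uint8_t b);   /* GF(2^8) multiply */
    uint8_t gf256_div(uint8_t a, uint8_t b);   /* a * b^(-1) in GF(2^8) */

    typedef struct {
        int n, k, rank;
        int *pivot;      /* pivot column of each stored row */
        uint8_t *coef;   /* n x n coefficients, row-major */
        uint8_t *data;   /* n x k data bytes, row-major */
    } Decoder;

    /* Reduce one incoming coded block (c = n coefficients, x = k data bytes).
     * Returns 1 if the block was innovative, 0 if it was linearly dependent. */
    int decode_block(Decoder *d, uint8_t *c, uint8_t *x)
    {
        int n = d->n, k = d->k;

        /* Eliminate the new row against every pivot row stored so far. */
        for (int r = 0; r < d->rank; r++) {
            uint8_t f = c[d->pivot[r]];
            if (f == 0) continue;
            for (int j = 0; j < n; j++) c[j] ^= gf256_mul(f, d->coef[r * n + j]);
            for (int j = 0; j < k; j++) x[j] ^= gf256_mul(f, d->data[r * k + j]);
        }

        /* Find this row's pivot; an all-zero coefficient row is not innovative. */
        int p = 0;
        while (p < n && c[p] == 0) p++;
        if (p == n) return 0;

        /* Normalize the pivot element to 1. */
        uint8_t piv = c[p];
        for (int j = 0; j < n; j++) c[j] = gf256_div(c[j], piv);
        for (int j = 0; j < k; j++) x[j] = gf256_div(x[j], piv);

        /* Back-substitute into the earlier rows so the matrix stays reduced. */
        for (int r = 0; r < d->rank; r++) {
            uint8_t f = d->coef[r * n + p];
            if (f == 0) continue;
            for (int j = 0; j < n; j++) d->coef[r * n + j] ^= gf256_mul(f, c[j]);
            for (int j = 0; j < k; j++) d->data[r * k + j] ^= gf256_mul(f, x[j]);
        }

        /* Store the new row; decoding completes when rank reaches n. */
        memcpy(&d->coef[d->rank * n], c, (size_t)n);
        memcpy(&d->data[d->rank * k], x, (size_t)k);
        d->pivot[d->rank++] = p;
        return 1;
    }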

31 Hardware acceleration using SIMD vector instructions (x86 SSE2 and PowerPC AltiVec): b = C^(-1) x^T.

32-36 Baseline implementation: the bottleneck.
- Multiplication in GF(2^8): exp[log[x] + log[y]] (sketched below).
- Requires 3 table lookups, i.e. expensive memory accesses.
- This is the basic building block, executed within tight nested loops.
- Impossible to accelerate with SIMD vector instructions.
- Changing to GF(2^16) does not help.
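
For reference, the baseline table-lookup multiply looks roughly like the sketch below; this is a reconstruction from the description above, not the authors' source. The log/exp tables over a generator of GF(2^8)'s multiplicative group are assumed to be precomputed, with the exp table doubled so the summed index never needs a "mod 255".

    #include <stdint.h>

    extern const uint8_t exp_tab[512];   /* exp_tab[i] = g^i, repeated past index 254 */
    extern const uint8_t log_tab[256];   /* log_tab[g^i] = i, log_tab[0] unused       */

    /* Baseline GF(2^8) multiply: three table lookups per byte pair. */
    static inline uint8_t gf256_mul_table(uint8_t x, uint8_t y)
    {
        if (x == 0 || y == 0)
            return 0;
        return exp_tab[log_tab[x] + log_tab[y]];   /* exp[log[x] + log[y]] */
    }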

37-41 Solution: multiply using a loop-based approach.

    result = 0;
    while (x != 0) {
        if ((x & 1) != 0)
            result = result ^ y;    /* accumulate y where the current bit of x is set */
        overflowing = y & 0x80;
        y = y << 1;                 /* double y in GF(2^8) ...                         */
        if (overflowing)
            y = y ^ 0x1d;           /* ... reducing by the field polynomial (0x11d)    */
        x = x >> 1;
    }

- Mainly bit shifting and XORs in a loop.
- Slower than table lookups without acceleration.
- But it can be accelerated with SIMD vector instructions to operate on 16 bytes concurrently in 128-bit registers (see the sketch after this list).
- Challenge: 16-byte alignment of memory allocations is OS-specific.
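
The following is a sketch of how the loop above maps onto SSE2, multiplying 16 data bytes by one coefficient per pass; it is an illustrative reconstruction based on the slide's description, not the authors' code (AltiVec would use the analogous vector operations). SSE2 has no per-byte shift, so the 16-bit shift is masked to keep the bytes independent. The alignment challenge mentioned above is typically handled with posix_memalign on Linux and Mac OS X and _aligned_malloc on Windows.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>

    /* Multiply 16 GF(2^8) elements held in one 128-bit register by the scalar
     * coefficient c, using the same shift-and-XOR recurrence as the scalar loop. */
    static __m128i gf256_mul_vec(__m128i v, uint8_t c)
    {
        const __m128i poly = _mm_set1_epi8(0x1d);          /* reduction constant     */
        const __m128i lsb0 = _mm_set1_epi8((char)0xfe);    /* clears bit 0 per byte  */
        const __m128i zero = _mm_setzero_si128();
        __m128i result = _mm_setzero_si128();

        for (int bit = 0; bit < 8; bit++) {
            if (c & 1)
                result = _mm_xor_si128(result, v);         /* accumulate where c has a 1 bit */
            c >>= 1;

            /* overflow mask: 0xff in every byte whose high bit is set */
            __m128i carry = _mm_cmpgt_epi8(zero, v);
            /* per-byte left shift: shift 16-bit lanes, then clear the bit that
             * leaked across each byte boundary */
            v = _mm_and_si128(_mm_slli_epi16(v, 1), lsb0);
            /* conditionally XOR the reduction polynomial where there was overflow */
            v = _mm_xor_si128(v, _mm_and_si128(carry, poly));
        }
        return result;
    }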

42-46 SIMD instructions: available everywhere.
- Intel processors: SSE2 (Streaming SIMD Extensions 2) available since the Pentium 4 (2001).
- AMD processors: SSE2 available since the Athlon 64 and Opteron (2004).
- IBM PowerPC family of processors: AltiVec available in the PowerPC G4 and PowerPC G5 (since 2001).
- Our implementation supports all of the above SIMD instruction sets, running on Mac OS X, Microsoft Windows, and Linux (a possible runtime-dispatch sketch follows this list).
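
One way such multi-platform support is typically wired up is a small runtime dispatch between the SIMD and scalar code paths. The sketch below is illustrative only; the gf256_mul_region_*() names are placeholders for a routine that multiplies a k-byte region by one coefficient and XORs it into a destination row.

    #include <stddef.h>
    #include <stdint.h>

    void gf256_mul_region_scalar(uint8_t *dst, const uint8_t *src, uint8_t c, size_t k);
    void gf256_mul_region_sse2  (uint8_t *dst, const uint8_t *src, uint8_t c, size_t k);

    typedef void (*mul_region_fn)(uint8_t *, const uint8_t *, uint8_t, size_t);

    /* Pick the fastest available implementation once at startup. */
    mul_region_fn pick_mul_region(void)
    {
    #if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
        if (__builtin_cpu_supports("sse2"))    /* GCC/Clang runtime CPU-feature check */
            return gf256_mul_region_sse2;
    #endif
        return gf256_mul_region_scalar;        /* portable fallback */
    }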

47 Speedup with SIMD acceleration: decoding. [Plot: speedup versus block size (bytes) for Quad Xeon and Quad G5 at n = 64, 128, and 256 blocks.]

48 Decoding performance with SIMD acceleration. [Plot: decoding rate in MBytes/second versus block size (bytes) for Quad Xeon and Quad G5 at n = 64, 128, and 256 blocks.]

49 Parallelized implementation to utilize multi-core processors: b = C^(-1) x^T.

50 Partitioning the decoding of coded blocks. [Figure: rows of n coding coefficients c'_1..c'_3 and k bytes of coded data x'_1..x'_3; the data portion is split into two halves of k/2 bytes, one per thread, while each thread works with the full n coefficients.]

51 Speedup of threaded decoding. [Plot: speedup versus block size (bytes) for Quad Xeon and Quad G5 at n = 64, 128, and 256 blocks.]

52 Performance of threaded decoding. [Plot: decoding rate in MBytes/second versus block size (bytes) for Quad Xeon and Quad G5 at n = 64, 128, and 256 blocks.]

53 Partitioning both coefficients and coded blocks. [Figure: in the more aggressive scheme, the n coefficients of each row are also split into two halves of n/2, so each thread reduces both its half of the coefficients and its k/2 bytes of coded data.]
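
A minimal sketch of the first partitioning scheme using POSIX threads: both threads apply the same row operation, but to disjoint k/2-byte halves of the coded data. The names and structure are illustrative; a real implementation would keep a persistent pool of worker threads rather than creating and joining a thread per row operation.

    #include <pthread.h>
    #include <stdint.h>

    uint8_t gf256_mul(uint8_t a, uint8_t b);   /* GF(2^8) multiply */

    /* One elimination task: dst[j] ^= f * src[j] over columns [col_lo, col_hi). */
    typedef struct {
        uint8_t *dst;
        const uint8_t *src;
        uint8_t f;
        int col_lo, col_hi;
    } RowOp;

    static void *row_op_worker(void *arg)
    {
        RowOp *op = (RowOp *)arg;
        for (int j = op->col_lo; j < op->col_hi; j++)
            op->dst[j] ^= gf256_mul(op->f, op->src[j]);
        return NULL;
    }

    /* Apply one row operation to a k-byte data row, split across two threads. */
    void row_op_parallel(uint8_t *dst, const uint8_t *src, uint8_t f, int k)
    {
        pthread_t t;
        RowOp lo = { dst, src, f, 0, k / 2 };     /* thread 1: first half  */
        RowOp hi = { dst, src, f, k / 2, k };     /* caller:   second half */
        pthread_create(&t, NULL, row_op_worker, &lo);
        row_op_worker(&hi);
        pthread_join(t, NULL);
    }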

54 Performance of aggressive threading. [Plot: performance (MB/sec) and speedup versus the number of blocks, at 1024 bytes per block, for Quad Xeon, Dual Xeon, and Quad G5.]

55 Coding performance on various platforms (blocks of 4 KB each; the measured encoding and decoding rates in MB/s did not survive the transcription):
- Quad PowerPC G5, 2.5 GHz: Mac OS X, AltiVec, 4 threads, 1 MB L2 cache
- Quad P4 Xeon, 2.8 GHz: Linux, SSE2 (thread count and L2 cache size not recovered)
- Dual Opteron (AMD), 2.4 GHz: Linux, SSE2, 2 threads, 1 MB L2 cache
- Dual P4 Xeon, 3.6 GHz: Linux, SSE2, 2 threads, 2 MB L2 cache
- iMac Intel Core Duo, 1.83 GHz: Mac OS X, SSE2, 2 threads, 2 MB (shared) L2 cache
- Intel Core Duo, 1.66 GHz: Windows XP, SSE2, 2 threads, 2 MB (shared) L2 cache

56-62 The lunch is almost free. Network coding is almost free with current processors:
- 1248 Mbps for 16 blocks of 32 KB each.
- 348 Mbps for 64 blocks of 32 KB each.
- Encode and decode as you progress.
- Supports all modern processors (Intel, AMD, PowerPC) and OSes (Windows, Mac OS X, Linux).
- Don't forget Moore's Law!

63 To revisit this presentation (PDF or Flash): iqua.ece.toronto.edu. Hassan Shojania, Baochun Li, Department of Electrical and Computer Engineering, University of Toronto.
