Parallelized Progressive Network Coding with Hardware Acceleration
1 Parallelized Progressive Network Coding with Hardware Acceleration. Hassan Shojania, Baochun Li. Department of Electrical and Computer Engineering, University of Toronto.
2-5 Network coding: information is coded at potentially every node. (Diagram: a butterfly-style network of seven nodes in which source blocks [a] and [b] travel along separate paths, and an intermediate node forwards the coded combination [a+b] so that both downstream receivers can recover a and b.)
6 Randomized network coding [Ho et al. 2003]. The data flow is divided into segments of blocks (here the blocks "h", "e", "l", "l", "o"), which are encoded at the sender and decoded at the receiver. Coding operates in GF(2^8); an encoded block is a linear combination of the original blocks, e.g. 166·h + 191·e + 216·l + 109·l + 237·o, where the coding coefficients (166, 191, 216, 109, 237) are randomly generated.
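A minimal sketch of this encoding step in C (an illustration, not the authors' implementation): each coded block is produced by drawing one random coefficient per original block and accumulating the coefficient-scaled blocks with byte-wise GF(2^8) arithmetic. The function names (gf256_mul, encode_block) and the 0x1D reduction constant, chosen to match the multiply loop shown later in the talk, are assumptions.

  #include <stdint.h>
  #include <stdlib.h>

  /* GF(2^8) multiply by repeated shift-and-XOR (reduction constant 0x1D). */
  static uint8_t gf256_mul(uint8_t x, uint8_t y) {
      uint8_t r = 0;
      while (x != 0) {
          if (x & 1) r ^= y;            /* add y when the low bit of x is set */
          uint8_t carry = y & 0x80;     /* remember the bit about to overflow */
          y <<= 1;
          if (carry) y ^= 0x1d;         /* reduce modulo the field polynomial */
          x >>= 1;
      }
      return r;
  }

  /* One coded block: out = sum_i coef[i] * block[i], over n blocks of k bytes. */
  void encode_block(uint8_t **block, int n, int k, uint8_t *coef, uint8_t *out) {
      for (int t = 0; t < k; t++) out[t] = 0;
      for (int i = 0; i < n; i++) {
          coef[i] = (uint8_t)(rand() & 0xff);      /* random coding coefficient */
          for (int t = 0; t < k; t++)
              out[t] ^= gf256_mul(coef[i], block[i][t]);
      }
  }

The coefficient vector coef travels with the coded block, which is the communication overhead of carrying coefficients mentioned on a later slide.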
7-10 To code or not to code on a network node?
  In content distribution applications [Gkantsidis et al., INFOCOM 2005]: simplify the protocols; improve resilience to network dynamics.
  In wireless networks [Katti et al., SIGCOMM 2006]: naturally leverage wireless broadcast channels; alleviate congested localities.
  Do the advantages come for free? There is high coding complexity [Wang and Li, IWQoS 2006] and communication overhead of carrying the coefficients.
11 Our contributions
12-18 Coming (almost) free to a desktop near you
  The first implementation of randomized network coding with a focus on high performance.
  Goal: alleviate the number one challenge of network coding, its complexity, using a combination of three tricks (discussed next).
  Future use of network coding is then fully justified: trade off processor power for coding benefits.
19 A 20x performance boost over a baseline implementation that predates this work. Example after optimization: 348 Mbps (64 blocks of 32 KB each) on a Power Mac G5 Quad (circa October 2005).
20 Progressive decoding as blocks arrive: b = C^-1 x^T (the original blocks are recovered from the received coded blocks and the matrix of coding coefficients).
21-30 Progressive decoding: Gauss-Jordan elimination. Decoding time overlaps with the time needed to receive the coded blocks. (Animation: the decoding matrix, consisting of a coding-coefficient part and a data part, is reduced step by step as each coded block arrives, until the original data "h e l l o" is recovered.)
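A minimal sketch in C of one progressive Gauss-Jordan step (not the authors' code): each arriving coded block is reduced against the pivot rows received so far, normalized, and then back-substituted into the stored rows, so that once n innovative blocks have arrived the data part of the matrix holds the original blocks. N (blocks per segment), K (bytes per block), and all names are illustrative assumptions; gf256_mul is the byte multiply from the earlier sketch.

  #include <stdint.h>
  #include <string.h>

  #define N 64                 /* blocks per segment (assumed) */
  #define K 32768              /* bytes per block (assumed)    */

  uint8_t gf256_mul(uint8_t x, uint8_t y);     /* shift-and-XOR multiply, as before */

  static uint8_t gf256_inv(uint8_t x) {        /* x^254 = x^-1 in GF(2^8), x != 0 */
      uint8_t r = 1;
      for (int i = 0; i < 254; i++) r = gf256_mul(r, x);
      return r;
  }

  static uint8_t coef[N][N];     /* coefficient part of the decoding matrix  */
  static uint8_t data[N][K];     /* data part of the decoding matrix         */
  static int pivot_of[N];        /* row holding the pivot of column j, or -1 */
  static int rank;

  void decoder_init(void) {
      for (int j = 0; j < N; j++) pivot_of[j] = -1;
      rank = 0;
  }

  /* Absorb one coded block (coefficients c, payload d).
     Returns 1 if the block was innovative, 0 if linearly dependent. */
  int progressive_decode(uint8_t c[N], uint8_t d[K]) {
      /* 1. Eliminate the columns that already have pivots. */
      for (int j = 0; j < N; j++) {
          if (c[j] == 0 || pivot_of[j] < 0) continue;
          int p = pivot_of[j];
          uint8_t f = c[j];
          for (int t = 0; t < N; t++) c[t] ^= gf256_mul(f, coef[p][t]);
          for (int t = 0; t < K; t++) d[t] ^= gf256_mul(f, data[p][t]);
      }
      /* 2. Locate the new pivot column; none means a dependent block. */
      int j = 0;
      while (j < N && c[j] == 0) j++;
      if (j == N) return 0;
      /* 3. Normalize the row so the pivot entry becomes 1. */
      uint8_t inv = gf256_inv(c[j]);
      for (int t = 0; t < N; t++) c[t] = gf256_mul(inv, c[t]);
      for (int t = 0; t < K; t++) d[t] = gf256_mul(inv, d[t]);
      /* 4. Back-substitute into the rows stored so far (Gauss-Jordan). */
      for (int jj = 0; jj < N; jj++) {
          int p = pivot_of[jj];
          if (p < 0 || coef[p][j] == 0) continue;
          uint8_t f = coef[p][j];
          for (int t = 0; t < N; t++) coef[p][t] ^= gf256_mul(f, c[t]);
          for (int t = 0; t < K; t++) data[p][t] ^= gf256_mul(f, d[t]);
      }
      /* 5. Store the new pivot row; when rank == N, data[] holds the originals. */
      memcpy(coef[rank], c, N);
      memcpy(data[rank], d, K);
      pivot_of[j] = rank++;
      return 1;
  }

Because each arriving block only has to be reduced against the rows already present, the per-block work overlaps with network transfer instead of being paid all at once after the last block arrives.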
31 Hardware acceleration using SIMD vector instructions (x86 SSE2 and PowerPC AltiVec): b = C^-1 x^T
32-36 Baseline implementation: the bottleneck
  Multiplication in GF(2^8) is computed as exp[log[x] + log[y]], which requires 3 table lookups, i.e., expensive memory accesses.
  This multiplication is the basic building block, executed within tight nested loops.
  It is impossible to accelerate with SIMD vector instructions, and changing to GF(2^16) does not help.
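For concreteness, a sketch in C of such a table-lookup multiply (an illustration, not the authors' baseline): the tables are built once from the generator 2 under the reduction polynomial 0x11D, an assumption chosen to match the 0x1D constant on the next slide, and each multiply then costs the three lookups log[x], log[y], and exp[...].

  #include <stdint.h>

  static uint8_t gf_exp[512];     /* doubled table: no modulo-255 needed on the sum */
  static uint8_t gf_log[256];

  void gf256_tables_init(void) {
      uint16_t v = 1;
      for (int i = 0; i < 255; i++) {
          gf_exp[i] = (uint8_t)v;
          gf_log[v] = (uint8_t)i;
          v <<= 1;                        /* multiply by the generator 2             */
          if (v & 0x100) v ^= 0x11d;      /* reduce modulo x^8 + x^4 + x^3 + x^2 + 1 */
      }
      for (int i = 255; i < 512; i++) gf_exp[i] = gf_exp[i - 255];
  }

  /* Three dependent table lookups per byte multiplied: the memory-bound hot spot. */
  static inline uint8_t gf256_mul_table(uint8_t x, uint8_t y) {
      if (x == 0 || y == 0) return 0;
      return gf_exp[gf_log[x] + gf_log[y]];
  }

Because every lookup depends on the previous one and the indices are data-dependent, this sequence resists SIMD vectorization, which is exactly the bottleneck the slide describes.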
37-41 Solution: multiply using a loop-based approach

  while (x != 0) {
      if ((x & 1) != 0)
          result = result ^ y;
      overflowing = y & 0x80;
      y = y << 1;
      if (overflowing)
          y = y ^ 0x1d;
      x = x >> 1;
  }

  Mainly bit shifting and XORs in a loop.
  Slower than table lookups without acceleration.
  But it can be accelerated with SIMD vector instructions to operate on 16 bytes concurrently using 128-bit registers.
  Challenge: 16-byte alignment of memory allocations is OS-specific.
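A sketch of that acceleration using SSE2 intrinsics from <emmintrin.h> (an illustration under the slide's 0x1D constant, not the authors' code): the same shift-and-XOR loop is applied to 16 bytes at once, multiplying every byte of a 16-byte chunk by one coding coefficient c. SSE2 has no 8-bit shift, so the per-byte doubling is emulated with a 16-bit shift plus a mask.

  #include <emmintrin.h>
  #include <stdint.h>

  /* Multiply each of the 16 bytes in v by the scalar coefficient c over GF(2^8). */
  static __m128i gf256_mul_sse2(__m128i v, uint8_t c) {
      const __m128i poly  = _mm_set1_epi8(0x1d);
      const __m128i hibit = _mm_set1_epi8((char)0x80);
      __m128i result = _mm_setzero_si128();
      __m128i y = v;
      for (int bit = 0; bit < 8; bit++) {
          if (c & (1u << bit))                   /* scalar test on the coefficient */
              result = _mm_xor_si128(result, y);
          /* y = 2 * y in GF(2^8), in every byte lane: */
          __m128i carry = _mm_cmpeq_epi8(_mm_and_si128(y, hibit), hibit);
          y = _mm_and_si128(_mm_slli_epi16(y, 1), _mm_set1_epi8((char)0xfe));
          y = _mm_xor_si128(y, _mm_and_si128(carry, poly));  /* fold in 0x1D on overflow */
      }
      return result;
  }

  /* Multiply-and-accumulate one 16-byte chunk of a block: dst ^= c * src. */
  void gf256_axpy16(uint8_t *dst, const uint8_t *src, uint8_t c) {
      __m128i d = _mm_loadu_si128((const __m128i *)dst);
      __m128i s = _mm_loadu_si128((const __m128i *)src);
      _mm_storeu_si128((__m128i *)dst, _mm_xor_si128(d, gf256_mul_sse2(s, c)));
  }

With 16-byte-aligned buffers the unaligned loads and stores can be replaced by _mm_load_si128 and _mm_store_si128, which is where the alignment challenge above comes from.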
42-46 SIMD instructions: available everywhere
  Intel processors: SSE2 (Streaming SIMD Extensions 2), available since the Pentium 4 (2001).
  AMD processors: SSE2, available since the Athlon 64 and Opteron (2004).
  IBM PowerPC family of processors: AltiVec, available in the PowerPC G4 and PowerPC G5 (since 2001).
  Our implementation supports all of the above SIMD instruction sets, running on Mac OS X, Microsoft Windows, and Linux.
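The OS-specific part is mainly 16-byte-aligned allocation for the 128-bit vector loads and stores. A minimal sketch using the standard allocators of each platform (these are ordinary platform APIs, and the helper name alloc16 is an assumption, not part of the presented implementation):

  #include <stdlib.h>
  #if defined(_WIN32)
  #include <malloc.h>
  #endif

  /* Return a 16-byte-aligned buffer of the given size, or NULL on failure. */
  void *alloc16(size_t bytes) {
  #if defined(_WIN32)
      return _aligned_malloc(bytes, 16);     /* Windows C runtime                  */
  #elif defined(__APPLE__)
      return malloc(bytes);                  /* Mac OS X malloc is 16-byte aligned */
  #else
      void *p = NULL;
      return posix_memalign(&p, 16, bytes) == 0 ? p : NULL;   /* Linux / POSIX */
  #endif
  }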
47 Speedup with SIMD acceleration: decoding. (Plot: speedup versus block size in bytes, for Quad Xeon and Quad G5 with n = 64, 128, and 256 blocks.)
48 Decoding performance with SIMD acceleration. (Plot: decoding rate in MBytes/second versus block size in bytes, for Quad Xeon and Quad G5 with n = 64, 128, and 256 blocks.)
49 Parallelized implementation to utilize multi-core processors: b = C^-1 x^T
50 Partitioning the decoding of coded blocks. (Diagram: each thread works on the full set of coefficient rows c'_1, c'_2, c'_3, but the k data bytes of every row x'_1, x'_2, x'_3 are split into two k/2 slices, one per thread; a threading sketch follows.)
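A minimal POSIX-threads sketch of that data partitioning (not the authors' implementation; the structure and names are assumptions, and gf256_mul is the byte multiply from the earlier sketches): each thread applies the same row operation to its own byte range of the block data, so the two k/2 slices are reduced in parallel.

  #include <pthread.h>
  #include <stdint.h>

  uint8_t gf256_mul(uint8_t x, uint8_t y);   /* as in the earlier sketch */

  struct slice_job {
      uint8_t       *dst;          /* data row being reduced                       */
      const uint8_t *src;          /* pivot row being subtracted                   */
      uint8_t        factor;       /* elimination factor for this row              */
      int            begin, end;   /* byte range [begin, end) owned by this thread */
  };

  static void *reduce_slice(void *arg) {
      struct slice_job *j = arg;
      for (int t = j->begin; t < j->end; t++)
          j->dst[t] ^= gf256_mul(j->factor, j->src[t]);
      return NULL;
  }

  /* Apply dst ^= factor * src over k data bytes, split across two threads. */
  void reduce_row_threaded(uint8_t *dst, const uint8_t *src, uint8_t factor, int k) {
      pthread_t tid[2];
      struct slice_job job[2] = {
          { dst, src, factor, 0,     k / 2 },
          { dst, src, factor, k / 2, k     },
      };
      for (int i = 0; i < 2; i++) pthread_create(&tid[i], NULL, reduce_slice, &job[i]);
      for (int i = 0; i < 2; i++) pthread_join(tid[i], NULL);
  }

A real implementation would keep a persistent pool of worker threads rather than creating and joining threads for every row operation; the sketch only illustrates how the data bytes are divided.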
51 Speedup of threaded decoding. (Plot: speedup versus block size in bytes, for Quad Xeon and Quad G5 with n = 64, 128, and 256 blocks.)
52 Performance of threaded decoding. (Plot: decoding rate in MBytes/second versus block size in bytes, for Quad Xeon and Quad G5 with n = 64, 128, and 256 blocks.)
53 Partitioning both coefficients and coded blocks. (Diagram: more aggressive threading splits each row's n coefficient bytes into two n/2 slices in addition to splitting its k data bytes into two k/2 slices, so each thread owns half of both parts.)
54 Performance of aggressive threading. (Plot: performance in MB/sec and speedup versus the number of blocks, for Quad Xeon, Dual Xeon, and Quad G5 at 1024 bytes per block.)
55 Coding performance (encoding and decoding rates, MB/s) on various platforms, with blocks of 4 KB each. Platform setups:

  System                          OS           SIMD type   # of threads   L2 cache
  Quad PowerPC G5 2.5 GHz         Mac OS X     AltiVec     4              1 MB
  Quad P4 Xeon 2.8 GHz            Linux        SSE2
  Dual Opteron (AMD) 2.4 GHz      Linux        SSE2        2              1 MB
  Dual P4 Xeon 3.6 GHz            Linux        SSE2        2              2 MB
  iMac Intel Core Duo 1.83 GHz    Mac OS X     SSE2        2              2 MB (shared)
  Intel Core Duo 1.66 GHz         Windows XP   SSE2        2              2 MB (shared)
56-62 The lunch is almost free
  Network coding is almost free with current processors:
    1248 Mbps for 16 blocks of 32 KB each
    348 Mbps for 64 blocks of 32 KB each
  Encode and decode as you progress.
  Supports all modern processors (Intel, AMD, PowerPC) and OSes (Windows, Mac OS X, Linux).
  And don't forget Moore's Law!
63 To revisit this presentation (PDF or Flash): iqua.ece.toronto.edu. Hassan Shojania, Baochun Li, Department of Electrical and Computer Engineering, University of Toronto.