Instruction and Data Streams


Advanced Architectures
Master Informatics Eng., 2017/18
A.J.Proença
Data Parallelism 1 (vector & SIMD extensions)
(most slides are borrowed)
AJProença, Advanced Architectures, MiEI, UMinho, 2017/18

Instruction and Data Streams (slides 2 and 3)

                              Data Streams
                              Single                     Multiple
Instruction Streams  Single   SISD: Intel Pentium 4      SIMD: SSE instructions of x86
                     Multiple MISD: No examples today    MIMD: Intel Xeon e5345

SPMD: Single Program Multiple Data
- A parallel program on a MIMD computer
- Conditional code for different processors

(Section 7.6: SISD, MIMD, SIMD, SPMD, and Vector. Chapter 7: Multicores, Multiprocessors, and Clusters)

Introduction
- SIMD architectures can exploit significant data-level parallelism for:
  - matrix-oriented scientific computing
  - media-oriented image and sound processing
- SIMD is more energy efficient than MIMD
  - only needs to fetch one instruction per data operation
  - makes SIMD attractive for personal mobile devices
- SIMD allows programmers to continue to think sequentially

Copyright 2012, Elsevier Inc. All rights reserved.
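The SPMD model mentioned above (one program, with conditional code for different processors) can be sketched roughly as follows. This is only an illustration: the names `spmd_program` and `run_spmd` are hypothetical, and the "processors" are simulated by a sequential loop rather than launched on a real MIMD machine.

```python
def spmd_program(rank, nprocs, data):
    """The single program that every (simulated) processor runs."""
    # Each processor works on its own contiguous block of the data
    chunk = (len(data) + nprocs - 1) // nprocs
    local = data[rank * chunk:(rank + 1) * chunk]
    partial = sum(local)
    if rank == 0:
        # Conditional code: only processor 0 takes the coordinator role
        return ("coordinator", partial)
    return ("worker", partial)


def run_spmd(nprocs, data):
    # Sequential stand-in for starting the same program on nprocs processors
    return [spmd_program(rank, nprocs, data) for rank in range(nprocs)]


results = run_spmd(4, list(range(16)))
total = sum(partial for _, partial in results)
```

Every rank executes the same function; only the branch on `rank` differs, which is the point of the "conditional code" bullet.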


SIMD Parallelism
- Vector architectures (slides 5 to 19)
- SIMD & extensions (slides 20 to 30)
- Graphics Processor Units (GPUs) (next set)

Introduction (slide 5)
- For x86 processors:
  - Expected growth: 2 more cores/chip/year
  - SIMD width: 2x every 4 years
  - Potential speedup: SIMD 2x that from MIMD!

Vector architectures: basic idea (slide 6)
- Read sets of data elements (gather from memory) into vector registers
- Operate on those registers
- Store/scatter the results back into memory
- Registers are controlled by the compiler
  - Used to hide memory latency
  - Leverage memory bandwidth
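The load, operate, store pattern of the basic idea can be sketched in a few lines. This is a minimal model, not a description of any real hardware: `VLEN` and `vector_add_scalar` are hypothetical names, and a Python list slice stands in for a vector register.

```python
VLEN = 4  # hypothetical vector register length in elements, chosen for illustration


def vector_add_scalar(memory, base, n, scalar):
    """Add `scalar` to n elements starting at memory[base], VLEN elements at a time."""
    for start in range(base, base + n, VLEN):
        end = min(start + VLEN, base + n)
        vreg = memory[start:end]            # read a set of elements into a "vector register"
        vreg = [x + scalar for x in vreg]   # one operation applied to the whole register
        memory[start:end] = vreg            # store the results back into memory
    return memory


mem = vector_add_scalar(list(range(10)), base=2, n=6, scalar=100)
```

Each loop iteration corresponds to one vector load, one vector operation, and one vector store, regardless of how many elements the register holds.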


(slide 7: AJProença, Sistemas de Computação e Desempenho, MInf, UMinho, 2010/11)

VMIPS (slide 8)
- Example architecture: VMIPS
  - Loosely based on Cray-1 (next slide)
- Vector registers
  - Each register holds a 64-element, 64 bits/element vector
  - Register file has 16 read ports and 8 write ports
- Vector functional units
  - Fully pipelined, new op each clock-cycle
  - Data & control hazards are detected
- Vector load-store unit
  - Fully pipelined
  - 1 word/clock-cycle after initial latency
- Scalar registers
  - 32 general-purpose registers
  - 32 floating-point registers
- Crossbar switches


Cray-1 Supercomputer (1976) (slide 9)

VMIPS Instructions (slide 10)
- ADDVV.D: add two vectors
- ADDVS.D: add vector to a scalar
- LV/SV: vector load and vector store from address
- Example: DAXPY (Double-precision A x X Plus Y)

    L.D      F0,a        ; load scalar a
    LV       V1,Rx       ; load vector X
    MULVS.D  V2,V1,F0    ; vector-scalar multiply
    LV       V3,Ry       ; load vector Y
    ADDVV.D  V4,V2,V3    ; add
    SV       Ry,V4       ; store the result

- Requires the execution of 6 instructions versus almost 600 for MIPS (assuming DAXPY is operating on a vector with 64 elements)
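To make the "6 versus almost 600" comparison concrete, here is a small sketch of DAXPY and of the instruction count for a scalar loop. The scalar count rests on an assumption not stated on the slide: a MIPS loop with 9 instructions per element plus 2 setup instructions. The function names are illustrative only.

```python
def daxpy(a, x, y):
    """Return a*x + y element by element (the DAXPY operation)."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def scalar_instruction_count(n, loop_body=9, setup=2):
    # Assumption: 9 instructions per loop iteration (loads, multiply, add,
    # store, pointer updates, branch) plus 2 instructions before the loop
    return setup + loop_body * n


VMIPS_INSTRUCTIONS = 6  # L.D, LV, MULVS.D, LV, ADDVV.D, SV

n = 64
result = daxpy(2.0, [1.0] * n, [3.0] * n)
mips_count = scalar_instruction_count(n)
ratio = mips_count / VMIPS_INSTRUCTIONS
```

Under that assumption the scalar version executes 578 instructions for 64 elements, which is consistent with the slide's "almost 600".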

Vector Execution Time (slide 11)
- Execution time depends on three factors:
  - Length of operand vectors
  - Structural hazards
  - Data dependencies
- VMIPS functional units consume one element per clock cycle
  - Execution time is approximately the vector length
- Convoy
  - Set of vector instructions that could potentially execute together in one unit of time, a chime

Challenges (slide 12)
- Start-up time
  - Latency of vector functional unit
  - Assume the same as Cray-1
    - Floating-point add => 6 clock cycles
    - Floating-point multiply => 7 clock cycles
    - Floating-point divide => 20 clock cycles
    - Vector load => 12 clock cycles
- Improvements:
  - > 1 element per clock cycle (1)
  - Non-64 wide vectors (2)
  - IF statements in vector code (3)
  - Memory system optimizations to support vector processors (4)
  - Multiple dimensional matrices (5)
  - Sparse matrices (6)
  - Programming a vector computer (7)
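The convoy and chime approximation can be turned into a rough cycle estimate. The sketch below assumes that the DAXPY sequence splits into 3 convoys because only one vector load-store unit is available; that grouping is an assumption for illustration, not something the slides state. The estimate deliberately ignores the start-up latencies listed above.

```python
def chime_estimate(num_convoys, vector_length):
    """Approximate cycles: each convoy takes one chime, about vector_length cycles."""
    return num_convoys * vector_length


# Assumed convoys for DAXPY with a single load-store unit:
#   1: LV, MULVS.D    2: LV, ADDVV.D    3: SV
DAXPY_CONVOYS = 3
VECTOR_LENGTH = 64

cycles = chime_estimate(DAXPY_CONVOYS, VECTOR_LENGTH)
cycles_per_element = cycles / VECTOR_LENGTH
```

With these assumptions the estimate is 192 cycles, or 3 cycles per element; a more careful model would add the start-up time of each functional unit.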

