Computer and Hardware Architecture II
Benny Thörnberg, Associate Professor in Electronics
Parallelism: Microscopic vs Macroscopic
Microscopic parallelism: hardware solutions inside system components that provide parallel computation without being visible to the user, e.g.:
- Registers
- Memory
- Parallel buses
- Instruction pipelines
Macroscopic parallelism: duplicated large-scale components that provide parallelism on the system level, e.g.:
- Dual- or quad-core processors
- Vector or graphics processors
- Co-processors
- I/O processors
Parallelism: Symmetric vs Asymmetric
Symmetric parallelism uses replicas of identical processing elements that can operate in parallel, e.g.:
- Multicore processors
Asymmetric parallelism uses a set of processing elements that operate in parallel but differ from each other, e.g.:
- A PC with a CPU, graphics processor, math processor and I/O processor
Parallelism: Fine-grain vs Coarse-grain
Fine-grain parallelism: computers providing parallel computation at the level of instructions or individual data items, e.g.:
- Vector processors
- Digital signal processors with special SIMD instructions
Coarse-grain parallelism: computers providing parallelism at the level of programs or larger data structures, e.g.:
- Dual- or quad-core processors
Parallelism: Explicit vs Implicit
Explicit parallelism: the programmer needs to control how the available parallelism is exploited in the code, e.g. through partitioning into parallel processes, constraints and special instructions.
Implicit parallelism: the hardware can exploit parallelism in the executed code without constraints or special instructions defined by the programmer.
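A minimal Python sketch (not from the slides) of what explicit parallelism means in practice: the programmer decides how the data is partitioned and how many workers run; the chunking scheme and worker count below are illustrative assumptions.

```python
# Explicit parallelism sketch: the programmer partitions the work into
# independent tasks and hands them to a pool of parallel workers.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # one independent task: sum of squares over its own slice of the data
    return sum(x * x for x in chunk)

data = list(range(1000))
chunks = [data[i::4] for i in range(4)]   # explicit partitioning into 4 tasks

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))
```

With implicit parallelism, by contrast, the same sequential `sum` loop would run unchanged and any parallel execution would be found by the hardware.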
Flynn's taxonomy
In 1966, Michael J. Flynn proposed a classification of computers by the number of instruction streams and data streams:
- SISD: Single instruction stream, Single data stream
- SIMD: Single instruction stream, Multiple data streams
- MISD: Multiple instruction streams, Single data stream
- MIMD: Multiple instruction streams, Multiple data streams
Flynn's taxonomy - SISD
Capable of executing single instructions operating on a single data stream, e.g. the conventional von Neumann architecture.
[Figure: one instruction stream and one data stream feeding a single processor]
Flynn's taxonomy - SIMD
Capable of executing the same instruction on all processing elements, each operating on a different data stream, e.g. vector processors.
[Figure: one instruction stream broadcast to several processors, each with its own data stream]
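A toy Python illustration (assumed, not from the slides) of SIMD semantics: one operation is applied element-wise across whole data vectors, as if every lane executed the same instruction in lock-step.

```python
# SIMD in miniature: a single "instruction" (element-wise add) applied
# across all lanes of the data vectors at once.
def simd_add(vec_a, vec_b):
    # one instruction stream, multiple data streams
    return [a + b for a, b in zip(vec_a, vec_b)]

result = simd_add([1, 2, 3, 4], [10, 20, 30, 40])
```

A real vector processor performs the element-wise operation in hardware; the list comprehension here only models the single-instruction, multiple-data semantics.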
Flynn's taxonomy - MISD
Executes different instructions on each processing element, all operating on the same data stream. (Useful for only a limited number of applications.)
[Figure: several processors with separate instruction streams sharing one data stream]
Flynn's taxonomy - MIMD
Executes multiple instruction streams on multiple data streams, e.g. multiprocessors.
[Figure: several processors, each with its own instruction and data stream]
System Bus Architectures (Reference)
Multi-master point-to-point communication over a single system bus requires bus arbitration. Processors, co-processors and DMA controllers typically operate as bus masters.
System Bus Architectures (Reference)
Time multiplexing of data and addresses on common lines:
- Lower cost
- Lower performance
System Bus Architectures (Reference)
A computer can be designed with multiple buses for different purposes. Connecting the buses with a bridge is the cheaper solution, typically used for e.g. USB or Ethernet.
System Bus Architectures Fetch and Store paradigm Reference
System Bus Architectures
Conclusions:
- A system bus can only perform one transfer at a time; it is thus a limited resource for communication.
- More than one master can compete for access to this resource: processors, co-processors and DMA controllers.
How can the limitations on communication over a system bus be mitigated?
Switching fabrics
Significantly more expensive than a system bus.
AXI4 channel switch (Reference: Xilinx User Guide 1037)
The Xilinx AXI4 bus is a derivative of the Arm AMBA bus, developed for SoC applications. The picture shows a switch for AXI4 that connects one or more similar AXI memory-mapped masters to one or more similar memory-mapped slaves.
AXI4 and AXI4-Lite bus
The master takes the initiative to a data transfer; the slave responds. The bus consists of five channels:
- Read address channel
- Write address channel
- Read data channel
- Write data channel
- Write response channel
Data can move simultaneously in both directions. AXI4 allows bursts of up to 256 data transfers using only one address. AXI4-Lite allows only single data transactions.
AXI4 bus read operation Reference: Xilinx User Guide 1037
AXI4 bus write operation Reference: Xilinx User Guide 1037
AXI4-stream
Unidirectional streaming of data from master to slave.
AXI4-stream implementation (Reference: Xilinx User Guide 1037)
Used for high-speed, data-centric streaming applications, e.g. video.
- TLAST indicates packet boundaries
- TVALID indicates valid data
AXI4-Stream Interconnect Reference: Xilinx User Guide 1037 Parallel routing of traffic between N masters and M slaves
Multiprocessor architectures (Reference)
Challenges for multiprocessor architectures:
- Communication
- Coordination
- Contention
Challenges
- Communication: must be scalable to handle communication between a large number of processors.
- Coordination: a strategy for how to distribute tasks among all processors is required.
- Contention: situations where two or more processors try to access a resource at the same time. This problem explodes with an increasing number of processors; problems occur in particular with memory accesses. Caching can mitigate this but introduces another problem, cache coherence: how to guarantee that the cache memories local to each processor carry the same data for common memory locations?
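A small Python sketch (not from the slides) of the contention problem: two workers update one shared counter, and a lock serializes the read-modify-write sequences so that no update is lost. The worker count and iteration count are arbitrary.

```python
# Contention sketch: two workers increment a shared counter.
# The lock makes each read-modify-write atomic; without such
# serialization, concurrent updates could interleave and be lost.
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:          # only one thread may update at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock is exactly the kind of shared resource the slide warns about: with many more workers, the serialization itself becomes the bottleneck.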
Using Peripheral Processors Reference
Performance of Multi-Processor architectures (Reference)
Data Pipelining
A pipeline divides a larger computational task into a series of smaller tasks (Stage 1 through Stage 5 between an input and an output data stream). Benefits:
- Smaller tasks are less complex to describe
- Allows for reuse of code modules
- Reveals coarse-grained parallelism that can be mapped to a multi-processor architecture for increased throughput
Data Pipelining
Necessary conditions:
- A partitionable problem
- Low communication overhead
- Processor speed equivalent to that of the single-processor solution
Data Pipelining
With stage execution times T1 ... T5 for Stage 1 through Stage 5:
Throughput = 1 / max(T1, ..., T5) [data items per time unit]
Latency = T1 + T2 + T3 + T4 + T5 [time units]
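The throughput and latency expressions above can be evaluated directly; the five stage times below are hypothetical numbers for illustration.

```python
# Pipeline timing sketch: five hypothetical stage execution times (time units).
stage_times = [3.0, 5.0, 4.0, 2.0, 5.0]

latency = sum(stage_times)            # time for one item to traverse all stages
throughput = 1.0 / max(stage_times)   # the slowest stage sets the output rate
```

Note that the slowest stage alone limits throughput, which is why balancing the stage times matters when partitioning the task.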
Data Flow Graph
A data flow graph describes computations without including any information on how the computation is to be carried out. Hence, only data flow, and no control flow, is described. This programming paradigm is supported by functional languages such as DFL, suitable for digital signal processing systems and ideal for capturing pipelined computations. Imperative languages such as C and C++ model both control and data flow and are poorly suited to capturing parallelism.
[Figure: input data stream through Actor 1, Actor 2, Actor 3 and Actor 4 to an output data stream]
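A Python sketch (assumed actor functions, not from the slides) of the data-flow idea: each actor is a pure function on a stream, and the program states only which actor's output feeds which actor's input, with no control flow.

```python
# Data-flow sketch: actors are pure transformations on streams; the
# composition expresses data dependencies only, not execution order.
def actor1(stream):
    return (x + 1 for x in stream)

def actor2(stream):
    return (x * 2 for x in stream)

def actor3(stream):
    return (x - 3 for x in stream)

def actor4(stream):
    return (x * x for x in stream)

input_stream = range(5)
output_stream = list(actor4(actor3(actor2(actor1(input_stream)))))
```

Because the actors share no state, a scheduler is free to map each one to its own processor and run them as a pipeline.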
Data Pipelining on FPGA logic
A large combinatorial network CN drives an output register (D flip-flop clocked by Clk). If the propagation delay time of CN is t_pd, the maximum frequency of the clock signal Clk becomes
f_max = 1 / t_pd
Data Pipelining on FPGA logic
Assume that CN is partitionable into M smaller combinatorial networks CN 1 ... CN M, with registers inserted between all combinatorial nets. With stage propagation delays t_pd,1 ... t_pd,M, the maximum clock frequency becomes
f_max = 1 / max(t_pd,1, ..., t_pd,M)
Latency = M clock cycles
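A numeric sketch of the effect, with an assumed total propagation delay of 20 ns split into M = 4 ideally balanced stages:

```python
# Pipelining a combinatorial net (hypothetical numbers, delays in ns).
t_pd_total = 20.0                      # unpipelined CN propagation delay
M = 4
stage_delays = [t_pd_total / M] * M    # ideally balanced stages, 5 ns each

f_max_before = 1.0 / t_pd_total        # GHz, since delays are in ns
f_max_after = 1.0 / max(stage_delays)  # clock can now run M times faster
latency_cycles = M                     # but a result takes M clock cycles
```

With perfectly balanced stages the clock frequency, and hence the throughput, improves by a factor of M, at the price of M cycles of latency per result.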
Power in computational logic
The dynamic energy consumed when changing the state of a CMOS logic output is
E = (1/2) C_L V_dd^2
where C_L is the total capacitive load of the output and V_dd is the supply voltage. The average dynamic power at switching frequency f is
P = E f = (1/2) C_L V_dd^2 f
We can conclude that power dissipation is proportional to the clock frequency and proportional to the square of the supply voltage. Trying to increase the speed of a processor by simply increasing the clock frequency while physical technology scaling increases the device density can only be done until the power wall is reached, at about P = 100 W with current technology.
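The scaling behaviour of P = (1/2) C_L V_dd^2 f can be checked numerically; the capacitance, voltage and frequency values below are illustrative assumptions.

```python
# Dynamic power sketch: P = E * f = 0.5 * C_L * V_dd**2 * f.
def dynamic_power(C_L, V_dd, f):
    E = 0.5 * C_L * V_dd ** 2   # energy per output transition (J)
    return E * f                # average power at switching frequency f (W)

# Assumed values: 1 pF load, 1.2 V supply, 1 GHz clock.
p_full = dynamic_power(1e-12, 1.2, 1e9)
p_half_v = dynamic_power(1e-12, 0.6, 1e9)   # halved supply voltage
```

Halving V_dd at fixed f quarters the power, while doubling f at fixed V_dd only doubles it, which is why supply-voltage reduction is the more powerful lever.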
Power in computational logic
The delay time of a gate can be approximated as
t_d = k V_dd / (V_dd - V_T)^α
where V_T is the CMOS threshold voltage and k, α are technology-dependent constants. The delay depends mostly on k and V_dd for larger supply voltages, but increases dramatically when V_dd is decreased close to V_T. Dynamic voltage and frequency scaling means that both the supply voltage and the clock frequency are adjusted so that a processor delivers just enough speed. A reduction of both frequency and supply voltage results in a dramatic reduction of dynamic power consumption.
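A sketch of the delay model with assumed constants (k = 1, α = 2, V_T = 0.3 V), showing the blow-up as V_dd approaches V_T:

```python
# Gate delay sketch: t_d = k * V_dd / (V_dd - V_T)**alpha.
# k, alpha and V_T are assumed, illustrative values.
def gate_delay(V_dd, V_T=0.3, k=1.0, alpha=2.0):
    return k * V_dd / (V_dd - V_T) ** alpha

fast = gate_delay(1.2)   # comfortable supply voltage
slow = gate_delay(0.4)   # close to V_T: delay grows sharply
```

This is the trade-off DVFS navigates: lowering V_dd saves power quadratically, but only as long as the resulting gate delay still meets the required clock period.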
Using sleep mode to control energy consumption (Reference)
Let E_sd be the energy consumed during shutdown and E_wu the energy consumed during wakeup. The energy consumed when running the processor for a time t is
E_run = P_run t
and the energy consumed when going to sleep for a time t is
E_sleep = E_sd + E_wu + P_sleep t
Energy is saved when E_sleep < E_run.
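A numeric sketch of the break-even condition, with assumed overhead energies and power levels; sleeping pays off only when the idle period exceeds the break-even time.

```python
# Sleep/run trade-off sketch (all numbers assumed for illustration).
E_sd, E_wu = 2e-3, 3e-3      # J: shutdown and wakeup overheads
P_run, P_sleep = 0.5, 1e-3   # W: run power and sleep power

def e_run(t):
    return P_run * t                     # stay awake for t seconds

def e_sleep(t):
    return E_sd + E_wu + P_sleep * t     # sleep for t seconds

# Setting e_sleep(t) = e_run(t) gives the break-even idle time:
t_break_even = (E_sd + E_wu) / (P_run - P_sleep)
```

For idle periods shorter than t_break_even, the shutdown and wakeup overheads cost more energy than simply staying awake.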
Example: Battery-powered oil detector for wastewater
A smart sensor can detect petroleum contamination in wastewater. Numerous sensors are installed at selected checkpoints, which allows tracing the sources of contamination. The sensor's task is to measure the wastewater every 15 minutes and send alarm data over a radio link whenever a contamination is detected. This task finishes in milliseconds, while the rest of the 15-minute cycle is spent sleeping.
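A duty-cycle sketch for this sensor, with assumed active time and power figures, showing how heavily sleep mode dominates the energy budget per 15-minute cycle:

```python
# Duty-cycle energy sketch for the oil detector (all numbers assumed).
T_cycle = 15 * 60        # s: one measurement cycle
t_active = 0.05          # s: measurement + radio transmission
P_active = 0.2           # W: active power
P_sleep = 20e-6          # W: sleep power

E_cycle = P_active * t_active + P_sleep * (T_cycle - t_active)
E_always_on = P_active * T_cycle   # for comparison: never sleeping
```

With these assumptions the sensor spends well under a thousandth of the always-on energy per cycle, which is what makes years of battery operation plausible.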