Stream Processors: Programmability with Efficiency


WILLIAM J. DALLY, UJVAL J. KAPASI, BRUCEK KHAILANY, JUNG HO AHN, AND ABHISHEK DAS, STANFORD UNIVERSITY

Many signal processing applications require both efficiency and programmability. Baseband signal processing in 3G cellular base stations, for example, requires hundreds of GOPS (giga, or billions, of operations per second) with a power budget of a few watts, an efficiency of about 100 GOPS/W (GOPS per watt), or 10 pJ/op (picojoules per operation). At the same time, programmability is needed to follow evolving standards, to support multiple air interfaces, and to dynamically provision processing resources over different air interfaces. Digital television, surveillance video processing, automated optical inspection, and mobile cameras, camcorders, and 3G cellular handsets have similar needs.

Conventional signal processing solutions can provide high efficiency or programmability, but not both at the same time. In applications that demand efficiency, a hardwired application-specific processor, an ASIC (application-specific integrated circuit) or ASSP (application-specific standard part), has an efficiency of 50 to 500 GOPS/W but offers little if any flexibility.
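As a quick unit check (our arithmetic), power efficiency and energy per operation are reciprocals:

\[ E_{\mathrm{op}} = \frac{P}{\mathrm{throughput}}, \qquad \frac{1\,\mathrm{W}}{100\,\mathrm{GOPS}} = 10\,\mathrm{pJ/op}, \qquad 50\ \mathrm{to}\ 500\,\mathrm{GOPS/W} \iff 20\ \mathrm{down\ to}\ 2\,\mathrm{pJ/op}. \]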


At the other extreme, microprocessors and DSPs (digital signal processors) are completely programmable but have efficiencies of less than 10 GOPS/W. DSP arrays and FPGAs (field-programmable gate arrays) offer higher performance than individual DSPs but have roughly the same efficiency. Moreover, these solutions are difficult to program, requiring parallelization, partitioning, and, for FPGAs, hardware design.

Applications today must choose between efficiency and programmability. Where power budgets are tight, efficiency wins, and the signal processing is implemented with an ASIC or ASSP, giving up programmability. In wireless communications systems, for example, this means that only a single air interface can be supported, or that a separate ASIC is needed for each air interface, with a static partitioning of resources (ASICs) between interfaces.

Stream processors are signal and image processors that offer both efficiency and programmability. Stream processors have efficiency comparable to ASICs (200 GOPS/W) while being programmable in a high-level language.

EXPOSING PARALLELISM AND LOCALITY

A stream program (sometimes called a synchronous data-flow program) expresses a computation as a signal flow graph with streams of records (the edges) flowing between computation kernels (the nodes). Most signal-processing applications are naturally expressed in this style. For example, figure 1 shows a stream program that performs stereo depth extraction based on the algorithm of Kanade.1 In this application, a stereo pair of images is first filtered, and the two filtered images are then compared with each other to extract the depth at each pixel.

[Figure 1: stream program for stereo depth extraction. Each input image (image 0, image 1) flows through a Gaussian kernel and a Laplacian kernel; the two filtered images feed the SAD kernel, which produces the depth map.]

Along the top path, a stream of pixels from the left image is filtered by a Gaussian kernel to reject high-frequency noise, generating a stream of smoothed pixels. This stream is filtered by a Laplacian kernel to highlight edges, generating a stream of edge-enhanced pixels. The right image follows a similar filtering path. The SAD (sum of absolute differences) kernel then compares a 7x7 sub-image about each pixel in the filtered left image with a row of 7x7 sub-images in the filtered right image to find the best match. The position of the best match gives the disparity between the two images at that pixel, from which we can derive the depth of the pixel.
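In code, the structure of figure 1 might be sketched as follows. This is a minimal plain-C illustration of the dataflow, not the kernel C or C++ source the actual tools consume; the block size, pixel format, and kernel internals are placeholders:

#include <stddef.h>

#define BLOCK 512          /* records per kernel invocation (placeholder) */

typedef struct { unsigned char px[BLOCK]; } Block;

/* Each kernel reads its input stream(s) and writes its output stream;
   all temporaries stay local to the kernel (kernel locality). */
static void gaussian (const Block *in, Block *out) { /* smooth: reject HF noise */ }
static void laplacian(const Block *in, Block *out) { /* highlight edges */ }
static void sad(const Block *l, const Block *r, Block *depth)
{ /* compare 7x7 sub-images, emit disparity/depth */ }

/* Producer-consumer locality: each intermediate block is consumed
   right after it is produced, so only one block is live at a time. */
void depth_extract(const Block *img0, const Block *img1, Block *depth, size_t nblocks)
{
    Block smooth0, smooth1, edge0, edge1;   /* intermediate streams, one block each */
    for (size_t i = 0; i < nblocks; i++) {
        gaussian(&img0[i], &smooth0);  laplacian(&smooth0, &edge0);
        gaussian(&img1[i], &smooth1);  laplacian(&smooth1, &edge1);
        sad(&edge0, &edge1, &depth[i]);
    }
}

Even in this toy form, the key property is visible: each intermediate block is dead as soon as the next kernel has consumed it, so it never needs to travel to memory.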

The stream program of figure 1 exposes both parallelism and locality. Each element of each input stream (all of the image pixels) can be processed simultaneously, exposing large amounts of data parallelism. This parallelism is particularly easy to identify in a stream program because the structure of the program makes the dependencies between kernels explicit; the complex disambiguation required when intermediate data is passed through memory arrays is not needed. Within each kernel, instruction-level parallelism is exposed, since many independent operations can execute in parallel. Finally, the kernels can operate in parallel, working on pixels or frames in a pipelined manner to expose thread-level parallelism.

The stream program also exposes two types of locality: kernel and producer-consumer. During the execution of a kernel, all references are to variables local to the kernel, except for values read from the input stream(s) and written to the output stream(s). This is kernel locality. Consider one implementation of the 7x7 convolution kernel. The inputs and outputs for the operations in the kernel require 176 references to values in local register files for every word accessed from the SRF (stream register file). Thus, because of kernel locality, 176 out of every 177 references are local to the kernel.

Producer-consumer locality is exposed by the streams flowing between kernels. As one kernel produces stream elements, the next kernel consumes these elements in sequence. By appropriately sequencing kernels, the elements of an intermediate stream can be kept local: values are consumed soon after they are produced. Each time the Gaussian kernel in figure 1 generates a block of the smoothed pixel stream, for example, this block can be consumed by the Laplacian kernel. Only a block of the intermediate stream exists at any point in time, and this block can be kept in local storage.

EXPLOITING PARALLELISM AND LOCALITY

As shown in the block diagram of figure 2, a stream processor consists of a scalar processor, a stream memory system, an I/O system, and a stream execution unit, which in turn consists of a microcontroller and an array of C arithmetic clusters. Each cluster contains a portion of the SRF, a collection of A arithmetic units, a set of local register files, and a local switch. A local register file is associated with each arithmetic unit, and a global switch allows the clusters to exchange data.

[Figure 2: block diagram of a stream processor. A scalar processor and a memory controller and I/O system (with DRAM banks) feed a stream execution unit consisting of a microcontroller and an array of clusters (CL); each cluster has an SRF lane, local register files (LRF), and a local switch (SW).]

A stream processor executes an instruction set extended with kernel execution and stream load and store instructions. The scalar processor fetches all instructions. It executes scalar instructions itself, dispatches kernel execution instructions to the microcontroller and arithmetic clusters, and dispatches stream load and store instructions to the memory or I/O system. For each kernel execution instruction, the microcontroller starts execution of a microprogram, broadcasting VLIW (very-long instruction word) instructions across the clusters until the kernel has completed for all records in the current block.

The large number, C x A, of arithmetic units in a stream processor exploits the parallelism of a stream program. A stream processor exploits data parallelism by operating on C stream elements in parallel, one on each cluster, under SIMD (single-instruction, multiple-data) control of the microcontroller. The instruction-level parallelism of a kernel is exploited by the multiple arithmetic units in each cluster, which are controlled by the VLIW instructions issued by the microcontroller. If needed, thread-level parallelism can be exploited by operating multiple stream execution units in parallel. Research has shown that typical stream programs have sufficient data and instruction-level parallelism for media applications to keep more than 1,000 arithmetic units productively employed.2
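A toy software model of this execution style follows (our illustration; a real kernel is VLIW microcode broadcast by the microcontroller, not scalar C, and the per-element work here is a stand-in):

#include <stddef.h>

enum { C_CLUSTERS = 8 };   /* C: arithmetic clusters, as in the Imagine prototype */

static int kernel_step(int x) { return 2 * x + 1; }   /* stand-in per-element work */

/* SIMD across clusters: every cluster runs the same kernel step on a
   different stream element; the microcontroller broadcasts one VLIW
   instruction at a time to all clusters. */
void run_kernel(const int *in, int *out, size_t n)
{
    for (size_t base = 0; base < n; base += C_CLUSTERS)       /* one round of SIMD issue */
        for (int c = 0; c < C_CLUSTERS && base + c < n; c++)  /* conceptually parallel */
            out[base + c] = kernel_step(in[base + c]);
}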

The exposed register hierarchy of the stream processor exploits the locality of a stream program. Kernel locality is exploited by keeping almost all kernel variables in local register files immediately adjacent to the arithmetic units in which they are used. These local register files provide very high bandwidth and very low power for accessing local variables. Producer-consumer locality is exploited via the SRF. A producing kernel, such as the Gaussian kernel in figure 1, generates a block of an intermediate stream into the SRF, each cluster writing to its local portion of the SRF. A consuming kernel, such as the Laplacian kernel in figure 1, then consumes the block of stream elements directly from the SRF.

To see how parallelism and locality are exploited in practice, figure 3 shows how the depth extraction program of figure 1 is mapped to the stream processor of figure 2. The input images are read from external memory or an I/O device into the SRF one block at a time, and then the Gaussian kernel is run. Each cluster performs the kernel on a different pixel of the input, reading each pixel of the input block from the SRF and writing each pixel of the output block to the SRF. Most local variables for the kernels are kept in the local register files, with some partial sums cycled through the SRF. Overall, for each word accessed from memory or I/O, 23 words are referenced from the SRF and 317 are referenced from local registers.

[Figure 3: the depth extractor mapped to the stream processor. Rows of pixels stream through the convolution (Gaussian), convolution (Laplacian), and SAD kernels via the SRF, producing blurred rows, sharpened rows, filtered row segments, and depth map row segments, with partial sums carried between kernel invocations.]

This high fraction of references from local registers is not unique to the depth extractor. Figure 4 shows that kernel locality and producer-consumer locality exist in a broad range of applications and that a stream processor can successfully exploit this locality. The figure shows the bandwidth from main memory, the SRF, and local register files for six applications. The first column (depth) shows the bandwidth for the depth extractor of figure 1. MPEG is an MPEG-2 encoder, including motion estimation. QRD is a QR decomposition using the Householder method. STAP (space-time adaptive processing) is an adaptive beam-forming application. Render is an OpenGL 3D graphics-rendering program. RTSL (real-time shading language) is a renderer with a programmable shader.3 For all of these programs, more than 95 percent of all references are from the local registers, and less than 0.5 percent are from external memory.

[Figure 4: bandwidth consumed from main memory, the SRF, and local register files for the six applications.]
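The blockwise staging just described can also be overlapped with kernel execution, a software view of the proactive SRF fetch discussed below. A schematic sketch (ours; load_block, load_done, and run_kernels are hypothetical stand-ins for stream load instructions and kernel dispatch):

#include <stdbool.h>
#include <stddef.h>

#define BLOCK_WORDS 2048   /* sized so two buffers fit in the SRF (placeholder) */

extern void load_block(int dst[BLOCK_WORDS], size_t idx);       /* async stream load (assumed) */
extern bool load_done(int buf[BLOCK_WORDS]);                    /* completion test (assumed) */
extern void run_kernels(const int src[BLOCK_WORDS], size_t idx);/* kernel dispatch (assumed) */

/* Double buffering in the SRF: fetch block i+1 while kernels consume block i,
   so data is ready before it is needed instead of being demand-fetched. */
void stream_loop(size_t nblocks)
{
    static int srf[2][BLOCK_WORDS];      /* two SRF-resident buffers */
    load_block(srf[0], 0);
    for (size_t i = 0; i < nblocks; i++) {
        if (i + 1 < nblocks) load_block(srf[(i + 1) & 1], i + 1);  /* prefetch next */
        while (!load_done(srf[i & 1])) ;                           /* wait for block i */
        run_kernels(srf[i & 1], i);
    }
}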

While a conventional microprocessor or DSP can benefit from the locality and parallelism exposed by a stream program, it is unable to fully realize the parallelism and locality of streaming. A conventional processor has only a few arithmetic units (typically fewer than four, compared with hundreds for a stream processor) and thus cannot exploit much of the parallelism exposed by a stream program. It also realizes little kernel locality because it has too few processor registers (typically fewer than 32, compared with thousands for a stream processor) to capture the working set of a kernel. A processor's cache memory is unable to exploit much of the producer-consumer locality because there is little reuse of consumed data (the data is read once and discarded). Also, a cache is reactive, waiting for data to be requested before fetching it; in contrast, data is proactively fetched into an SRF so that it is ready when needed. Finally, a cache replaces data without regard to its liveness (using a least-recently-used or random replacement strategy) and often discards data that is still needed, whereas an SRF is managed by a compiler in such a manner that only dead data (data that is no longer of interest) is replaced to make room for new data.

EFFICIENCY

Most of the energy consumed by a modern microprocessor or DSP goes to data and instruction movement, not to performing arithmetic. As illustrated in Table 1, for a 0.13µm (micrometer) process operating from a 1.2V supply, a simple 32-bit RISC processor consumes 500 pJ to execute an instruction,4 whereas a single 32-bit arithmetic operation requires only 5 pJ. Only 1 percent of the energy consumed by the instruction is used to perform arithmetic; the remaining 99 percent goes to overhead. This overhead is divided between instruction overhead (reading the instruction from a cache, updating the program counter, instruction decode, transmitting the instruction through a series of pipeline registers, etc.) and data overhead (reading data from a cache, reading and writing a multiport register file, transmitting operands and intermediate results through pipeline registers and bypass multiplexers, etc.).

TABLE 1: Energy Per Operation (0.13µm, 1.2V)
  32-bit arithmetic operation     5 pJ
  32-bit register read           10 pJ
  32-bit 8KB RAM read            50 pJ
  32-bit traverse 10mm wire     100 pJ
  Execute instruction           500 pJ

A stream processor exploits data and instruction locality to reduce this overhead so that approximately 30 percent of the energy is consumed by arithmetic operations. On the data side, the locality shown in figure 4 keeps most data movements over short wires, consuming little energy. The distributed register organization, with a number of small local register files connected by a cluster switch, is significantly more efficient than a single global register file.5 Also, the SRF is accessed only once every 20 operations on average, compared with a data cache that is accessed once every three operations on average, greatly reducing memory access energy. On the instruction side, the energy required to read a microinstruction from the microcode memory is amortized across the data-parallel clusters of a stream processor. Also, kernel microinstructions are simpler and hence have less control overhead than the RISC instructions executed by the scalar processor.
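A back-of-envelope model built from Table 1 shows where these factors come from. The RISC figure follows directly from the table; the stream-processor side rests on our own assumptions (a VLIW microinstruction costing about as much to fetch and sequence as a RISC instruction, amortized over 8 clusters x 6 ALUs, and roughly 1 pJ per local register file access), which happen to land near the stated 30 percent:

#include <stdio.h>

int main(void)
{
    /* Table 1 (0.13um, 1.2V), in picojoules */
    const double alu = 5.0, risc_insn = 500.0;

    /* Conventional RISC: 500 pJ per instruction, 5 pJ of it for arithmetic. */
    printf("RISC: %.1f%% of energy is arithmetic\n", 100.0 * alu / risc_insn);

    /* Stream processor, toy model: one VLIW microinstruction drives
       8 clusters x 6 ALUs = 48 operations, so its fetch/sequencing cost
       (assumed comparable to a RISC instruction's) is amortized 48 ways.
       Operands come from small local register files (assumed ~1 pJ each). */
    const double ops_per_vliw = 8 * 6;
    const double insn = risc_insn / ops_per_vliw;   /* ~10.4 pJ per op */
    const double data = 2 * 1.0;                    /* two LRF reads, assumed 1 pJ each */
    printf("Stream: ~%.0f%% of energy is arithmetic\n",
           100.0 * alu / (alu + insn + data));      /* ~29%, near the stated 30% */
    return 0;
}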

TIME VERSUS SPACE MULTIPLEXING

A stream processor time-multiplexes its hardware over the kernels of an application. All of the clusters work together on one kernel, each operating on different data; then they all proceed to the next kernel, and so on. This is shown on the left side of figure 5. In contrast, many tiled architectures (DSP arrays) are space-multiplexed: each kernel runs continuously on a different tile, processing the data stream in sequence, as shown on the right side of figure 5. The clusters of a stream processor exploit data parallelism, whereas the tiles of a DSP array exploit thread-level parallelism.

[Figure 5: time multiplexing versus space multiplexing. On the left, all four clusters execute kernels K1 through K4 in sequence; on the right, each of four tiles runs one kernel continuously.]

Time multiplexing has two significant advantages over space multiplexing: load balance and instruction efficiency. As shown in figure 5, with time multiplexing the load is perfectly balanced across the clusters; all of the clusters are busy all of the time. With space multiplexing, on the other hand, the tiles that perform shorter kernels are idle much of the time as they wait for the tile running the longest kernel to finish. The load is not balanced across the tiles: tile 0 (the bottleneck tile) is busy all of the time, while the other tiles are idle much of the time. Particularly when kernel execution time is data dependent (as with many compression algorithms), load balancing a space-multiplexed architecture is impossible. A time-multiplexed architecture, on the other hand, is always perfectly balanced. This often results in a 2x to 3x improvement in efficiency.

By exploiting data parallelism rather than thread-level parallelism, a time-multiplexed architecture also uses its instruction bandwidth more efficiently. Fetching an instruction is costly in terms of energy: the instruction pointer is incremented, an instruction cache is accessed, and the instruction must be decoded. The energy required to perform these operations often exceeds the energy consumed by the arithmetic carried out by the instruction. On a space-multiplexed architecture, each instruction is used exactly once, and thus this instruction cost is added directly to the cost of each instruction. On a time-multiplexed architecture, however, the energy cost of an instruction is amortized across the parallel clusters that all execute the same instruction in parallel. This results in an additional 2x to 3x improvement in efficiency.
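The load-balance argument is easy to quantify. In the sketch below (ours, with made-up kernel runtimes), time multiplexing reaches 100 percent utilization while space multiplexing reaches 50 percent, a 2x difference:

#include <stdio.h>

int main(void)
{
    /* Per-element runtimes of four kernels on one processing element (made up). */
    const double k[4] = { 4.0, 1.0, 2.0, 1.0 };
    const int tiles = 4;

    double sum = 0.0, max = 0.0;
    for (int i = 0; i < 4; i++) { sum += k[i]; if (k[i] > max) max = k[i]; }

    /* Time multiplexing: all clusters run every kernel, so every cluster is
       busy for the whole schedule. Space multiplexing: each kernel owns a
       tile; the pipeline advances at the rate of the slowest kernel, so
       tile i is busy only k[i]/max of the time. */
    printf("time-multiplexed utilization : 100%%\n");
    printf("space-multiplexed utilization: %.0f%%\n", 100.0 * sum / (tiles * max));
    return 0;
}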
STREAM PROGRAMMING TOOLS

Mapping an application to a stream processor involves two steps: kernel scheduling, in which the operations of each kernel are scheduled on the arithmetic units of a cluster, and stream scheduling, in which kernel executions and data transfers are scheduled to use the SRF efficiently and to maximize data locality. We have developed a set of programming tools that automate both of these tasks so that a stream processor can be programmed entirely in C without sacrificing efficiency.

Our kernel scheduler takes a kernel described in kernel C and compiles it to a VLIW microprogram. This compilation uses communication scheduling6 to map each operation to a cycle number and arithmetic unit, while simultaneously scheduling the data movement necessary to provide operands. The compiler software-pipelines inner loops, converting data parallelism to instruction-level parallelism where it is required to keep all arithmetic units busy. To handle conditional (if-then-else) structures across the SIMD clusters, the compiler uses predication and conditional streams.7 Figure 6 shows the schedule for the 7x7 convolution kernel from the depth extractor of figure 1, compiled to the Imagine stream processor (described later). Time is shown on the vertical axis and function units on the horizontal axis. The kernel scheduler is able to keep the multipliers (columns 4 and 5) busy nearly every cycle.

[Figure 6: schedule for the 7x7 convolution kernel from the depth extraction application, with one column per function unit (ADD0-ADD2, MUL0-MUL1, DIV0, INP0-INP3, OUT0-OUT1, SP_0, COM0, MC_0, JUK0, VAL0) and time running vertically; shown both as a single-iteration schedule and with software pipelining.]

The stream scheduler schedules not only the transfers of blocks of streams between memory, I/O devices, and the SRF, but also the execution of kernels. This task is comparable to scheduling DMA (direct memory access) transfers between off-chip memory and I/O, which must be performed manually for most conventional DSPs. The stream scheduler accepts a C++ program and outputs machine code for the scalar processor, including stream load and store, I/O, and kernel execution instructions. The stream scheduler optimizes the block size so that the largest possible streams are transferred and operated on at a time without overflowing the capacity of the SRF. This optimization is similar to the use of cache blocking on conventional processors and to strip-mining of loops on vector processors. Figure 7 shows a stream schedule for an OpenGL polygon renderer; time is shown vertically and space in the SRF horizontally.

[Figure 7: stream schedule for the OpenGL graphics pipeline, showing SRF allocation over time for kernels such as transform (matrix, viewport), project, lights, shader, rasterize (span prep, span convert), hash, compact, sort, merge, and Z compare, along with the depth and image buffers.]
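In effect, the stream scheduler emits a strip-mined loop. A schematic of the block-size trade-off (our sketch; SRF_WORDS matches the Imagine prototype described later, while pick_block and process_block are hypothetical):

#include <stddef.h>

#define SRF_WORDS (32 * 1024)     /* SRF capacity in words (Imagine-sized) */

/* Choose the largest block such that all live streams of the schedule
   fit in the SRF at once: input, two intermediates, and output here. */
static size_t pick_block(size_t live_streams)
{
    return SRF_WORDS / live_streams;
}

extern void process_block(const int *in, int *out, size_t n);  /* kernels (assumed) */

void strip_mine(const int *in, int *out, size_t total)
{
    size_t block = pick_block(4);                 /* 4 live streams in this schedule */
    for (size_t i = 0; i < total; i += block) {
        size_t n = (total - i < block) ? total - i : block;
        process_block(in + i, out + i, n);        /* one SRF-resident strip at a time */
    }
}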

THE IMAGINE STREAM PROCESSOR

Imagine,8 shown in figure 8, is a prototype stream processor fabricated in a 0.18µm CMOS process. Imagine contains eight arithmetic clusters, each with six 32-bit floating-point arithmetic units: three adders, two multipliers, and one divide-square-root (DSQ) unit. With the exception of the DSQ unit, all units are fully pipelined and support 8-, 16-, and 32-bit integer operations as well as 32-bit floating-point operations. Each input of each arithmetic unit has a separate local register file of sixteen or thirty-two 32-bit words. The SRF has a capacity of 32K 32-bit words (128KB) and can read 16 words per cycle (two words per cluster). The clusters are controlled by a 576-bit microinstruction; the microcontrol store holds 2K such instructions. The memory system interfaces to four 32-bit-wide SDRAM banks and reorders memory references to optimize bandwidth. Imagine also includes a network interface and router for connecting to I/O devices and for combining multiple Imagines for larger signal-processing tasks.

[Figure 8: die plot of the Imagine prototype processor, showing the microcontroller, the SRF, clusters 0 through 7, memory banks Mbank 0 through Mbank 3, and the SC, NI (network interface), and HI blocks.]

CHALLENGES

Stream processors depend on parallelism and locality for their efficiency. For an application to stream well, there must be sufficient parallel work to keep all of the arithmetic units in all of the clusters busy. The parallelism need not be regular, and the work performed on each stream element need not be of the same type or even the same amount. If there is not enough work to go around, however, many of the stream processor's resources will idle and efficiency will suffer. For this reason, stream processors cannot efficiently handle some control-intensive applications that are dominated by a single sequential thread of control with little data parallelism.

A streaming application must also have sufficient kernel and producer-consumer locality to keep global bandwidth from becoming a bottleneck. A program that makes random memory references and does little work with each result fetched, for example, would be limited by global bandwidth and would not benefit from streaming. Happily, most signal processing applications have adequate data parallelism and locality.

Even for those applications that do stream well, inertia represents a significant barrier to the adoption of stream processors. Though it is easy to program a stream processor in C, learning to use the stream programming tools and writing a complex streaming application still represents a significant effort.

For evolutionary applications, it is often easier to reuse the existing code base for a conventional DSP, or the existing netlist for an ASIC, than to develop new streaming code. An application must require both efficiency and flexibility to overcome this inertia.

THE FUTURE IS STREAMS

Figure 9 shows a roadmap that illustrates how we expect stream processors to evolve with improving semiconductor process technology. The figure shows two lines of evolution: the top line represents floating-point stream processors (which, like Imagine, support 32-bit floating-point operations), and the bottom line represents fixed-point processors that support just 8-, 16-, and 32-bit integer operations. Integer operations are sufficient for most signal processing operations and, as the figure indicates, are significantly more efficient in terms of both area and power. Each point in the roadmap represents a stream processor (integer or floating-point) implemented in a particular technology with a nominal die size of 1 cm2 (1.3 cm2 for the floating-point processors). For each point, the figure shows the performance and power at full voltage and the performance and power at reduced voltage (for more efficient operation). The performance, power, and area may be scaled up or down over a wide range by varying the number of clusters in the stream processor.9

[Figure 9: roadmap plotting projected performance and power for floating-point and fixed-point stream processors across process generations, starting from the Imagine prototype.]
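To first order (our simplification, with illustrative numbers rather than values from figure 9), performance and power grow linearly with cluster count on top of a fixed overhead for the scalar core, microcontroller, and memory system:

#include <stdio.h>

int main(void)
{
    /* First-order scaling model (ours): each cluster adds the same peak
       op rate and power; a fixed overhead covers everything else.
       All numbers below are illustrative, not measured. */
    const double gops_per_cluster = 16.0, watts_per_cluster = 0.08, overhead_w = 0.2;

    for (int c = 8; c <= 64; c *= 2)
        printf("%2d clusters: %6.1f GOPS, %5.2f W, %5.1f GOPS/W\n",
               c, c * gops_per_cluster, overhead_w + c * watts_per_cluster,
               (c * gops_per_cluster) / (overhead_w + c * watts_per_cluster));
    return 0;
}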

The main competition for stream processors comes from fixed-function (ASIC or ASSP) processors. Though ASICs have efficiency as good as or better than stream processors, they are costly to design and lack flexibility. It takes about $15 million and 18 months to design a high-performance signal-processing ASIC for each application, and this cost is increasing as semiconductor technology advances. In contrast, a single stream processor can be reused across many applications with no incremental design cost, and software for a typical application can be developed in about six months for about $4 million.10 In addition, this flexibility improves efficiency in applications where multiple modes must be supported: the same resources can be reused across the modes, rather than requiring dedicated resources for each mode that sit idle when the system is operating in a different mode. Flexibility also permits new algorithms and functions to be implemented easily. Often the performance and efficiency advantage of a new algorithm greatly outweighs the small advantage of an ASIC over a stream processor.

FPGAs are flexible but lack efficiency and programmability. Because of the overhead of gate-level configurability, processors implemented with FPGAs have an efficiency of 2 to 10 MOPS/mW, comparable to that of conventional processors and DSPs. Newer FPGAs include large function blocks such as multipliers and microprocessors to partly address this efficiency issue. Also, though FPGAs are flexible, they are not programmable in a high-level language: manual design to the register-transfer level is required for an FPGA, just as for an ASIC. Advanced compilers may someday ease the programming burden of FPGAs.

With competitive energy efficiency, lower recurring costs, and the advantages of flexibility, we expect stream processors to replace ASICs in the most demanding signal-processing applications.

REFERENCES

1. Kanade, T., Yoshida, A., Oda, K., Kano, H., and Tanaka, M. A stereo machine for video-rate dense depth mapping and its new applications. Proceedings of the 15th Computer Vision and Pattern Recognition Conference (June 1996).

2. Khailany, B., Dally, W. J., Rixner, S., Kapasi, U. J., Owens, J. D., and Towles, B. Exploring the VLSI scalability of stream processors. Proceedings of the Ninth International Symposium on High Performance Computer Architecture (Feb. 2003).

3. Owens, J. D., Khailany, B., Towles, B., and Dally, W. J. Comparing Reyes and OpenGL on a stream architecture. Siggraph/Eurographics Workshop on Graphics Hardware (Sept. 2002).

4. A high-end superscalar processor may consume 10 nJ or more per instruction because of the much greater overheads required for acceleration techniques such as branch prediction, register renaming, and out-of-order execution.

5. Rixner, S., Dally, W. J., Khailany, B., Mattson, P., Kapasi, U. J., and Owens, J. D. Register organization for media processing. Proceedings of the Sixth International Symposium on High Performance Computer Architecture (Jan. 2000).

6. Mattson, P., Dally, W. J., Rixner, S., Kapasi, U. J., and Owens, J. D. Communication scheduling. Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (Nov. 2000).

7. Kapasi, U. J., Dally, W. J., Rixner, S., Mattson, P. R., Owens, J. D., and Khailany, B. Efficient conditional operations for data-parallel architectures. Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec. 2000).

8. Khailany, B., Dally, W. J., Rixner, S., Kapasi, U. J., Mattson, P., Namkoong, J., Owens, J. D., Towles, B., and Chang, A. Imagine: Media processing with streams. IEEE Micro 21, 2 (Mar./Apr. 2001).

9. See reference 2.

10. Robles, R. The cost/benefit ratio of ASICs in wireless baseband modems. Communications Design Conference (Oct. 2003); archive/papers/2003/p212.htm.

WILLIAM J. DALLY is the Willard R. and Inez Kerr Bell Professor of Engineering at Stanford University. Dally has done pioneering development work at Bell Telephone Laboratories, Caltech, and the Massachusetts Institute of Technology, where he was a professor of electrical engineering and computer science. At Stanford University, his group has developed the Imagine processor, which introduced the concepts of stream processing and partitioned register organizations. Dally has worked with Cray Research and Intel to incorporate many of these innovations in commercial parallel computers, and with Avici Systems to incorporate this technology into Internet routers; he cofounded Velio Communications to commercialize high-speed signaling technology and cofounded a startup to commercialize stream processor technology. He is a fellow of the IEEE and the ACM and has received numerous honors, including the ACM Maurice Wilkes award. He has published more than 150 papers in these areas and is an author of the textbooks Digital Systems Engineering (Cambridge University Press, 1998) and Principles and Practices of Interconnection Networks (Morgan Kaufmann, 2003).

UJVAL J. KAPASI expects to receive his doctorate from Stanford University in February 2004. His research interests include computer architecture, scientific computing, and language and compiler design. While at Stanford, he was an architect of the Imagine stream processor and contributed to the VLSI implementation of an Imagine prototype. Recently, he cofounded a company that is commercializing the stream-processing technology developed at Stanford.

BRUCEK KHAILANY received his Ph.D. from Stanford University in 2002, where he was the principal VLSI designer of the Imagine stream processor. Recently, as a cofounder of the same company, he has been working on the commercialization of stream processors in a variety of application areas. He is a member of the IEEE and the ACM, was an Intel Foundation fellowship recipient at Stanford, and received a BSEE from the University of Michigan.
JUNG HO AHN is a Ph.D. candidate in electrical engineering at Stanford University. His research interests include computer architecture, stream architecture, advanced memory design, and compiler design. Ahn received an MS in electrical engineering from Stanford. He is a student member of the ACM and the IEEE.

ABHISHEK DAS is a Ph.D. candidate in electrical engineering at Stanford University. He received his B.Tech in computer science and engineering from IIT Kharagpur, India. His research interests include compiler and language design, computer systems software, and computer architecture. He has worked on making TCP/IP feasible for Bluetooth technology at the IBM Research Center, India. He is currently involved in the development and enhancement of the compiler and architecture for the Merrimac and Imagine processors.


More information

Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP

Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP Presenter: Course: EEC 289Q: Reconfigurable Computing Course Instructor: Professor Soheil Ghiasi Outline Overview of M.I.T. Raw processor

More information

Exploring the VLSI Scalability of Stream Processors

Exploring the VLSI Scalability of Stream Processors Exploring the VLSI Scalability of Stream Processors Brucek Khailany, William J. Dally, Scott Rixner, Ujval J. Kapasi, John D. Owens, and Brian Towles Computer Systems Laboratory Computer Systems Laboratory

More information

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

IBM Cell Processor. Gilbert Hendry Mark Kretschmann IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Pipelining and Vector Processing Chapter 8 S. Dandamudi Outline Basic concepts Handling resource conflicts Data hazards Handling branches Performance enhancements Example implementations Pentium PowerPC

More information

TRIPS: Extending the Range of Programmable Processors

TRIPS: Extending the Range of Programmable Processors TRIPS: Extending the Range of Programmable Processors Stephen W. Keckler Doug Burger and Chuck oore Computer Architecture and Technology Laboratory Department of Computer Sciences www.cs.utexas.edu/users/cart

More information

A Data-Parallel Genealogy: The GPU Family Tree

A Data-Parallel Genealogy: The GPU Family Tree A Data-Parallel Genealogy: The GPU Family Tree Department of Electrical and Computer Engineering Institute for Data Analysis and Visualization University of California, Davis Outline Moore s Law brings

More information

RISC Processors and Parallel Processing. Section and 3.3.6

RISC Processors and Parallel Processing. Section and 3.3.6 RISC Processors and Parallel Processing Section 3.3.5 and 3.3.6 The Control Unit When a program is being executed it is actually the CPU receiving and executing a sequence of machine code instructions.

More information

Lecture 4: RISC Computers

Lecture 4: RISC Computers Lecture 4: RISC Computers Introduction Program execution features RISC characteristics RISC vs. CICS Zebo Peng, IDA, LiTH 1 Introduction Reduced Instruction Set Computer (RISC) is an important innovation

More information

The Bifrost GPU architecture and the ARM Mali-G71 GPU

The Bifrost GPU architecture and the ARM Mali-G71 GPU The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016 Introduction to ARM Soft IP ARM licenses Soft IP cores (amongst other things) to our

More information

3.1 Description of Microprocessor. 3.2 History of Microprocessor

3.1 Description of Microprocessor. 3.2 History of Microprocessor 3.0 MAIN CONTENT 3.1 Description of Microprocessor The brain or engine of the PC is the processor (sometimes called microprocessor), or central processing unit (CPU). The CPU performs the system s calculating

More information

Architectures of Flynn s taxonomy -- A Comparison of Methods

Architectures of Flynn s taxonomy -- A Comparison of Methods Architectures of Flynn s taxonomy -- A Comparison of Methods Neha K. Shinde Student, Department of Electronic Engineering, J D College of Engineering and Management, RTM Nagpur University, Maharashtra,

More information

Design space exploration for real-time embedded stream processors

Design space exploration for real-time embedded stream processors Design space exploration for real-time embedded stream processors Sridhar Rajagopal, Joseph R. Cavallaro, and Scott Rixner Department of Electrical and Computer Engineering Rice University sridhar, cavallar,

More information

Continuum Computer Architecture

Continuum Computer Architecture Plenary Presentation to the Workshop on Frontiers of Extreme Computing: Continuum Computer Architecture Thomas Sterling California Institute of Technology and Louisiana State University October 25, 2005

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

COMPUTER ORGANISATION CHAPTER 1 BASIC STRUCTURE OF COMPUTERS

COMPUTER ORGANISATION CHAPTER 1 BASIC STRUCTURE OF COMPUTERS Computer types: - COMPUTER ORGANISATION CHAPTER 1 BASIC STRUCTURE OF COMPUTERS A computer can be defined as a fast electronic calculating machine that accepts the (data) digitized input information process

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.

Objective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers. CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

A Reconfigurable Multifunction Computing Cache Architecture

A Reconfigurable Multifunction Computing Cache Architecture IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 9, NO. 4, AUGUST 2001 509 A Reconfigurable Multifunction Computing Cache Architecture Huesung Kim, Student Member, IEEE, Arun K. Somani,

More information