Impact of Power Density Limitation in Gigascale Integration for the SIMD Pixel Processor


Sek M. Chai, Antonio Gentile, D. Scott Wills
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia

Abstract

Gigascale Integration (GSI) enables a new generation of monolithic focal plane processing systems built with billion-transistor chips. As this technology matures, fundamental technology limitations on wire interconnects and power dissipation will become the performance bottleneck. This paper presents system performance projections for GSI technologies under these constraints. Architectural models and workload characterization are integrated to identify viable future system implementations. The SIMD Pixel processor (SIMPil) is selected as the architecture for evaluation, and an image processing application suite is programmed to characterize the workload. Projections for SIMPil systems show that over three orders of magnitude improvement is achievable by 2012 in both system throughput and image resolution. System power consumption is contained below 50 Watts for a 52,900 processor system in 50 nm technology. The SIMPil architecture design space is explored, and opportunities for more aggressive designs within power density limits are examined.

1.0 Introduction

The National Technology Roadmap for Semiconductors (NTRS) projects a two-billion-transistor monolithic chip by 2012 [11]. At Gigascale transistor-density levels, the power consumption of a chip can easily extend to a level beyond its heat extraction or battery supply capabilities. If performance is expected to double every year, power dissipation, which is growing at approximately 10 Watts per year for general-purpose microprocessors, can outpace the power savings gained from technology scaling [9][12].
To improve the balance between power consumption and performance, architecture and application must be studied together to find the full system impact on targeted domains. This paper evaluates the SIMD Pixel processor (SIMPil), a fine-grain architecture for focal plane image processing. An application suite covering different aspects of image processing (image compression, filtering, and analysis) is programmed for SIMPil [7]. Applications are simulated to extract average workload characteristics, instruction histograms, and system concurrency to determine functional unit utilization under realistic operating conditions. This approach provides an accurate estimation of power dissipation and efficiency of each functional unit.

VLSI layout information is extracted from the current design and expressed in terms of silicon area, transistor count, and total capacitance for each functional unit. This information forms the system implementation description for analysis. Technology parameters are extracted from [11] and used to define the technology scenarios. Combining system and technology descriptions, system performance parameters such as instruction throughput, image resolution, area consumption, and power dissipation are projected. Clock frequency in a technology scenario is chosen in agreement with the limit imposed by power density, instead of using clock frequency values as projected in the roadmap.

SIMPil is an embedded architecture that is a good candidate for focal plane image processing. Unlike current general-purpose microprocessors, the architecture reduces datapath complexity by specializing for image processing domains. Interprocessor communication paths are near-neighbor to maintain short wire lengths. In comparison, new architectural features in existing general-purpose microprocessors are offering diminishing returns as complexity and long broadcast wires become design bottlenecks [13]. Focal plane image processing applications are stream-oriented, and large caches in general-purpose microprocessors are not efficiently used because every stream element is read exactly once [5].

Many different SIMD systems have been proposed [2][8][10], which offer the required I/O and computational throughput to handle image processing applications. However, their performance and generality come at the expense of I/O coupling, power consumption, and portability. As the goal of this paper is to determine performance parameters of SIMPil in future technology, only a single system is evaluated. This paper will show that SIMPil can maintain the balance in performance within the power density limits imposed by technology scenarios projected by [11].

The rest of the paper is organized as follows.
Section 2 describes the architecture of the SIMPil system being developed at Georgia Tech. Section 3 presents a table of symbols and definitions. Section 4 presents a profile of the image processing applications implemented on SIMPil and the workload characteristics. Section 5 introduces the modeling effort incorporated in a Technology Scenario Analyzer (TeSA) tool to project system parameters for different technologies using semiconductor roadmap projections. Section 6 presents results and evaluation. Conclusions are offered in Section 7.

2.0 SIMPil System Architecture

The SIMD Pixel Processor (SIMPil) is a focal plane image processing system which employs area-array I/O to access the processors directly. The SIMPil design explores the benefits of integrating an image sensor array with a high-performance multiprocessor computing plane. This monolithic integration of image sensors and digital processing elements is the key feature of the SIMPil system. In SIMPil, the image stream flows directly from the focal plane into the processing plane, retaining its spatial correlation, as depicted in Figure 1.

Figure 1: The SIMPil system. Image streams are optically focused onto the sensor array, and hence mapped onto the processing engine in a single operation.

The SIMPil architecture consists of a mesh of SIMD processors. A block diagram for a 16-bit implementation is illustrated in Figure 2. The instruction set architecture allows a single processing element (PE) to address a 4 × 4 array of image sensors. Each processor incorporates an analog to digital converter to convert light intensities, incident on the sensors, into digital values. The SAMPLE instruction simultaneously collects all sensor values and makes them available for further processing. Each processing element is a simplified RISC processor that contains the following functional units (FU):
16 bit ALU with adder/subtractor and barrel shifter;
Multiply-accumulator unit with a 32 bit accumulator register;
16 three-ported general purpose and special registers;
64 words of local memory (256 maximum words);
Communication and serial I/O units;
Masking unit to control PE activity.

Figure 2: Block diagram of a 16-bit implementation of a SIMPil PE. The diagram shows the communication unit (connected to neighboring PEs), the arithmetic, logical, and shift unit, the register file (16 by 16 bit, 2 read, 1 write), the multiply accumulator, the image sensor subarray with ADC, the local memory (64 words), the special registers and I/O, and the decoder. Each PE is directly interfaced to a small array of image sensors. PEs are connected together via a NEWS mesh.

SIMPil PEs are connected through a NEWS network. Any entry in the register file can be used as source or destination in a communication instruction. In addition, constant data can also be received (or transmitted) serially through a specialized serial I/O unit. Data reception or transmission occurs without interrupting the normal PE operation. All instructions execute in a single cycle.

Figure 3: Symbolic layout of a SIMPil16 prototype. The chip measures mm 2, and it packs about 38,590 transistors. It is fabricated in HP 0.8 µm CMOS process and housed in a 132-pin PGA.

Early prototyping efforts have proved the feasibility of direct coupling of a simple processing core with a sensor device [3]. A 16 bit prototype of a SIMPil PE was designed in 0.8 µm CMOS process and fabricated through MOSIS. The prototypes were successfully tested and run at 25 MHz. The symbolic layout of the prototype PE is shown in Figure 3. The prototype PE measures mm 2, and contains a total of 38,590 transistors. SIMPil functional units are specified in Table 1, in terms of silicon area and number of transistors. A single PE is estimated to consume about 44.1 mW at 5 V, running at 25 MHz, over the entire application workload.
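As a rough cross-check of the prototype figures, the reported 44.1 mW at 5 V and 25 MHz can be converted into an average energy per cycle and an equivalent switched capacitance. The sketch below assumes the usual dynamic-power relation P = C V^2 f with the activity factor folded into C; the function names are illustrative and are not part of the SIMPil tools.

```python
def energy_per_cycle(power_w, f_hz):
    """Average energy drawn per clock cycle, in joules."""
    return power_w / f_hz


def equivalent_switched_capacitance(power_w, vdd_v, f_hz):
    """Capacitance C satisfying P = C * Vdd^2 * f (activity folded into C)."""
    return power_w / (vdd_v ** 2 * f_hz)


# Prototype figures quoted in the text for a single PE.
PE_POWER_W = 44.1e-3
VDD_V = 5.0
F_HZ = 25e6

e_cycle = energy_per_cycle(PE_POWER_W, F_HZ)
c_eq = equivalent_switched_capacitance(PE_POWER_W, VDD_V, F_HZ)
print(f"{e_cycle * 1e9:.3f} nJ per cycle, {c_eq * 1e12:.2f} pF switched")
```

Under these assumptions a PE draws about 1.76 nJ per cycle, which corresponds to roughly 70.6 pF of equivalent switched capacitance.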

Table 1: SIMPil FU specifications for the 16-bit implementation.
Functional Units / Area (mm 2 ) / Number of Transistors
MACC ,844
MEMORY ,098
REGFILE ,974
COMM UNIT
SERIAL I/O ,006
ALU ,620
BARREL SHIFTER ,118
SLEEP UNIT
DECODER
BUS DRIVER

Large arrays of SIMPil PEs can be simulated using the SIMPil Simulator [14]. This software tool is an instruction level simulator, running under Windows95. Applications for the SIMPil system can be edited, assembled, executed, and debugged within this single integrated workbench. Metering facilities are also built into the simulator to determine the concurrency level, memory usage, and instruction histograms during execution.

3.0 Glossary

Table 2. List of symbols and their definitions
A_eff: Effective die area
A_max: Maximum die area*
A_pad: Total pad area
A_wire: Total wiring area
α: Transistor activity factor
C_o: Output load capacitance
C_Htree: Total capacitance in H-Tree
C_w: Wiring capacitance
design: Design factor
ε_o: Permittivity in vacuum
ε_r: Dielectric permittivity*
E_i: Effective energy consumption
f_c: Operating clock frequency
f_c,sys: System clock frequency
f_c,power: Power-limited clock frequency
IPC: Instructions per cycle
I_T: Average system instruction throughput
L_pe: Dimension of PE
N_FU: Number of functional units
n_gate: Logic gates in critical path
N_tranpe: Number of transistors per PE
N_pe: Total number of PEs
η_power: Power efficiency metric
η_area: Area efficiency metric
P_clk: Power from clock distribution
P_eff: Effective total power
P_max: Maximum power dissipation*
P_pad: Power dissipated in pads
PPE: Pixels to processor ratio
Res: System image resolution in pixels
ρ_tran: Maximum transistor density*
S: Scaling factor
τ_gate: Single gate delay*
V: Minimum logic Vdd*
W_i: Workload factor
W_clk: Clock wire width
U: System utilization
*Indicated technology values obtained from [11]. Other values are derived, modeled, or calculated.

4.0 Workload Characterization

The SIMPil architecture is designed for image and video processing applications. In general, this class of applications is computationally intensive and requires high throughput to handle the massive data flow in real-time. However, these applications offer a large degree of data parallelism, which is not usually exploited by sequential image processing systems. SIMPil combines focal plane image acquisition with a SIMD execution model to exploit available data parallelism and remove the I/O bottleneck. Image frames are available simultaneously at each PE in the system, and their spatial correlation is retained. To evaluate the set of architectural design choices implemented in the SIMPil system, the following image-processing applications have been implemented and simulated using the SIMPil16 Simulator. Details on the implementations are offered elsewhere [3][7].

Spatial filtering. The implementation performs 2D convolution-based filtering. Operations such as shadowing, edge detection, and smoothing are executed using appropriate 3 × 3 filter masks.

Discrete Fourier transform. The 2D Discrete Fourier Transform has been implemented using a matrix multiplication algorithm. The original image is transformed along rows first, then along columns. The weight matrices are preloaded into the system, and they are rearranged to support the nearest-neighbor communication scheme available on SIMPil. Fixed-point arithmetic is used to implement the algorithm.

Morphological filtering. Basic morphological operations (erosion, dilation) have been implemented using a 3 × 3 structuring element. These operations are implemented as intersection and union of shifted versions of the original image. More complex operations, such as opening, closing, inside edge detection, and skeletonization are then implemented by combining the two basic operations.

Wavelet decomposition. Discrete wavelet decomposition has been implemented for fingerprint compression and archival. Standard Daubechies filters have been used to implement the low/high pass filters. A row-column scheme decomposes a gray-level image into 61 frequency bands.

Image rotation. A parallel rotation algorithm has been implemented to perform fast rotations of binary images. The rotation angle γ is first expressed as
$$\gamma = \alpha + n\,\frac{\pi}{2}, \qquad \alpha \in \left[0, \frac{\pi}{2}\right),$$
with n an integer. Rotations are then executed in two stages: a skew-based rotation by the angle α, and then a set of n fast ninety-degree rotations. This scheme is well suited for a SIMD implementation with regular communication patterns.

Image labeling. This implementation is based on a cluster analysis algorithm. It is used to classify objects in a binary image on the basis of object diameter. The objects are then labeled accordingly.

Quadtree region representation. This implementation operates on binary images to generate a quadtree representation. Quadtrees are based on the principle of recursive decomposition of space. The image is first decomposed into four equal-sized quadrants. If a quadrant is not uniform (entirely filled/empty), it is further decomposed into four more subquadrants. The

decomposition stops when uniform quadrants are encountered, or the quadrant contains a single pixel.

Region identification. In this implementation, a small region of interest is identified using chromatic information. Several stages are executed to complete the task, including binarization, quadtree generation, region isolation, and region zooming.

Larger applications, such as JPEG image encoding and region clustering, are currently being implemented by integrating various components. The above applications were simulated and the instruction histograms were generated. As this paper focuses on the design of SIMPil PEs, scalar instructions have been excluded from the analysis. The instructions executed in each PE have been divided among the different functional units, and the results are listed in Table 3, along with the average system utilization.

Table 3: Workload characterization. Average system and functional unit utilizations are given for each application. Only instructions executed in the PE are considered to compute the utilization of each FU.
Columns: Applications; System Utilization (%); Functional Units Utilization (%) for ALU, MACC, SHIFT, MEM, COMM, MASK, PIXEL.
Rows: IED, SKL, LBL, WLT, QTREE, SKEW, RING, SF, DFT, REGION.
IED: Inside Edge Detection; SKL: Skeletonization; LBL: Image Labeling; WLT: Wavelet Decomposition; QTREE: Quad Tree Decomposition; ROT Skew: Skew-based Rotation; ROT Ring: 90° Ring Rotation; SF: Spatial Filtering; DFT: Discrete Fourier Transform; REGION: Region Identification.

This application set characterizes a typical workload for the SIMPil architecture. Two elements in particular will be considered in the architecture models discussed in the next section: the system utilization (U), and the workload factor (W_i) for a SIMPil PE. These values are averaged over the entire set of applications and are listed in Table 4. This characterization is done on a per cycle basis because the power analysis is a rate measurement of energy consumption. SIMPil performance over the workload is detailed elsewhere [7].

Table 4: Average system utilization and workload factors for a SIMPil PE.
System Utilization (U): 71.61%
Workload Factors (W_i):
ALU: 33.60%
MACC: 3.43%
SHIFT: 5.04%
MEM: 28.46%
COMM: 14.18%
MASK: 14.85%
PIXEL: 0.44%
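The recursive decomposition behind the quadtree region representation described in this section can be illustrated with a short sequential sketch. This is not the SIMPil SIMD implementation, whose details are given in [7]; it only assumes a square binary image whose side is a power of two, and the names build_quadtree and count_leaves are hypothetical.

```python
def build_quadtree(image, row=0, col=0, size=None):
    """Return 0 or 1 for a uniform block, otherwise a list of four
    sub-trees in the order NW, NE, SW, SE."""
    if size is None:
        size = len(image)
    first = image[row][col]
    uniform = all(image[r][c] == first
                  for r in range(row, row + size)
                  for c in range(col, col + size))
    # A single pixel is always uniform, so recursion stops there at the latest.
    if uniform:
        return first
    half = size // 2
    return [
        build_quadtree(image, row, col, half),
        build_quadtree(image, row, col + half, half),
        build_quadtree(image, row + half, col, half),
        build_quadtree(image, row + half, col + half, half),
    ]


def count_leaves(tree):
    """Number of uniform quadrants in the decomposition."""
    if isinstance(tree, list):
        return sum(count_leaves(sub) for sub in tree)
    return 1


image = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
tree = build_quadtree(image)
print(tree, count_leaves(tree))
```

For this 4 × 4 example, three quadrants are uniform at the first level and only the lower-right quadrant is split further, giving seven leaves in total.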

5.0 Architecture Modeling

A TEchnology Scenario Analyzer (TeSA) tool has been built to project future system performance. TeSA incorporates application characteristics, such as system utilization (U) and workload factor (W_i), with architectural and technology models. Architectural models are defined by VLSI layout information and expressed in terms of silicon area, transistor count, and total capacitance for each functional unit. Technology parameters are extracted from the semiconductor roadmap and used to define the technology scenarios. This section presents salient features of TeSA. Power and area reduction factors are described along with capacitance calculation and technology scaling. System sizes are calculated from transistor densities. Selected performance parameters such as clock frequency, power dissipation, system pixel resolution, and sustained throughput are determined.

Clock Frequency Model

SIMPil is evaluated in terms of power efficiency (η_power) and area efficiency (η_area) metrics by considering throughput per power consumed (Mops/Joule) and throughput per silicon area consumed (Mops/s·mm²). The following equations illustrate these metrics:
$$\eta_{power} = \frac{I_T\, f_c}{P_{eff}}, \qquad \eta_{area} = \frac{I_T\, f_c}{A_{eff}}$$
The efficiency metrics are functions of instruction throughput, clock frequency, and resource cost such as power and area. The system instruction throughput (I_T) is calculated from the average concurrency of the system (U), the single PE instruction throughput (IPC), and the total number of processing elements (N_pe). P_eff is the effective power, calculated from the maximum system power (P_max) reduced by the power consumed by pads and clock distribution. A_eff is the effective silicon area, calculated from the maximum die size (A_max) reduced by the area consumed by pads, bus wiring, and inter-node routing.

The system clock frequency (f_c,sys) is determined from the critical path gate depth (n_gate) and a single gate delay (τ_gate). This value does not account for the limit posed by the maximum power that a heat sink can remove from the chip. A maximum clock frequency (f_c,power) can be calculated from the maximum power density for SIMPil. f_c,power is a function of the application workload factor (W_i) and the effective energy consumption (E_i). f_c,sys and f_c,power are described by the following equations:
$$f_{c,sys} = \frac{1}{n_{gate}\,\tau_{gate}}, \qquad f_{c,power} = \frac{P_{eff}}{N_{PE} \sum_{i=1}^{N_{FU}} E_i W_i}$$
In TeSA, the operating clock frequency (f_c) is set as:
$$f_c = \min\left(f_{c,sys},\, f_{c,power}\right)$$
This approach ensures that the operating clock frequency does not exceed the upper bound set by power density limits. As a design choice for SIMPil, the clock frequency is not set to the maximum frequency possible in a given technology, but is kept at or below the value set by power density.
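The clock frequency selection described above can be sketched in a few lines. The snippet below is not TeSA code; it assumes that the numerator of the power-limited frequency is the effective power P_eff, and all numeric inputs are hypothetical values chosen only to exercise the formulas.

```python
def f_c_sys(n_gate, tau_gate_s):
    """Delay-limited clock: 1 / (critical-path gate depth * gate delay)."""
    return 1.0 / (n_gate * tau_gate_s)


def f_c_power(p_eff_w, n_pe, energies_j, workloads):
    """Power-limited clock: P_eff / (N_PE * sum_i E_i * W_i)."""
    energy_per_pe_cycle = sum(e * w for e, w in zip(energies_j, workloads))
    return p_eff_w / (n_pe * energy_per_pe_cycle)


def operating_frequency(f_sys_hz, f_power_hz):
    """TeSA rule from the text: take the smaller of the two limits."""
    return min(f_sys_hz, f_power_hz)


# Hypothetical inputs, not values from the paper.
f_sys = f_c_sys(n_gate=20, tau_gate_s=50e-12)             # about 1 GHz
f_pow = f_c_power(p_eff_w=40.0, n_pe=50_000,
                  energies_j=[1e-12, 2e-12],               # E_i per FU group
                  workloads=[0.5, 0.25])                   # W_i per FU group
f_c = operating_frequency(f_sys, f_pow)
print(f"f_c,sys = {f_sys:.3e} Hz, f_c,power = {f_pow:.3e} Hz, f_c = {f_c:.3e} Hz")
```

With these made-up numbers the power limit (about 800 MHz) is tighter than the delay limit (about 1 GHz), so the operating frequency is set by power density.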

Effective Area and Power Models

TeSA includes the effects of area and power consumption due to I/O pads and wiring interconnects. The effective area and power available for the system are calculated with the following equations.
$$A_{eff} = A_{max} - A_{pad} - A_{wire}, \qquad P_{eff} = P_{max} - P_{pad} - P_{clk}$$
Area consumed by I/O pads (A_pad) is determined as a percentage of total available area (A_max). A 0.8 µm output pad area is used as a baseline, and the appropriate percentage reduction for future technology is applied. Area consumed by internal wiring (A_wire) is also calculated as a percentage reduction, with the 0.8 µm implementation as a baseline. Power dissipated in I/O pads (P_pad) is determined as a percentage of total available power (P_max). H-SPICE simulations are used to determine power for a 0.8 µm output pad through a 64-pin PGA. This value is used as a baseline to calculate P_pad for future technologies.

Power dissipated in distributing the clock (P_clk) can be a large portion of the power budget. For the SIMPil system, an H-Tree clock distribution scheme [15] is used, as illustrated in Figure 4. The H-Tree provides a well-balanced signal propagation scheme for clock distribution. The signal paths to the next H-Tree level are equal in length. Line drivers are scaled for each level of the H-Tree proportionally to the signal path length. Total output capacitance is calculated including wire and output loads.

Figure 4. H-Tree clock distribution scheme for multi-node SIMPil System. The figure shows the line drivers, the PE dimension, and the first and second H-Tree levels. Scaled line drivers and wire lengths are calculated in terms of PE dimensions and system size.

The following equations illustrate the calculation of total capacitance and power dissipation for the entire H-Tree. The total capacitance, C_Htree, is the aggregate capacitance for each H-Tree level.
$$C_{Htree} = \sum_{i=0}^{\log_4 N_{PE} - 1} \left( \varepsilon_r \varepsilon_o W_{clk} L_{pe}\, 2^{i} + \frac{C_o N_{PE}}{2^{i}} \right)$$
The number of H-Tree levels is given by log_4 N_PE − 1. For each H-Tree level, a capacitance is calculated as the sum of two terms. The first term is dependent on signal wire length. The second term is the total output capacitance from the line drivers. Power for clock distribution will therefore increase substantially with advancing technology as the number of processing elements (N_PE) increases. P_clk is given by:
$$P_{clk} = C_{Htree}\, V^2 f_c$$
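The clock distribution power relation above lends itself to a small numerical sketch. The snippet takes the total H-Tree capacitance as a given input rather than recomputing the per-level sum, evaluates the level expression log_4 N_PE − 1 stated in the text for a PE count that is a power of four, and applies P_clk = C_Htree V^2 f_c. The function names and the numeric inputs are hypothetical.

```python
def htree_levels(n_pe):
    """Evaluate log4(N_PE) - 1 exactly for N_PE a power of four (N_PE >= 4)."""
    if n_pe < 4:
        raise ValueError("n_pe must be at least 4")
    exponent = 0
    remaining = n_pe
    while remaining > 1:
        if remaining % 4 != 0:
            raise ValueError("n_pe must be a power of four")
        remaining //= 4
        exponent += 1
    return exponent - 1


def clock_power(c_htree_f, vdd_v, f_c_hz):
    """Clock distribution power: P_clk = C_Htree * V^2 * f_c."""
    return c_htree_f * vdd_v ** 2 * f_c_hz


# Hypothetical inputs: a 64 x 64 PE array, 1 nF of clock capacitance,
# a 1 V supply, and a 1 GHz clock.
levels = htree_levels(4096)
p_clk = clock_power(c_htree_f=1e-9, vdd_v=1.0, f_c_hz=1e9)
print(levels, f"{p_clk:.2f} W")
```

Because P_clk is linear in both capacitance and frequency, a growing PE count raises clock power through C_Htree even when the supply voltage falls with scaling.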

Capacitance Scaling and Energy Consumption Models

A SIMPil processing element is divided into the following functional units: ALU, multiply accumulate unit (MACC), barrel shifter, register file, on-chip memory, communication unit, instruction decoder, sleep unit, and bus drivers. For each unit, the load capacitance (C_o) and wire capacitance (C_w) are extracted from the implemented 0.8 µm design using the MAGIC VLSI layout tool kit and H-SPICE. TeSA adopts two different scaling methodologies for transistor load capacitance and wires to account for the different scaling properties of wire interconnect and transistor drain/gate capacitances. The following equations describe the load and wire capacitance scaling.
$$C_w' = C_w\, \frac{\varepsilon_r'}{\varepsilon_r}\, \frac{1}{S}, \qquad C_o' = \frac{C_o}{S}$$
In the above equations, the primes indicate values in the future technology. The wire capacitance scales with the improvements in permittivity as well as with the reduced wire length at smaller feature sizes. Because a SIMPil processing element communicates only with its neighbors through a near-neighbor interconnection network, global communication wires are ignored. Output load capacitance scales with the feature-size scaling factor (S) [1]. Effective energy consumption during transistor switching (E_i) is calculated with the following equation.
$$E_i = \alpha \left( C_w + C_o \right) V^2$$
A transistor activity (α) is assumed for every functional unit. The application workload utilization (W_i) is used to determine the activity workload of each functional unit. Groups of functional units that are active during different instruction types are formed. Active functional units contribute to energy consumption during the operating cycle. For example, an ALU operation requires the ALU, register file, and bus drivers to be active. In comparison, a LOAD operation requires the memory, register file, and bus drivers to be active. In each instruction group, the energy terms E_i of each functional unit are summed, each in proportion to the activity of that unit. These sums are used to determine the operating clock frequency (f_c) described earlier.

Pixel Resolution and System Size

TeSA calculates the number of processing elements directly from a given technology's transistor density. From the effective die area (A_eff), the total number of transistors per monolithic chip is determined. This total transistor count is divided by the transistor count per processing element to determine the number of processing elements per chip (N_pe). This approach can provide a better approximation of system size than area scaling because area scaling for future technology may violate transistor density. The transistor density represents the maximum number of transistors in any given silicon area. Wiring area is considered by reducing the effective die area (A_eff) before calculation with transistor density. Pixel resolution (Res) for a future SIMPil system is calculated with a pixel to processor ratio, PPE. The following equations illustrate the models to calculate pixel resolution and system size.
$$Res = N_{PE} \cdot PPE, \qquad N_{PE} = \frac{\rho_{tran}\, A_{eff}}{N_{tranpe}}$$
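The scaling and sizing models of this section can be collected into a short sketch. It is a minimal illustration, not TeSA itself: the capacitance, permittivity, supply, and density inputs are hypothetical, while the 38,590 transistors per PE, the 4 × 4 sensors per PE (PPE = 16), and the 52,900 PE system size are taken from the text.

```python
def scale_wire_capacitance(c_w, eps_r_old, eps_r_new, s):
    """C_w' = C_w * (eps_r' / eps_r) / S."""
    return c_w * (eps_r_new / eps_r_old) / s


def scale_load_capacitance(c_o, s):
    """C_o' = C_o / S."""
    return c_o / s


def switching_energy(alpha, c_w, c_o, vdd):
    """E_i = alpha * (C_w + C_o) * V^2."""
    return alpha * (c_w + c_o) * vdd ** 2


def number_of_pes(rho_tran, a_eff, n_tran_pe):
    """N_PE = rho_tran * A_eff / N_tranpe, rounded down to whole PEs."""
    return int(rho_tran * a_eff // n_tran_pe)


def resolution(n_pe, ppe):
    """Res = N_PE * PPE."""
    return n_pe * ppe


# Hypothetical functional unit scaled from 800 nm to 50 nm (S = 16).
S = 800 / 50
c_w_new = scale_wire_capacitance(1.6e-12, eps_r_old=4.0, eps_r_new=2.0, s=S)
c_o_new = scale_load_capacitance(3.2e-12, s=S)
e_i = switching_energy(alpha=0.5, c_w=c_w_new, c_o=c_o_new, vdd=1.0)

# Hypothetical density (transistors per mm^2) and effective area (mm^2).
n_pe = number_of_pes(rho_tran=1_000_000, a_eff=400, n_tran_pe=38_590)

# Figures from the text: 52,900 PEs with 16 pixels each.
res = resolution(52_900, 16)
print(e_i, n_pe, res)
```

The last line reproduces the 920 × 920 pixel resolution quoted for the 50 nm system, since 52,900 × 16 = 846,400 = 920².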

6.0 Results

This section presents modeling results and evaluation of the SIMPil system in future technology. Workload characterization and architecture models are combined with technology parameters to perform detailed projections of system performance and efficiency metrics. An analysis of the design space under power density limitation is also presented.

System Performance

Important metrics to evaluate the SIMPil system in future technologies include system image resolution, clock frequency, power consumption, and instruction throughput. System image resolution describes the increase in the number of processing elements due to the increasing transistor density. For a constant PPE ratio, integrating more processing elements in a single chip results in larger image resolution. Clock frequency and power consumption are interrelated and offer some insights on performance and resource utilization. Average system instruction throughput illustrates the overall performance in executing image-processing applications.

Figure 5 shows current and projected system performance metrics. The system clock rate for the SIMPil system can increase from 50 MHz at 2 Watts power consumption (800 nm) to a projected 1.8 GHz at 50 Watts power consumption (50 nm). The projected performance parameters are subject to the chosen modeling constraint, which limits clock frequency to the power density limit, in order to ensure implementation feasibility for a given technology scenario. While power consumption does not grow linearly, the increase can be quantified as an average rate of 3.2 Watts per year, much less than the current microprocessor power consumption growth rate of 10 Watts per year [12].

Projected image resolutions and instruction throughputs increase with future technology. Projected image resolution increases from a size of 526 pixels (23 x 23) to a larger size of 850K pixels (920 x 920). Instruction throughput grows from 1.18 Gops/s to more than 70 Tops/s. While the performance growth is not linear, throughput roughly doubles every year. This upward trend demonstrates the suitability of the SIMPil architecture for GSI technology: while power increases within the limits of technology, both performance and image resolution increase to handle larger computation sizes. Technology limits, such as interconnect wiring and power density, do not hinder the performance because SIMPil is an application-specific architecture with short wire interconnects. Image processing applications and algorithms map well to the SIMPil architecture and its SIMD execution model. The SIMPil design sustains higher instruction throughput with more processing elements instead of a more complex uniprocessor system design.
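The projected figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes that sustained throughput is the product U × IPC × N_PE × f_c, which is one reading of how the text describes I_T, takes IPC = 1 because all instructions execute in a single cycle, and assumes a 15-year span between the 800 nm and 50 nm scenarios (consistent with the quoted 3.2 Watts per year between 2 W and 50 W). The function names are illustrative.

```python
U = 0.7161          # average system utilization (Table 4)
IPC = 1.0           # assumption: single-cycle instructions
PPE = 16            # 4 x 4 sensors per PE


def pes_for_resolution(side_pixels, ppe=PPE):
    """Number of PEs needed for a square image of the given side."""
    return side_pixels * side_pixels // ppe


def throughput_ops_per_s(u, ipc, n_pe, f_c_hz):
    """Assumed sustained throughput: U * IPC * N_PE * f_c."""
    return u * ipc * n_pe * f_c_hz


def annual_growth(start, end, years):
    """Average yearly multiplication factor between two values."""
    return (end / start) ** (1.0 / years)


n_pe = pes_for_resolution(920)
tput = throughput_ops_per_s(U, IPC, n_pe, 1.8e9)
growth = annual_growth(1.18e9, 70e12, years=15)   # years is an assumption
watts_per_year = (50 - 2) / 15

print(n_pe, f"{tput / 1e12:.1f} Tops/s", f"x{growth:.2f} per year",
      f"{watts_per_year:.1f} W/year")
```

With these rounded inputs the 920 × 920 resolution corresponds to 52,900 PEs, the throughput product comes out near 68 Tops/s (the same order as the reported figure of more than 70 Tops/s, with the gap plausibly due to rounding of the quoted clock rate), and the average growth factor is close to two per year.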

Figure 5. SIMPil system performance in GSI technology. System capability in image resolution (pixels), clock frequency (MHz), power consumption (W), and instruction throughput (Gop/s) is plotted against feature size (nm).

System Efficiency

Power and area efficiency metrics delineate the tradeoffs between instruction throughput and resource utilization. Increasing power efficiency suggests more capability and parallelism in the system. Increasing area efficiency implies better component utilization for given system capabilities. Higher ratings in power and area efficiency metrics are coveted for future image processing systems because of technology limitations such as wire interconnects and power density. Small processing elements with modest capability are desired. Power consumption must be contained to maintain portability, battery life, and effectiveness of the system.

Figure 6 illustrates the area and power efficiency ratings of the SIMPil system for current and future technologies. Projected ratings indicate an increasing trend for future technologies. This trend suggests the suitability of the architecture for GSI technology. The positive slope of the trend line shows improvements in both metrics for the SIMPil system. A poor system design that sacrifices area efficiency for power efficiency would have a negatively sloped trend line. This visualization method can be extended to other system implementations to determine the relationships between system efficiency metrics.
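The two efficiency metrics discussed above reduce to simple ratios once a sustained throughput in operations per second is known. The sketch below uses the rounded 50 nm figures quoted in the text (70 Tops/s and 50 Watts) for the power metric, which treats total system power as a stand-in for P_eff; the 500 mm² effective area is a purely hypothetical value, and the function names are illustrative.

```python
def power_efficiency_mops_per_joule(throughput_ops_s, power_w):
    """Throughput per unit power, in Mops/Joule."""
    return throughput_ops_s / power_w / 1e6


def area_efficiency_mops_per_s_mm2(throughput_ops_s, area_mm2):
    """Throughput per unit silicon area, in Mops/s per mm^2."""
    return throughput_ops_s / area_mm2 / 1e6


throughput = 70e12                 # rounded 50 nm projection from the text
eta_power = power_efficiency_mops_per_joule(throughput, power_w=50.0)
eta_area = area_efficiency_mops_per_s_mm2(throughput, area_mm2=500.0)  # hypothetical area
print(f"{eta_power:.3g} Mops/J, {eta_area:.3g} Mops/s per mm^2")
```

A design point moves up and to the right in an area-versus-power efficiency plot only if both ratios improve together, which is the behavior the trend line in Figure 6 is meant to show.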

Figure 6. SIMPil system efficiency in current and future technologies. Area efficiency (Mop/s·mm²) is plotted against power efficiency (Mop/Joule), with points labeled by feature size.

Power Density Analysis

The previous analyses assume the same SIMPil system design for technology projections. This section presents the frequency limit imposed by power density and the impact on system design. The system clock frequency (f_c,sys) is determined by a design factor (design) and gate delay (τ_gate).
$$f_{c,sys} = \frac{design}{\tau_{gate}}$$
The design factor incorporates design techniques that govern the effective number of gates in the critical path. τ_gate changes with technology and is dependent on transistor feature size. With a known f_c,sys, the system power dissipation can be determined. It is therefore interesting to determine the power density limitations posed by GSI technology. The power density is expressed in terms of a maximum operating frequency (f_c,power). Beyond this frequency, the power dissipated exceeds the maximum power extractable from the chip. As a design choice for SIMPil, the design factor can be varied to determine the maximum value before f_c,sys exceeds f_c,power.

In Figure 7, the frequency at maximum power density (f_c,power) is plotted for different technologies. The shaded area is the region where the system operates within the allowed power density. The power density values that determine the shaded region are obtained from the semiconductor roadmap [11]. A family of f_c,sys clock frequency curves is plotted versus feature size for different design factors. For any given technology, a larger design factor increases f_c,sys, which raises the system power dissipation. Larger design factor values indicate more aggressive design implementations. For the power density limit to be observed at any given technology, the design factor must be chosen within the shaded area. The optimal design, for each technology, is found at the intersection between f_c,sys and f_c,power.
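The design factor search described above amounts to solving f_c,sys = f_c,power for the design factor. The sketch below assumes the relation f_c,sys = design / τ_gate given in this section; the gate delay and the power-limited frequency are hypothetical numbers, not roadmap values, and the function names are illustrative.

```python
def f_c_sys(design, tau_gate_s):
    """System clock frequency for a given design factor and gate delay."""
    return design / tau_gate_s


def max_design_factor(f_c_power_hz, tau_gate_s):
    """Design factor at which f_c,sys meets the power-density limit."""
    return f_c_power_hz * tau_gate_s


def within_power_limit(design, tau_gate_s, f_c_power_hz):
    """True when the design point lies inside the allowed region."""
    return f_c_sys(design, tau_gate_s) <= f_c_power_hz


TAU_GATE_S = 20e-12      # hypothetical gate delay
F_C_POWER_HZ = 2.5e9     # hypothetical power-limited frequency

design_max = max_design_factor(F_C_POWER_HZ, TAU_GATE_S)
print(design_max,
      within_power_limit(0.04, TAU_GATE_S, F_C_POWER_HZ),
      within_power_limit(0.06, TAU_GATE_S, F_C_POWER_HZ))
```

Any design factor below the computed maximum stays within the power density limit, while larger values push f_c,sys past f_c,power.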
In 50 nm technology, the SIMPil design can be optimized by increasing the design factor from its value in the current implementation. This increase in the design factor will extend power consumption to the limit set by the semiconductor roadmap. For the SIMPil system, lower frequency and lower power consumption are desired, and the current design does not need to change. The system instruction throughput in excess of 70 Tops/s shown in Figure 5 suggests sufficient processing capability for the image processing workloads at real-time frame rates (30 frames per second). The projected power consumption of 50 Watts for a 52,900 processor system remains below the power density limits in 50 nm technology.

Figure 7. System clock frequencies and power density limit for SIMPil. Clock frequency (MHz) is plotted against feature size (nm) for a family of design factor values, together with the power limit. The area above the shaded region indicates a region of operation that exceeds projected power density limits.

7.0 Conclusions

The SIMD Pixel processor (SIMPil) has been evaluated under a realistic image processing workload, characterized by high concurrency (>70%) and well-balanced resource utilization. A single die in 50 nm technology provides a total image resolution of 850K pixels (920 x 920), with a sustained system throughput in excess of 70 Tops/s. System power consumption is contained below 50 Watts for a 52,900 processor system. Moreover, SIMPil design choices are explored, and a more aggressive design is feasible before being limited by power density in 50 nm technology. These projected performance parameters demonstrate the suitability of the SIMPil architecture for GSI technology because performance and image resolution both increase while power consumption remains within technology limits. Future research will include detailed models of wire interconnect and pad resource consumption to offer a more accurate projection.

8.0 Acknowledgements

The work was supported by the Defense Advanced Research Projects Agency (Low Power Electronics Contract: FY ), the National Science Foundation/Georgia Tech Packaging Research Center (Contract: EEC ), AFOSR and ARL. The authors extend thanks to the PICA research group, especially to Mr. Huy H. Cat, Dr. Abelardo López-Lagunas, and Mr. William H. Robinson III. The authors acknowledge the application development activity performed by Dr. José Luis Cruz-Rivera and his research group at the University of Puerto Rico in Mayagüez.

9.0 References

[1] G. Baccarani, et al., Generalized Scaling Theory, IEEE Trans. on Electron Devices, pp. , April 1984
[2] K. E. Batcher, Design of the Massively Parallel Processor, IEEE Trans. on Computer, C9, v.9, pp. , 1980
[3] H. H. Cat, et al., SIMPil: An OE Integrated SIMD Architecture for Focal Plane Processing Applications, Massively Parallel Processing using Optical Interconnection (MPPOI-96), pp. 44-52, 1996
[4] A. P. Chandrakasan, et al., Low-power CMOS digital design, IEEE Journal of Solid-State Circuits, 27, pp.
[5] K. Diefendorff and R. Dubey, How Multimedia Workloads Will Change Processor Design, IEEE Computer, Vol. 30, No. 9, September 1997, pp.
[6] E. Fossum, Digital Camera System on a Chip, IEEE Micro, pp. 8-15, May
[7] A. Gentile, et al., Real-Time Image Processing on a Focal Plane SIMD Array, to appear in Proceedings of the Seventh International Workshop on Parallel and Distributed Real-Time Systems, San Juan, Puerto Rico,
[8] W. D. Hillis, The Connection Machine, The MIT Press, 1985
[9] J. D. Meindl, Low Power Microelectronics: Retrospect and Prospect, Proceedings IEEE, Vol. 83, No. 4, pp. , April
[10] J. R. Nickolls, The Design of the MasPar MP-1: A Cost-Effective Massively Parallel Computer, IEEE Digest of Papers - Compcon, pp. 25-28, 1990
[11] The National Technology Roadmap for Semiconductors, Semiconductor Industry Association,
[12] V. G. Oklobdzija, Architectural Tradeoffs for Low Power, Intl. Symp. on Computer Architecture, June
[13] S. Palacharla, et al., Complexity-Effective Superscalar Processors, Intl. Symp. on Computer Architecture, 1997, pp.
[14] SIMPil Home Page,
[15] N. H. E. Weste, K. Eshraghian, Principles of CMOS VLSI System Design: A System Perspective, Addison-Wesley, Reading, Massachusetts,
[16] D. S. Wills, et al., Processing Architectures for Smart Pixel Systems, IEEE Journal of Selected Topics in Quantum Electronics, v.2 n.1, April 1996, pp.


Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

Understanding Sources of Inefficiency in General-Purpose Chips

Understanding Sources of Inefficiency in General-Purpose Chips Understanding Sources of Inefficiency in General-Purpose Chips Rehan Hameed Wajahat Qadeer Megan Wachs Omid Azizi Alex Solomatnikov Benjamin Lee Stephen Richardson Christos Kozyrakis Mark Horowitz GP Processors

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

ECE 172A: Introduction to Intelligent Systems: Machine Vision, Fall Midterm Examination

ECE 172A: Introduction to Intelligent Systems: Machine Vision, Fall Midterm Examination ECE 172A: Introduction to Intelligent Systems: Machine Vision, Fall 2008 October 29, 2008 Notes: Midterm Examination This is a closed book and closed notes examination. Please be precise and to the point.

More information

Artifacts and Textured Region Detection

Artifacts and Textured Region Detection Artifacts and Textured Region Detection 1 Vishal Bangard ECE 738 - Spring 2003 I. INTRODUCTION A lot of transformations, when applied to images, lead to the development of various artifacts in them. In

More information

Calibrating Achievable Design GSRC Annual Review June 9, 2002

Calibrating Achievable Design GSRC Annual Review June 9, 2002 Calibrating Achievable Design GSRC Annual Review June 9, 2002 Wayne Dai, Andrew Kahng, Tsu-Jae King, Wojciech Maly,, Igor Markov, Herman Schmit, Dennis Sylvester DUSD(Labs) Calibrating Achievable Design

More information

Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology

Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology Senthil Ganesh R & R. Kalaimathi 1 Assistant Professor, Electronics and Communication Engineering, Info Institute of Engineering,

More information

Increasing interconnection network connectivity for reducing operator complexity in asynchronous vision systems

Increasing interconnection network connectivity for reducing operator complexity in asynchronous vision systems Increasing interconnection network connectivity for reducing operator complexity in asynchronous vision systems Valentin Gies and Thierry M. Bernard ENSTA, 32 Bd Victor 75015, Paris, FRANCE, contact@vgies.com,

More information

By Charvi Dhoot*, Vincent J. Mooney &,

By Charvi Dhoot*, Vincent J. Mooney &, By Charvi Dhoot*, Vincent J. Mooney &, -Shubhajit Roy Chowdhury*, Lap Pui Chau # *International Institute of Information Technology, Hyderabad, India & School of Electrical and Computer Engineering, Georgia

More information

Low Power Set-Associative Cache with Single-Cycle Partial Tag Comparison

Low Power Set-Associative Cache with Single-Cycle Partial Tag Comparison Low Power Set-Associative Cache with Single-Cycle Partial Tag Comparison Jian Chen, Ruihua Peng, Yuzhuo Fu School of Micro-electronics, Shanghai Jiao Tong University, Shanghai 200030, China {chenjian,

More information

4. Networks. in parallel computers. Advances in Computer Architecture

4. Networks. in parallel computers. Advances in Computer Architecture 4. Networks in parallel computers Advances in Computer Architecture System architectures for parallel computers Control organization Single Instruction stream Multiple Data stream (SIMD) All processors

More information

Continuum Computer Architecture

Continuum Computer Architecture Plenary Presentation to the Workshop on Frontiers of Extreme Computing: Continuum Computer Architecture Thomas Sterling California Institute of Technology and Louisiana State University October 25, 2005

More information

Parameterized Convolution Filtering in a Field Programmable Gate Array

Parameterized Convolution Filtering in a Field Programmable Gate Array Parameterized Convolution Filtering in a Field Programmable Gate Array Richard G. Shoup Interval Research Palo Alto, California 94304 Abstract This paper discusses the simple idea of parameterized program

More information