Impact of Power Density Limitation in Gigascale Integration for the SIMD Pixel Processor
Sek M. Chai, Antonio Gentile, D. Scott Wills
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia
{sek, gentile,

Abstract

Gigascale Integration (GSI) enables a new generation of monolithic focal plane processing systems built with billion-transistor chips. As this technology matures, fundamental technology limitations on wire interconnects and power dissipation will become the performance bottleneck. This paper presents system performance projections for GSI technologies under these constraints. Architectural models and workload characterization are integrated to identify viable future system implementations. The SIMD Pixel processor (SIMPil) is selected as the architecture for evaluation, and an image processing application suite is programmed to characterize the workload. Projections for SIMPil systems show that over three orders of magnitude improvement is achievable by 2012 in both system throughput and image resolution. System power consumption is contained below 50 Watts for a 52,900 processor system in 50 nm technology. The SIMPil architecture design space is explored, and opportunities for more aggressive designs within power density limits are examined.

1.0 Introduction

The National Technology Roadmap for Semiconductors (NTRS) projects a two-billion-transistor monolithic chip by 2012 [11]. At Gigascale transistor-density levels, the power consumption of a chip can easily exceed its heat extraction or battery supply capabilities. If performance is expected to double every year, power dissipation, which is growing at approximately 10 Watts per year for general-purpose microprocessors, can outpace the power savings gained from technology scaling [9][12].
To improve the balance between power consumption and performance, architecture and application must be studied together to find the full system impact on targeted domains. This paper evaluates the SIMD Pixel processor (SIMPil), a fine-grain architecture for focal plane image processing. An application suite covering different aspects of image processing (image compression, filtering, and analysis) is programmed for SIMPil [7]. Applications are simulated to extract average workload characteristics, instruction histograms, and system concurrency to determine functional unit utilization under realistic operating conditions. This approach provides an accurate estimation of the power dissipation and efficiency of each functional unit.
VLSI layout information is extracted from the current design and expressed in terms of silicon area, transistor count, and total capacitance for each functional unit. This information forms the system implementation description for analysis. Technology parameters are extracted from [11] and used to define the technology scenarios. Combining system and technology descriptions, system performance parameters such as instruction throughput, image resolution, area consumption, and power dissipation are projected. The clock frequency in a technology scenario is chosen in agreement with the limit imposed by power density, instead of using the clock frequency values projected in the roadmap.

SIMPil is an embedded architecture that is a good candidate for focal plane image processing. Unlike current general-purpose microprocessors, the architecture reduces datapath complexity by specializing for image processing domains. Interprocessor communication paths are near-neighbor to maintain short wire lengths. In comparison, new architectural features in existing general-purpose microprocessors are offering diminishing returns as complexity and long broadcast wires become design bottlenecks [13]. Focal plane image processing applications are stream-oriented, and large caches in general-purpose microprocessors are not efficiently used because every stream element is read exactly once [5].

Many different SIMD systems have been proposed [2][8][10], which offer the required I/O and computational throughput to handle image processing applications. However, their performance and generality come at the expense of I/O coupling, power consumption, and portability. As the goal of this paper is to determine performance parameters of SIMPil in future technology, only a single system is evaluated. This paper will show that SIMPil can maintain the balance in performance within the power density limits imposed by the technology scenarios projected by [11].

The rest of the paper is organized as follows.
Section 2 describes the architecture of the SIMPil system being developed at Georgia Tech. Section 3 presents a table of symbols and definitions. Section 4 presents a profile of the image processing applications implemented on SIMPil and the workload characteristics. Section 5 introduces the modeling effort incorporated in a Technology Scenario Analyzer (TeSA) tool to project system parameters for different technologies using semiconductor roadmap projections. Section 6 presents results and evaluation. Conclusions are offered in Section 7.

2.0 SIMPil System Architecture

The SIMD Pixel Processor (SIMPil) is a focal plane image processing system that employs area-array I/O to give direct access to the processors. The SIMPil design explores the benefits of integrating an image sensor array with a high-performance multiprocessor computing plane. This monolithic integration of image sensors and digital processing elements is the key feature of the SIMPil system. In SIMPil, the image stream flows directly from the focal plane into the processing plane, retaining its spatial correlation, as depicted in Figure 1.
Figure 1: The SIMPil system. Image streams are optically focused onto the sensor array, and hence mapped onto the processing engine in a single operation.

The SIMPil architecture consists of a mesh of SIMD processors. A block diagram for a 16-bit implementation is illustrated in Figure 2. The instruction set architecture allows a single processing element (PE) to address a 4 × 4 array of image sensors. Each processor incorporates an analog-to-digital converter to convert light intensities, incident on the sensors, into digital values. The SAMPLE instruction simultaneously collects all sensor values and makes them available for further processing. Each processing element is a simplified RISC processor that contains the following functional units (FU):

- 16-bit ALU with adder/subtractor and barrel shifter;
- Multiply-accumulator unit with a 32-bit accumulator register;
- 16 three-ported general purpose and special registers;
- 64 words of local memory (256 maximum words);
- Communication and serial I/O units;
- Masking unit to control PE activity.

Figure 2: Block diagram of a 16-bit implementation of a SIMPil PE. Each PE is directly interfaced to a small array of image sensors. PEs are connected together via a NEWS mesh.
SIMPil PEs are connected through a NEWS network. Any entry in the register file can be used as source or destination in a communication instruction. In addition, constant data can also be received (or transmitted) serially through a specialized serial I/O unit. Data reception or transmission occurs without interrupting the normal PE operation. All instructions execute in a single cycle.

Figure 3: Symbolic layout of a SIMPil16 prototype. The chip measures mm 2, and it packs about 38,590 transistors. It is fabricated in the HP 0.8 µm CMOS process and housed in a 132-pin PGA.

Early prototyping efforts have proved the feasibility of directly coupling a simple processing core with a sensor device [3]. A 16-bit prototype of a SIMPil PE was designed in a 0.8 µm CMOS process and fabricated through MOSIS. The prototypes were successfully tested and run at 25 MHz. The symbolic layout of the prototype PE is shown in Figure 3. The prototype PE measures mm 2, and contains a total of 38,590 transistors. SIMPil functional units are specified in Table 1, in terms of silicon area and number of transistors. A single PE is estimated to consume about 44.1 mW at 5 V, running at 25 MHz, over the entire application workload.
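The prototype figures quoted above (44.1 mW at 5 V and 25 MHz) can be turned into a per-cycle energy and an effective switched capacitance. The sketch below is only a back-of-the-envelope check: it assumes that all of the PE power is dynamic switching power of the form P = C·V²·f, and the function names are illustrative rather than taken from the paper.

```python
def energy_per_cycle(power_w, clock_hz):
    """Average energy spent per clock cycle, in joules."""
    return power_w / clock_hz


def effective_switched_capacitance(power_w, vdd_v, clock_hz):
    """Activity-weighted capacitance implied by P = C * Vdd^2 * f, in farads.

    Assumption: all power is dynamic switching power (no leakage or
    short-circuit current).
    """
    return power_w / (vdd_v ** 2 * clock_hz)


# Prototype PE figures quoted in the text.
P_PE = 44.1e-3   # W
VDD = 5.0        # V
F_CLK = 25e6     # Hz

e_cycle = energy_per_cycle(P_PE, F_CLK)
c_eff = effective_switched_capacitance(P_PE, VDD, F_CLK)
print(f"energy per cycle: {e_cycle * 1e9:.3f} nJ")
print(f"effective switched capacitance: {c_eff * 1e12:.2f} pF")
```

Under these assumptions a prototype PE spends roughly 1.8 nJ per cycle, which corresponds to about 70 pF of activity-weighted capacitance.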
Table 1: SIMPil FU specifications for the 16-bit implementation.

Functional Unit     Area (mm 2)     Number of Transistors
MACC                                ,844
MEMORY                              ,098
REGFILE                             ,974
COMM UNIT
SERIAL I/O                          ,006
ALU                                 ,620
BARREL SHIFTER                      ,118
SLEEP UNIT
DECODER
BUS DRIVER

Large arrays of SIMPil PEs can be simulated using the SIMPil Simulator [14]. This software tool is an instruction-level simulator running under Windows 95. Applications for the SIMPil system can be edited, assembled, executed, and debugged within this single integrated workbench. Metering facilities are also built into the simulator to determine the concurrency level, memory usage, and instruction histograms during execution.

3.0 Glossary

Table 2: List of symbols and their definitions.

Symbol        Definition
A_eff         Effective die area
A_max         Maximum die area*
A_pad         Total pad area
A_wire        Total wiring area
α             Transistor activity factor
C_o           Output load capacitance
C_Htree       Total capacitance in H-Tree
C_w           Wiring capacitance
design        Design factor
ε_o           Permittivity in vacuum
ε_r           Dielectric permittivity*
E_i           Effective energy consumption
f_c           Operating clock frequency
f_c,sys       System clock frequency
f_c,power     Power-limited clock frequency
IPC           Instructions per cycle
I_T           Average system instruction throughput
L_pe          Dimension of PE
N_FU          Number of functional units
n_gate        Logic gates in critical path
N_tranpe      Number of transistors per PE
N_pe          Total number of PEs
η_power       Power efficiency metric
η_area        Area efficiency metric
P_clk         Power from clock distribution
P_eff         Effective total power
P_max         Maximum power dissipation*
P_pad         Power dissipated in pads
PPE           Pixels to processor ratio
Res           System image resolution in pixels
ρ_tran        Maximum transistor density*
S             Scaling factor
τ_gate        Single gate delay*
V             Minimum logic Vdd*
W_i           Workload factor
W_clk         Clock wire width
U             System utilization

*Indicated technology values obtained from [11]. Other values are derived, modeled, or calculated.
4.0 Workload Characterization

The SIMPil architecture is designed for image and video processing applications. In general, this class of applications is computationally intensive and requires high throughput to handle the massive data flow in real time. However, these applications offer a large degree of data parallelism, which is not usually exploited by sequential image processing systems. SIMPil combines focal plane image acquisition with a SIMD execution model to exploit available data parallelism and remove the I/O bottleneck. Image frames are available simultaneously at each PE in the system, and their spatial correlation is retained.

To evaluate the set of architectural design choices implemented in the SIMPil system, the following image processing applications have been implemented and simulated using the SIMPil16 Simulator. Details on the implementations are offered elsewhere [3][7].

Spatial filtering. The implementation performs 2D convolution-based filtering. Operations such as shadowing, edge detection, and smoothing are executed using appropriate 3 × 3 filter masks.

Discrete Fourier transform. The 2D Discrete Fourier Transform has been implemented using a matrix multiplication algorithm. The original image is transformed along rows first, then along columns. The weight matrices are preloaded into the system, and they are rearranged to support the nearest-neighbor communication scheme available on SIMPil. Fixed-point arithmetic is used to implement the algorithm.

Morphological filtering. Basic morphological operations (erosion, dilation) have been implemented using a 3 × 3 structuring element. These operations are implemented as intersection and union of shifted versions of the original image. More complex operations, such as opening, closing, inside edge detection, and skeletonization, are then implemented by combining the two basic operations.

Wavelet decomposition. Discrete wavelet decomposition has been implemented for fingerprint compression and archival.
Standard Daubechies filters have been used to implement the low/high pass filters. A row-column scheme decomposes a gray-level image into 61 frequency bands.

Image rotation. A parallel rotation algorithm has been implemented to perform fast rotations of binary images. The rotation angle γ is first expressed as

$$\gamma = \alpha + n\,\frac{\pi}{2}, \qquad \alpha \in \left[0, \frac{\pi}{2}\right),$$

where n is an integer. Rotations are then executed in two stages: a skew-based rotation of the angle α, and then a set of n fast ninety-degree rotations. This scheme is well suited for a SIMD implementation with regular communication patterns.

Image labeling. This implementation is based on a cluster analysis algorithm. It is used to classify objects in a binary image on the basis of object diameter. The objects are then labeled accordingly.

Quadtree region representation. This implementation operates on binary images to generate a quadtree representation. Quadtrees are based on the principle of recursive decomposition of space. The image is first decomposed into four equal-sized quadrants. If a quadrant is not uniform (entirely filled or entirely empty), it is further decomposed into four subquadrants. The decomposition stops when uniform quadrants are encountered, or the quadrant contains a single pixel.

Region identification. In this implementation, a small region of interest is identified using chromatic information. Several stages are executed to complete the task, including binarization, quadtree generation, region isolation, and region zooming.

Larger applications, such as JPEG image encoding and region clustering, are currently being implemented by integrating various components.

The above applications were simulated and the instruction histograms were generated. As this paper focuses on the design of SIMPil PEs, scalar instructions have been excluded from the analysis. The instructions executed in each PE have been divided among the different functional units, and the results are listed in Table 3, along with the average system utilization.

Table 3: Workload characterization. Average system and functional unit utilizations are given for each application. Only instructions executed in the PE are considered to compute the utilization of each FU.
Columns: Applications; System Utilization (%); Functional Unit Utilization (%) for ALU, MACC, SHIFT, MEM, COMM, MASK, PIXEL.
Rows: IED, SKL, LBL, WLT, QTREE, SKEW, RING, SF, DFT, REGION.
IED: Inside Edge Detection; SKL: Skeletonization; LBL: Image Labeling; WLT: Wavelet Decomposition; QTREE: Quad Tree Decomposition; ROT Skew: Skew-based Rotation; ROT Ring: 90° Ring Rotation; SF: Spatial Filtering; DFT: Discrete Fourier Transform; REGION: Region Identification.

This application set characterizes a typical workload for the SIMPil architecture. Two elements in particular will be considered in the architecture models discussed in the next section: the system utilization (U), and the workload factor (W_i) for a SIMPil PE. These values are averaged over the entire set of applications and are listed in Table 4. This characterization is done on a per-cycle basis because power is a rate measurement of energy consumption.
SIMPil performance over the workload is detailed elsewhere [7].

Table 4: Average system utilization and workload factors for a SIMPil PE.

System Utilization (U)      71.61%

Workload Factors (W_i)
ALU       33.60%
MACC       3.43%
SHIFT      5.04%
MEM       28.46%
COMM      14.18%
MASK      14.85%
PIXEL      0.44%
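A workload factor of this kind can be read as the fraction of PE instructions that exercise a given functional unit. The following minimal sketch shows one way such factors could be derived from an instruction histogram. The helper name `workload_factors` and the sample histogram are hypothetical and are not the simulator's actual output format; only the `TABLE4_W` values come from Table 4.

```python
def workload_factors(instruction_counts):
    """Normalize per-FU instruction counts into fractions that sum to 1.

    `instruction_counts` maps a functional-unit name to the number of PE
    instructions that used it (a hypothetical histogram format).
    """
    total = sum(instruction_counts.values())
    if total == 0:
        raise ValueError("histogram contains no instructions")
    return {fu: count / total for fu, count in instruction_counts.items()}


# Workload factors as listed in Table 4, expressed as fractions.
TABLE4_W = {
    "ALU": 0.3360, "MACC": 0.0343, "SHIFT": 0.0504, "MEM": 0.2846,
    "COMM": 0.1418, "MASK": 0.1485, "PIXEL": 0.0044,
}

# The tabulated factors partition the PE instruction stream, so they sum to 1.
print(f"sum of Table 4 factors: {sum(TABLE4_W.values()):.4f}")

# Made-up histogram, for illustration only.
example = workload_factors({"ALU": 300, "MEM": 100})
print(example)
```

Because the factors in Table 4 sum to 100%, they can be used directly as weights when averaging per-instruction energy.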
5.0 Architecture Modeling

A TEchnology Scenario Analyzer (TeSA) tool has been built to project future system performance. TeSA incorporates application characteristics, such as system utilization (U) and workload factor (W_i), with architectural and technology models. Architectural models are defined by VLSI layout information and expressed in terms of silicon area, transistor count, and total capacitance for each functional unit. Technology parameters are extracted from the semiconductor roadmap and used to define the technology scenarios.

This section presents salient features of TeSA. Power and area reduction factors are described along with capacitance calculation and technology scaling. System sizes are calculated from transistor densities. Selected performance parameters such as clock frequency, power dissipation, system pixel resolution, and sustained throughput are determined.

Clock Frequency Model

SIMPil is evaluated in terms of power efficiency (η_power) and area efficiency (η_area) metrics by considering throughput per power consumed (Mops/Joule) and throughput per silicon area consumed (Mops/s·mm²). The following equations illustrate these metrics:

$$\eta_{power} = \frac{I_T\, f_c}{P_{eff}} \qquad\qquad \eta_{area} = \frac{I_T\, f_c}{A_{eff}}$$

The efficiency metrics are functions of instruction throughput, clock frequency, and resource cost such as power and area. The system instruction throughput (I_T) is calculated from the average concurrency of the system (U), the single PE instruction throughput (IPC), and the total number of processing elements (N_pe). P_eff is the effective power, calculated from the maximum system power (P_max) reduced by the power consumed by pads and clock distribution. A_eff is the effective silicon area, calculated from the maximum die size (A_max) reduced by the area consumed by pads, bus wiring, and inter-node routing. The system clock frequency (f_c,sys) is determined from the critical path gate depth (n_gate) and a single gate delay (τ_gate).
This value does not account for the limit posed by the maximum power that a heat sink can remove from a chip. A maximum clock frequency (f_c,power) can be calculated from the maximum power density for SIMPil. f_c,power is a function of the application workload factor (W_i) and the effective energy consumption (E_i). f_c,sys and f_c,power are described by the following equations:

$$f_{c,sys} = \frac{1}{n_{gate}\,\tau_{gate}} \qquad\qquad f_{c,power} = \frac{P_{eff}}{N_{PE} \sum_{i=1}^{N_{FU}} E_i W_i}$$

In TeSA, the operating clock frequency (f_c) is set as:

$$f_c = \min\left(f_{c,sys},\; f_{c,power}\right)$$

This approach ensures that the operating clock frequency is below the upper bound set by power density limits. As a design choice for SIMPil, the clock frequency is not set to the maximum frequency possible in a given technology, but at or below the value set by power density.
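The clock selection rule above can be sketched in a few lines. This is a minimal illustration of the min() rule under stated assumptions, not the TeSA implementation: the function names are invented here, and every numeric input below is a placeholder rather than a value from the paper or the roadmap.

```python
def f_c_sys(n_gate, tau_gate_s):
    """Gate-delay-limited clock: one cycle is n_gate gate delays long."""
    return 1.0 / (n_gate * tau_gate_s)


def f_c_power(p_eff_w, n_pe, energies_j, workload_factors):
    """Power-limited clock: P_eff spread over N_PE PEs, each spending
    sum(E_i * W_i) joules per cycle on average."""
    if len(energies_j) != len(workload_factors):
        raise ValueError("need one workload factor per energy term")
    energy_per_cycle = sum(e * w for e, w in zip(energies_j, workload_factors))
    return p_eff_w / (n_pe * energy_per_cycle)


def operating_clock(f_sys_hz, f_power_hz):
    """TeSA rule from the text: run at the lower of the two limits."""
    return min(f_sys_hz, f_power_hz)


# Placeholder inputs (assumptions for illustration only).
f_sys = f_c_sys(n_gate=20, tau_gate_s=50e-12)
f_pow = f_c_power(p_eff_w=50.0, n_pe=50_000,
                  energies_j=[1e-12, 2e-12], workload_factors=[0.5, 0.5])
f_c = operating_clock(f_sys, f_pow)
print(f"f_c,sys = {f_sys / 1e6:.0f} MHz, f_c,power = {f_pow / 1e6:.0f} MHz, "
      f"f_c = {f_c / 1e6:.0f} MHz")
```

With these placeholder numbers the power limit is the binding constraint, which mirrors the design choice described in the text: the gate-delay limit is treated as an upper bound that is not necessarily reached.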
Effective Area and Power Models

TeSA includes the effects of area and power consumption due to I/O pads and wiring interconnects. The effective area and power available for the system are calculated with the following equations.

$$A_{eff} = A_{max} - A_{pad} - A_{wire} \qquad\qquad P_{eff} = P_{max} - P_{pad} - P_{clk}$$

Area consumed by I/O pads (A_pad) is determined as a percentage of total available area (A_max). A 0.8 µm output pad area is used as a baseline, and the appropriate percentage reduction for future technology is applied. Area consumed by internal wiring (A_wire) is also calculated as a percentage reduction, with the 0.8 µm implementation as a baseline. Power dissipated in I/O pads (P_pad) is determined as a percentage of total available power (P_max). H-SPICE simulations are used to determine power for a 0.8 µm output pad through a 64-pin PGA. This value is used as a baseline to calculate P_pad for future technologies.

Power dissipated in distributing the clock (P_clk) can be a large portion of the power budget. For the SIMPil system, an H-Tree clock distribution scheme [15] is used, as illustrated in Figure 4. The H-Tree provides a well-balanced signal propagation scheme for clock distribution. The signal paths to the next H-Tree level are equal in length. Line drivers are scaled for each level of the H-Tree proportionally to the signal path length. Total output capacitance is calculated including wire and output loads.

Figure 4. H-Tree clock distribution scheme for a multi-node SIMPil system. Scaled line drivers and wire lengths are calculated in terms of PE dimensions and system size.

The following equations illustrate the calculation of total capacitance and power dissipation for the entire H-Tree. The total capacitance, C_Htree, is the aggregate capacitance over the H-Tree levels:

$$C_{Htree} = \sum_{i=0}^{\log_4 N_{PE} - 1} \left( C_{wire,i} + C_{driver,i} \right)$$

Here the wire term C_wire,i is a function of ε_r, ε_o, W_clk, and L_pe, and the driver term C_driver,i is a function of C_o and N_PE; both are scaled by a power of 2 that depends on the level index i. The number of H-Tree levels is given by log_4 N_PE − 1.
For each H-Tree level, a capacitance is calculated as the sum of two terms. The first term depends on the signal wire length. The second term is the total output capacitance from the line drivers. Power for clock distribution will therefore increase substantially with advancing technology as the number of processing elements (N_PE) increases. P_clk is given by:

$$P_{clk} = C_{Htree}\, V^2 f_c$$
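The clock power relation and the effective power budget can be combined in a short sketch. The code below assumes the H-Tree capacitance has already been computed and simply applies P_clk = C_Htree·V²·f_c followed by P_eff = P_max − P_pad − P_clk. The function names and all numbers are illustrative placeholders, not values from the paper.

```python
def clock_power(c_htree_f, vdd_v, f_c_hz):
    """Dynamic power dissipated in the clock tree: P_clk = C_Htree * V^2 * f_c."""
    return c_htree_f * vdd_v ** 2 * f_c_hz


def effective_power(p_max_w, p_pad_w, p_clk_w):
    """Power left for the PEs after pads and clock distribution."""
    p_eff = p_max_w - p_pad_w - p_clk_w
    if p_eff <= 0:
        raise ValueError("pads and clock consume the whole power budget")
    return p_eff


# Placeholder values (assumptions): 1 nF of clock capacitance, 1 V supply,
# 1 GHz clock, 50 W package limit, 5 W in the pads.
p_clk = clock_power(c_htree_f=1e-9, vdd_v=1.0, f_c_hz=1e9)
p_eff = effective_power(p_max_w=50.0, p_pad_w=5.0, p_clk_w=p_clk)
print(f"P_clk = {p_clk:.2f} W, P_eff = {p_eff:.2f} W")
```

Because P_clk grows linearly with f_c and quadratically with V, raising the clock reduces the power left for computation, which is one reason the power-limited frequency matters for large arrays.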
Capacitance Scaling and Energy Consumption Models

A SIMPil processing element is divided into the following functional units: ALU, multiply-accumulate unit (MACC), barrel shifter, register file, on-chip memory, communication unit, instruction decoder, sleep unit, and bus drivers. For each unit, the load capacitance (C_o) and wire capacitance (C_w) are extracted from the implemented 0.8 µm design using the MAGIC VLSI layout tool kit and H-SPICE. TeSA adopts two different scaling methodologies for transistor load capacitance and wires to account for the different scaling properties of wire interconnect and transistor drain/gate capacitances. The following equations describe the load and wire capacitance scaling.

$$C_w' = C_w\, \frac{\varepsilon_r'}{\varepsilon_r}\, \frac{1}{S} \qquad\qquad C_o' = \frac{C_o}{S}$$

In the above equations, the prime marks indicate values in future technology. The wire capacitance scales with the improvements in permittivity as well as with the reduced wire length of smaller feature sizes. Because a SIMPil processing element communicates only with its neighbors through a near-neighbor interconnection network, global communication wires are ignored. Output load capacitance scales with the feature-size scaling factor (S) [1].

Effective energy consumption during transistor switching (E_i) is calculated with the following equation.

$$E_i = \alpha \left( C_w + C_o \right) V^2$$

A transistor activity factor (α) is assumed for every functional unit. The application workload utilization (W_i) is used to determine the activity workload of each functional unit. Groups of functional units that are active during different instruction types are formed. Active functional units contribute to energy consumption during the operating cycle. For example, an ALU operation requires the ALU, register file, and bus drivers to be active. In comparison, a LOAD operation requires the memory, register file, and bus drivers to be active.
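A small numerical sketch may help make the scaling rules concrete. It assumes that the scaling factor S is the ratio of the old feature size to the new one (for example 800 nm / 50 nm = 16), which the text does not state explicitly. The capacitance, permittivity, activity, and voltage values are placeholders, and the function names are illustrative.

```python
def scale_wire_capacitance(c_w_f, eps_r_old, eps_r_new, s):
    """Wire capacitance scaled by the permittivity ratio and by 1/S."""
    return c_w_f * (eps_r_new / eps_r_old) / s


def scale_load_capacitance(c_o_f, s):
    """Output load capacitance scaled by 1/S."""
    return c_o_f / s


def switching_energy(alpha, c_w_f, c_o_f, vdd_v):
    """Effective energy per cycle of one functional unit: alpha*(C_w+C_o)*V^2."""
    return alpha * (c_w_f + c_o_f) * vdd_v ** 2


# Placeholder 0.8 um values for one functional unit (assumptions).
S = 800 / 50                      # assumed definition of the scaling factor
c_w_new = scale_wire_capacitance(100e-15, eps_r_old=4.0, eps_r_new=2.0, s=S)
c_o_new = scale_load_capacitance(200e-15, s=S)
e_new = switching_energy(alpha=0.5, c_w_f=c_w_new, c_o_f=c_o_new, vdd_v=1.0)
print(f"C_w' = {c_w_new * 1e15:.3f} fF, C_o' = {c_o_new * 1e15:.3f} fF, "
      f"E_i' = {e_new * 1e15:.4f} fJ")
```

The wire term shrinks faster than the load term only to the extent that the dielectric improves, which is why the two capacitances are scaled separately.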
In each instruction group, the energy terms E_i of each functional unit are summed, each in proportion to the activity of that unit. These sums are used to determine the operating clock frequency (f_c) described earlier.

Pixel Resolution and System Size

TeSA calculates the number of processing elements directly from a given technology's transistor density. From the effective die area (A_eff), the total number of transistors per monolithic chip is determined. This total transistor count is divided by the transistor count per processing element to determine the number of processing elements per chip (N_pe). This approach can provide a better approximation of system size than area scaling because area scaling for future technology may violate transistor density. The transistor density represents the maximum number of transistors in any given silicon area. Wiring area is considered by reducing the effective die area (A_eff) before the calculation with transistor density. Pixel resolution (Res) for a future SIMPil system is calculated with a pixel-to-processor (PPE) ratio. The following equations illustrate the models used to calculate pixel resolution and system size.

$$N_{PE} = \frac{A_{eff}\, \rho_{tran}}{N_{tranpe}} \qquad\qquad Res = N_{PE} \cdot PPE$$
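The system size calculation can be illustrated as follows. The die area and transistor density in the example are placeholders chosen only to give a chip of roughly two billion transistors. The 38,590 transistors per PE, the 4 × 4 sensor array per PE, and the 52,900-PE system are figures quoted elsewhere in the paper. Rounding down to a whole number of PEs is an assumption of this sketch.

```python
import math


def number_of_pes(a_eff_mm2, rho_tran_per_mm2, n_tran_pe):
    """PEs per chip: transistors available in A_eff divided by transistors
    per PE, rounded down to a whole PE (rounding is an assumption here)."""
    return math.floor(a_eff_mm2 * rho_tran_per_mm2 / n_tran_pe)


def pixel_resolution(n_pe, ppe):
    """Image resolution in pixels: Res = N_PE * PPE."""
    return n_pe * ppe


N_TRAN_PE = 38_590   # transistors per prototype PE (from the text)
PPE = 16             # one PE addresses a 4 x 4 sensor array (from the text)

# Placeholder technology scenario: 500 mm^2 of effective area at
# 4.1 million transistors per mm^2.
n_pe = number_of_pes(500.0, 4.1e6, N_TRAN_PE)
print(n_pe, pixel_resolution(n_pe, PPE))

# Consistency check against the projected 50 nm system in the paper:
# 52,900 PEs at 16 pixels per PE is a 920 x 920 image.
print(pixel_resolution(52_900, PPE) == 920 * 920)
```

The last line shows that the 52,900-processor system and the 920 × 920 resolution quoted in the results agree with a pixel-to-processor ratio of 16.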
6.0 Results

This section presents modeling results and an evaluation of the SIMPil system in future technology. Workload characterization and architecture models are combined with technology parameters to perform detailed projections of system performance and efficiency metrics. An analysis of the design space under power density limitation is also presented.

System Performance

Important metrics to evaluate the SIMPil system in future technologies include system image resolution, clock frequency, power consumption, and instruction throughput. System image resolution describes the increase in the number of processing elements due to the increasing transistor density. For a constant PPE ratio, integrating more processing elements in a single chip results in larger image resolution. Clock frequency and power consumption are interrelated and offer some insights on performance and resource utilization. Average system instruction throughput illustrates the overall performance in executing image processing applications.

Figure 5 shows current and projected system performance metrics. The system clock rate for the SIMPil system can increase from the current 50 MHz at 2 Watts power consumption (800 nm) to a projected 1.8 GHz at 50 Watts power consumption (50 nm). The projected performance parameters are subject to the chosen modeling constraint, which limits clock frequency to the power density limit, in order to ensure implementation feasibility for a given technology scenario. While power consumption does not grow linearly, the increase can be quantified as a rate of 3.2 Watts per year, a rate much lower than the current microprocessor power consumption growth rate of 10 Watts per year [12].

Projected image resolutions and instruction throughputs show an increasing trend with future technology. Projected image resolution increases from a size of 526 pixels (23 x 23) to a larger size of 850K pixels (920 x 920).
Instruction throughput grows from 1.18 Gops/s to more than 70 Tops/s. While the performance growth is not linear, throughput roughly doubles every year. This upward trend demonstrates the suitability of the SIMPil architecture for GSI technology because, while power increases within the limits of technology, both performance and image resolution increase to handle larger computation sizes. Technology limits, such as interconnect wiring and power density, do not hinder the performance because SIMPil is an application-specific architecture with short wire interconnects. Image processing applications and algorithms map well to the SIMPil architecture and its SIMD execution model. The SIMPil design sustains higher instruction throughput with more processing elements rather than with a more complex uniprocessor system design.
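The growth rates quoted in this section can be cross-checked with simple arithmetic. The sketch below uses only the endpoint figures stated in the text (2 W to 50 W at 3.2 W per year, and 1.18 Gops/s to 70 Tops/s). The helper names are illustrative, and the calculation is a consistency check rather than part of the TeSA model.

```python
import math


def years_at_linear_rate(start, end, rate_per_year):
    """Years needed to go from start to end at a constant yearly increment."""
    return (end - start) / rate_per_year


def doublings(start, end):
    """Number of doublings separating two values."""
    return math.log2(end / start)


power_years = years_at_linear_rate(2.0, 50.0, 3.2)   # watts, watts per year
throughput_doublings = doublings(1.18e9, 70e12)      # ops/s
orders_of_magnitude = math.log10(70e12 / 1.18e9)

print(f"power growth spans about {power_years:.1f} years")
print(f"throughput grows by about {throughput_doublings:.1f} doublings")
print(f"throughput improves by about {orders_of_magnitude:.1f} orders of magnitude")
```

The power figures imply a span of about 15 years, and the throughput figures imply just under 16 doublings, so the two statements (3.2 Watts per year and roughly doubling every year) are consistent with each other and with the claim of more than three orders of magnitude improvement.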
Figure 5. SIMPil system performance in GSI technology. System capabilities in image resolution (pixels), clock frequency (MHz), power consumption (W), and instruction throughput (Gop/s) are plotted against feature size (nm).

System Efficiency

Power and area efficiency metrics delineate the tradeoffs between instruction throughput and resource utilization. Increasing power efficiency suggests more capability and parallelism in the system. Increasing area efficiency implies better component utilization for given system capabilities. Higher ratings in power and area efficiency metrics are desirable for future image processing systems because of technology limitations such as wire interconnects and power density. Small processing elements with modest capability are desired. Power consumption must be contained to maintain portability, battery life, and effectiveness of the system.

Figure 6 illustrates the area and power efficiency ratings of the SIMPil system for current and future technologies. Projected ratings indicate an increasing trend for future technologies. This trend suggests the suitability of the architecture for GSI technology. The positive slope of the trend line shows improvements in both metrics for the SIMPil system. A poor system design that sacrifices area efficiency for power efficiency would have a negatively sloped trend line. This visualization method can be extended to other system implementations to determine the relationships between system efficiency metrics.
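As a rough illustration of how the two efficiency metrics are formed, the sketch below computes sustained throughput as U·IPC·N_PE·f_c and divides it by power and by area. It assumes IPC = 1 (every instruction executes in a single cycle), uses the 50 W total as a stand-in for P_eff, and uses a placeholder effective area, so the resulting ratings are indicative only and are not the values plotted in Figure 6.

```python
def sustained_throughput(u, ipc, n_pe, f_c_hz):
    """System throughput in operations per second: (U * IPC * N_PE) * f_c."""
    return u * ipc * n_pe * f_c_hz


def power_efficiency(throughput_ops, power_w):
    """Operations per joule."""
    return throughput_ops / power_w


def area_efficiency(throughput_ops, area_mm2):
    """Operations per second per square millimeter."""
    return throughput_ops / area_mm2


# Figures quoted for the projected 50 nm system; IPC = 1 is an assumption
# based on single-cycle instruction execution.
throughput = sustained_throughput(u=0.7161, ipc=1, n_pe=52_900, f_c_hz=1.8e9)

eta_power = power_efficiency(throughput, power_w=50.0)   # 50 W stands in for P_eff
eta_area = area_efficiency(throughput, area_mm2=500.0)   # placeholder area

print(f"throughput  = {throughput / 1e12:.1f} Tops/s")
print(f"eta_power   = {eta_power / 1e6:.3g} Mops/Joule")
print(f"eta_area    = {eta_area / 1e6:.3g} Mops/s.mm2")
```

With these inputs the throughput comes out near 68 Tops/s, which is close to the projected figure of about 70 Tops/s and suggests that the utilization-weighted product is a reasonable reading of how throughput is obtained.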
Figure 6. SIMPil system efficiency in current and future technologies. Area efficiency (Mop/s·mm²) is plotted against power efficiency (Mop/Joule), with data points labeled by technology feature size.

Power Density Analysis

The previous analyses assume the same SIMPil system design for technology projections. This section presents the frequency limit imposed by power density and its impact on system design. The system clock frequency (f_c,sys) is determined by a design factor (design) and the gate delay (τ_gate):

$$f_{c,sys} = \frac{\mathit{design}}{\tau_{gate}}$$

The design factor incorporates design techniques that govern the effective number of gates in the critical path. τ_gate changes with technology and depends on transistor feature size. With a known f_c,sys, the system power dissipation can be determined. It is therefore interesting to determine the power density limitations posed by GSI technology. The power density is expressed in terms of a maximum operating frequency (f_c,power). Beyond this frequency, the power dissipated exceeds the maximum power extractable from the chip. As a design choice for SIMPil, the design factor can be varied to determine the maximum value before f_c,sys exceeds f_c,power.

In Figure 7, the frequency at maximum power density (f_c,power) is plotted for different technologies. The shaded area is the region where the system operates within the allowed power density. The power density values that determine the shaded region are obtained from the semiconductor roadmap [11]. A family of f_c,sys clock frequency curves is plotted versus feature size for different design factors. For any given technology, a larger design factor increases f_c,sys, which raises the system power dissipation. Larger design factors indicate more aggressive design implementations. For the power density limit to be observed at any given technology, the design factor must be chosen within the shaded area. The optimal design, for each technology, is found at the intersection between f_c,sys and f_c,power.

In 50 nm technology, the SIMPil design can be optimized by increasing the design factor above its value in the current implementation. This increase in the design factor will extend power consumption to the limit set by the semiconductor roadmap. For the SIMPil system, lower frequency and lower power consumption are desired, and the current design does not need to change. The system instruction throughput in excess of 70 Tops/s shown in Figure 5 suggests sufficient processing capability for the image processing workloads at real-time frame rates (30 frames per second). The projected power consumption of 50 Watts for a 52,900 processor system remains below the power density limits in 50 nm technology.

Figure 7. System clock frequencies and power density limit for SIMPil. Clock frequency (MHz) is plotted against feature size (nm) for a family of design factors, together with the power limit. The area above the shaded region indicates a region of operation that exceeds projected power density limits.

7.0 Conclusions

The SIMD Pixel processor (SIMPil) has been evaluated under a realistic image processing workload, characterized by high concurrency (>70%) and well-balanced resource utilization. A single die in 50 nm technology provides a total image resolution of 850K pixels (920 x 920), with a sustained system throughput in excess of 70 Tops/s. System power consumption is contained below 50 Watts for a 52,900 processor system. Moreover, SIMPil design choices are explored, and a more aggressive design is feasible before being limited by power density in 50 nm technology. These projected performance parameters demonstrate the suitability of the SIMPil architecture for GSI technology because performance and image resolution both increase while power consumption remains within technology limits. Future research will include detailed models of wire interconnect and pad resource consumption to offer a more accurate projection.
8.0 Acknowledgements

The work was supported by the Defense Advanced Research Projects Agency (Low Power Electronics Contract: FY ), the National Science Foundation/Georgia Tech Packaging Research Center (Contract: EEC ), AFOSR and ARL. The authors extend thanks to the PICA research group, especially to Mr. Huy H. Cat, Dr. Abelardo López-Lagunas, and Mr. William H. Robinson III. The authors acknowledge the application development activity performed by Dr. José Luis Cruz-Rivera and his research group at the University of Puerto Rico in Mayagüez.

9.0 References

[1] G. Baccarani, et al., Generalized Scaling Theory, IEEE Trans. on Electron Devices, April 1984.
[2] K. E. Batcher, Design of the Massively Parallel Processor, IEEE Trans. on Computer, C9, v.9, 1980.
[3] H. H. Cat, et al., SIMPil: An OE Integrated SIMD Architecture for Focal Plane Processing Applications, Massively Parallel Processing using Optical Interconnection (MPPOI-96), pp. 44-52, 1996.
[4] A. P. Chandrakasan, et al., Low-power CMOS digital design, IEEE Journal on Solid-State Circuits, 27.
[5] K. Diefendorff and R. Dubey, How Multimedia Workloads Will Change Processor Design, IEEE Computer, Vol. 30, No. 9, September 1997.
[6] E. Fossum, Digital Camera System on a Chip, IEEE Micro, pp. 8-15, May.
[7] A. Gentile, et al., Real-Time Image Processing on a Focal Plane SIMD Array, to appear in Proceedings of the Seventh International Workshop on Parallel and Distributed Real-Time Systems, San Juan, Puerto Rico.
[8] W. D. Hillis, The Connection Machine, The MIT Press, 1985.
[9] J. D. Meindl, Low Power Microelectronics: Retrospect and Prospect, Proceedings IEEE, Vol. 83, No. 4, April.
[10] J. R. Nickolls, The Design of the MasPar MP-1: A cost-effective Massively Parallel Computer, IEEE Digest of Papers - ComCom, pp. 25-28, 1990.
[11] The National Technology Roadmap for Semiconductors, Semiconductor Industry Association.
[12] V. G. Oklobdzija, Architectural Tradeoffs for Low Power, Intl. Symp. on Computer Architecture, June.
[13] S. Palacharla, et al., Complexity-Effective Superscalar Processors, Intl. Symp. on Computer Architecture, 1997.
[14] SIMPil Home Page.
[15] N. H. E. Weste, K. Eshraghian, Principles of CMOS VLSI System Design: A System Perspective, Addison-Wesley, Reading, Massachusetts.
[16] D. S. Wills, et al., Processing Architectures for Smart Pixel Systems, IEEE Journal of Selected Topics in Quantum Electronics, v.2 n.1, April 1996.