Impact of Power Density Limitation in Gigascale Integration for the SIMD Pixel Processor

Sek M. Chai, Antonio Gentile, D. Scott Wills
School of Electrical and Computer Engineering
Georgia Institute of Technology
Atlanta, Georgia
{sek, gentile,

Abstract

Gigascale Integration (GSI) enables a new generation of monolithic focal plane processing systems built with billion-transistor chips. As this technology matures, fundamental technology limitations on wire interconnects and power dissipation will become the performance bottleneck. This paper presents system performance projections for GSI technologies under these constraints. Architectural models and workload characterization are integrated to identify viable future system implementations. The SIMD Pixel processor (SIMPil) is selected as the architecture for evaluation, and an image processing application suite is programmed to characterize the workload. Projections for SIMPil systems show that over three orders of magnitude improvement is achievable by 2012 in both system throughput and image resolution. System power consumption is contained below 50 Watts for a 52,900 processor system in 50 nm technology. The SIMPil architecture design space is explored, and opportunities for more aggressive designs within power density limits are examined.

1.0 Introduction

The National Technology Roadmap for Semiconductors (NTRS) projects a two-billion-transistor monolithic chip by 2012 [11]. At Gigascale transistor-density levels, the power consumption of a chip can easily extend to a level beyond its heat extraction or battery supply capabilities. If performance is expected to double every year, power dissipation, which is growing at approximately 10 Watts per year for general-purpose microprocessors, can grow beyond the power savings gained by technology [9][12].

To improve the balance between power consumption and performance, architecture and application must be studied together to find the full system impact on targeted domains. This paper evaluates the SIMD Pixel processor (SIMPil), a fine-grain architecture for focal plane image processing. An application suite covering different aspects of image processing (image compression, filtering, and analysis) is programmed for SIMPil [7]. Applications are simulated to extract average workload characteristics, instruction histograms, and system concurrency to determine functional unit utilization under realistic operating conditions. This approach provides an accurate estimation of power dissipation and efficiency of each functional unit.

VLSI layout information is extracted from the current design and expressed in terms of silicon area, transistor count, and total capacitance for each functional unit. This information forms the system implementation description for analysis. Technology parameters are extracted from [11] and used to define the technology scenarios. Combining system and technology descriptions, system performance parameters such as instruction throughput, image resolution, area consumption, and power dissipation are projected. Clock frequency in a technology scenario is chosen in agreement with the limit imposed by power density, instead of using the clock frequency values projected in the roadmap.

SIMPil is an embedded architecture that is a good candidate for focal plane image processing. Unlike current general-purpose microprocessors, the architecture reduces datapath complexity by specializing for image processing domains. Interprocessor communication paths are near-neighbor to maintain short wire lengths. In comparison, new architectural features in existing general-purpose microprocessors are offering diminishing returns as complexity and long broadcast wires become design bottlenecks [13]. Focal plane image processing applications are stream-oriented, and large caches in general-purpose microprocessors are not efficiently used because every stream element is read exactly once [5].

Many different SIMD systems have been proposed [2][8][10], which offer the required I/O and computational throughput to handle image processing applications. However, their performance and generality come at the expense of I/O coupling, power consumption, and portability. As the goal of this paper is to determine performance parameters of SIMPil in future technology, only a single system is evaluated. This paper will show that SIMPil can maintain the balance in performance within the power density limits imposed by the technology scenarios projected by [11].

The rest of the paper is organized as follows. Section 2 describes the architecture of the SIMPil system being developed at Georgia Tech. Section 3 presents a table of symbols and definitions. Section 4 presents a profile of the image processing applications implemented on SIMPil and the workload characteristics. Section 5 introduces the modeling effort incorporated in a Technology Scenario Analyzer (TeSA) tool to project system parameters for different technologies using semiconductor roadmap projections. Section 6 presents results and evaluation. Conclusions are offered in Section 7.

2.0 SIMPil System Architecture

The SIMD Pixel Processor (SIMPil) is a focal plane image processing system which employs area-array I/O to access the processors directly. The SIMPil design explores the benefits of integrating an image sensor array with a high-performance multiprocessor computing plane. This monolithic integration of image sensors and digital processing elements is the key feature of the SIMPil system. In SIMPil, the image stream flows directly from the focal plane into the processing plane, retaining its spatial correlation, as depicted in Figure 1.

Figure 1: The SIMPil system. Image streams are optically focussed into the sensor array, and hence mapped onto the processing engine in a single operation.

The SIMPil architecture consists of a mesh of SIMD processors. A block diagram for a 16-bit implementation is illustrated in Figure 2. The instruction set architecture allows a single processing element (PE) to address a 4 × 4 array of image sensors. Each processor incorporates an analog to digital converter to convert light intensities, incident on the sensors, into digital values. The SAMPLE instruction simultaneously collects all sensor values and makes them available for further processing. Each processing element is a simplified RISC processor that contains the following functional units (FU):

- 16 bit ALU with adder/subtractor and barrel shifter;
- Multiply-accumulator unit with a 32 bit accumulator register;
- 16 three-ported general purpose and special registers;
- 64 words of local memory (256 maximum words);
- Communication and serial I/O units;
- Masking unit to control PE activity.

Figure 2: Block diagram of a 16-bit implementation of a SIMPil PE. Each PE is directly interfaced to a small array of image sensors. PEs are connected together via a NEWS mesh.

SIMPil PEs are connected through a NEWS network. Any entry in the register file can be used as source or destination in a communication instruction. In addition, constant data can also be received (or transmitted) serially through a specialized serial I/O unit. Data reception or transmission occurs without interrupting the normal PE operation. All instructions execute in a single cycle.

Early prototyping efforts have proved the feasibility of direct coupling of a simple processing core with a sensor device [3]. A 16 bit prototype of a SIMPil PE was designed in a 0.8 µm CMOS process and fabricated through MOSIS. The prototypes were successfully tested and run at 25 MHz. The symbolic layout of the prototype PE is shown in Figure 3. The prototype PE measures mm², and contains a total of 38,590 transistors. SIMPil functional units are specified in Table 1, in terms of silicon area and number of transistors. A single PE is estimated to consume about 44.1 mW at 5 V, running at 25 MHz, over the entire application workload.

Figure 3: Symbolic layout of a SIMPil16 prototype. The chip measures mm², and it packs about 38,590 transistors. It is fabricated in an HP 0.8 µm CMOS process and housed in a 132-pin PGA.

Table 1: SIMPil FU specifications for the 16-bit implementation.

Functional Units    Area (mm²)    Number of Transistors
MACC                              ,844
MEMORY                            ,098
REGFILE                           ,974
COMM UNIT
SERIAL I/O                        ,006
ALU                               ,620
BARREL SHIFTER                    ,118
SLEEP UNIT
DECODER
BUS DRIVER

Large arrays of SIMPil PEs can be simulated using the SIMPil Simulator [14]. This software tool is an instruction-level simulator running under Windows 95. Applications for the SIMPil system can be edited, assembled, executed, and debugged within this single integrated workbench. Metering facilities are also built into the simulator to determine the concurrency level, memory usage, and instruction histograms during execution.

3.0 Glossary

Table 2: List of symbols and their definitions.

A_eff        Effective die area
A_max        Maximum die area*
A_pad        Total pad area
A_wire       Total wiring area
α            Transistor activity factor
C_o          Output load capacitance
C_Htree      Total capacitance in H-Tree
C_w          Wiring capacitance
design       Design factor
ε_o          Permittivity in vacuum
ε_r          Dielectric permittivity*
E_i          Effective energy consumption
f_c          Operating clock frequency
f_c,sys      System clock frequency
f_c,power    Power-limited clock frequency
IPC          Instructions per cycle
I_T          Average system instruction throughput
L_pe         Dimension of PE
N_FU         Number of functional units
n_gate       Logic gates in critical path
N_tranpe     Number of transistors per PE
N_PE         Total number of PEs
η_power      Power efficiency metric
η_area       Area efficiency metric
P_clk        Power from clock distribution
P_eff        Effective total power
P_max        Maximum power dissipation*
P_pad        Power dissipated in pads
PPE          Pixels to processor ratio
Res          System image resolution in pixels
ρ_tran       Maximum transistor density*
S            Scaling factor
τ_gate       Single gate delay*
V            Minimum logic Vdd*
W_i          Workload factor
W_clk        Clock wire width
U            System utilization

*Indicated technology values obtained from [11]. Other values are derived, modeled, or calculated.

4.0 Workload Characterization

The SIMPil architecture is designed for image and video processing applications. In general, this class of applications is computationally intensive and requires high throughput to handle the massive data flow in real-time. However, these applications offer a large degree of data parallelism, which is not usually exploited by sequential image processing systems. SIMPil combines focal plane image acquisition with a SIMD execution model to exploit available data parallelism and remove the I/O bottleneck. Image frames are available simultaneously at each PE in the system, and their spatial correlation is retained.

To evaluate the set of architectural design choices implemented in the SIMPil system, the following image-processing applications have been implemented and simulated using the SIMPil16 Simulator. Details on the implementations are offered elsewhere [3][7].

Spatial filtering. The implementation performs 2D convolution-based filtering. Operations such as shadowing, edge detection, and smoothing are executed using appropriate 3 × 3 filter masks.

Discrete Fourier transform. The 2D Discrete Fourier Transform has been implemented using a matrix multiplication algorithm. The original image is transformed by rows first, then by columns. The weight matrices are preloaded into the system, and they are rearranged to support the nearest-neighbor communication scheme available on SIMPil. Fixed-point arithmetic is used to implement the algorithm.

Morphological filtering. Basic morphological operations (erosion, dilation) have been implemented using a 3 × 3 structuring element. These operations are implemented as intersection and union of shifted versions of the original image. More complex operations, such as opening, closing, inside edge detection, and skeletonization are then implemented by combining the two basic operations.

Wavelet decomposition. Discrete wavelet decomposition has been implemented for fingerprint compression and archival.
Standard Daubechies filters have been used to implement the low/high pass filters. A row-column scheme decomposes a gray-level image into 61 frequency bands.

Image rotation. A parallel rotation algorithm has been implemented to perform fast rotations of binary images. The rotation angle γ is first expressed as

$$\gamma = \alpha + n\,\frac{\pi}{2}$$

where α lies between 0 and π/2 and n is an integer. Rotations are then executed in two stages: a skew-based rotation by the angle α, and then a set of n fast ninety-degree rotations. This scheme is well suited for a SIMD implementation with regular communication patterns.

Image labeling. This implementation is based on a cluster analysis algorithm. It is used to classify objects in a binary image on the basis of object diameter. The objects are then labeled accordingly.

Quadtree region representation. This implementation operates on binary images to generate a quadtree representation. Quadtrees are based on the principle of recursive decomposition of space. The image is first decomposed into four equal-sized quadrants. If a quadrant is not uniform (entirely filled/empty), it is further decomposed into four more subquadrants. The decomposition stops when uniform quadrants are encountered, or the quadrant contains a single pixel.

Region identification. In this implementation, a small region of interest is identified using chromatic information. Several stages are executed to complete the task, including binarization, quadtree generation, region isolation, and region zooming.

Larger applications, such as JPEG image encoding and region clustering, are currently being implemented by integrating various components. The above applications were simulated and the instruction histograms were generated. As this paper focuses on the design of SIMPil PEs, scalar instructions have been excluded from the analysis. The instructions executed in each PE have been divided among the different functional units, and the results are listed in Table 3, along with the average system utilization.

Table 3: Workload characterization. Average system and functional unit utilizations are given for each application. Only instructions executed in the PE are considered to compute the utilization of each FU.

Columns: Applications; System Utilization (%); Functional Unit Utilization (%) for ALU, MACC, SHIFT, MEM, COMM, MASK, PIXEL.
Rows: IED, SKL, LBL, WLT, QTREE, SKEW, RING, SF, DFT, REGION.

IED: Inside Edge Detection
SKL: Skeletonization
LBL: Image Labeling
WLT: Wavelet Decomposition
QTREE: Quad Tree Decomposition
ROT Skew: Skew-based Rotation
ROT Ring: 90° Ring Rotation
SF: Spatial Filtering
DFT: Discrete Fourier Transform
REGION: Region Identification

This application set characterizes a typical workload for the SIMPil architecture. Two elements in particular will be considered in the architecture models discussed in the next section: the system utilization (U), and the workload factor (W_i) for a SIMPil PE. These values are averaged over the entire set of applications and are listed in Table 4. This characterization is done on a per-cycle basis because the power analysis is a rate measurement of energy consumption. SIMPil performance over the workload is detailed elsewhere [7].

Table 4: Average system utilization and workload factors for a SIMPil PE.

System Utilization (U)       71.61%
Workload Factors (W_i)
  ALU                        33.60%
  MACC                        3.43%
  SHIFT                       5.04%
  MEM                        28.46%
  COMM                       14.18%
  MASK                       14.85%
  PIXEL                       0.44%
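The recursive decomposition behind the quadtree region representation in Section 4.0 can be sketched in a few lines of sequential Python. This is only an illustration of the stopping rule (uniform quadrant or single pixel), not the SIMD implementation referred to in [3][7]; the function name `quadtree`, the nested-tuple output, and the assumption of a square image with a power-of-two side are choices made for this example.

```python
def quadtree(img, row=0, col=0, size=None):
    """Return 0 or 1 for a uniform block, else a tuple (NW, NE, SW, SE)."""
    if size is None:
        size = len(img)  # assumes a square image with a power-of-two side
    first = img[row][col]
    uniform = all(img[r][c] == first
                  for r in range(row, row + size)
                  for c in range(col, col + size))
    # Stop on a uniform quadrant; a single pixel is always uniform.
    if uniform:
        return first
    half = size // 2
    return (quadtree(img, row, col, half),
            quadtree(img, row, col + half, half),
            quadtree(img, row + half, col, half),
            quadtree(img, row + half, col + half, half))


# Hypothetical 4 x 4 binary image used only for illustration.
image = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 0, 1],
         [0, 0, 1, 1]]
tree = quadtree(image)
print(tree)
```

Three of the four top-level quadrants of the sample image are uniform, so only the lower-right quadrant is split further, down to single pixels.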

5.0 Architecture Modeling

A TEchnology Scenario Analyzer (TeSA) tool has been built to project future system performance. TeSA incorporates application characteristics, such as system utilization (U) and workload factor (W_i), with architectural and technology models. Architectural models are defined by VLSI layout information and expressed in terms of silicon area, transistor count, and total capacitance for each functional unit. Technology parameters are extracted from the semiconductor roadmap and used to define the technology scenarios.

This section presents salient features of TeSA. Power and area reduction factors are described along with capacitance calculation and technology scaling. System sizes are calculated from transistor densities. Selected performance parameters such as clock frequency, power dissipation, system pixel resolution, and sustained throughput are determined.

Clock Frequency Model

SIMPil is evaluated in terms of power efficiency (η_power) and area efficiency (η_area) metrics by considering throughput per power consumed (Mops/Joule) and throughput per silicon area consumed (Mops/s·mm²). The following equations illustrate these metrics:

$$\eta_{power} = \frac{I_T\, f_c}{P_{eff}} \qquad \eta_{area} = \frac{I_T\, f_c}{A_{eff}}$$

The efficiency metrics are functions of instruction throughput, clock frequency, and resource cost such as power and area. The system instruction throughput (I_T) is calculated from the average concurrency of the system (U), the single PE instruction throughput (IPC), and the total number of processing elements (N_PE). P_eff is the effective power, calculated from the maximum system power (P_max) reduced by the power consumed by pads and clock distribution. A_eff is the effective silicon area, calculated from the maximum die size (A_max) reduced by the area consumed for pads, bus wiring, and inter-node routing.

The system clock frequency (f_c,sys) is determined from the critical path gate depth (n_gate) and a single gate delay (τ_gate). This value does not account for the limit posed by the maximum power that a heat sink can remove from a chip. A maximum clock frequency (f_c,power) can be calculated from the maximum power density for SIMPil. f_c,power is a function of the application workload factor (W_i) and the effective energy consumption (E_i). f_c,sys and f_c,power are described by the following equations:

$$f_{c,sys} = \frac{1}{n_{gate}\,\tau_{gate}} \qquad f_{c,power} = \frac{P_{eff}}{N_{PE} \sum_{i=1}^{N_{FU}} E_i W_i}$$

In TeSA, the operating clock frequency (f_c) is set as:

$$f_c = \min\left(f_{c,sys},\; f_{c,power}\right)$$

This approach ensures that the operating clock frequency is below the upper bound set by power density limits. As a design choice for SIMPil, the clock frequency is not set to the maximum frequency possible in a given technology, but below a value set by power density.
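The clock selection rule can be sketched in Python as follows. The sketch assumes that the power-limited frequency is the effective power divided by the workload-weighted energy that all PEs consume per cycle, which is one reading of the model described above. The workload factors are the Table 4 values; the per-unit energies, gate depth, gate delay, and effective power are hypothetical placeholders, not values from the paper or the roadmap.

```python
# Workload factors W_i from Table 4, as fractions of executed PE instructions.
WORKLOAD = {"ALU": 0.3360, "MACC": 0.0343, "SHIFT": 0.0504, "MEM": 0.2846,
            "COMM": 0.1418, "MASK": 0.1485, "PIXEL": 0.0044}


def system_clock(n_gate, tau_gate):
    """Delay-limited frequency: one cycle spans n_gate gate delays."""
    return 1.0 / (n_gate * tau_gate)


def power_limited_clock(p_eff, n_pe, energy, workload):
    """Frequency at which n_pe PEs together dissipate exactly p_eff watts."""
    energy_per_cycle = sum(energy[fu] * workload[fu] for fu in workload)
    return p_eff / (n_pe * energy_per_cycle)


def operating_clock(f_sys, f_power):
    """Operating frequency: the lower of the two bounds."""
    return min(f_sys, f_power)


# Hypothetical inputs for illustration only (assumptions, not paper data).
ENERGY = {fu: 1.0e-12 for fu in WORKLOAD}           # joules per activation
f_sys = system_clock(n_gate=20, tau_gate=25e-12)     # 2 GHz delay limit
f_power = power_limited_clock(p_eff=40.0, n_pe=52_900,
                              energy=ENERGY, workload=WORKLOAD)
f_c = operating_clock(f_sys, f_power)
print(f"f_sys={f_sys:.3e} Hz, f_power={f_power:.3e} Hz, f_c={f_c:.3e} Hz")
```

With these placeholder inputs the power bound is the tighter one, so the operating clock is set by power density rather than by gate delay.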

Effective Area and Power Models

TeSA includes the effects of area and power consumption due to I/O pads and wiring interconnects. The effective area and power available for the system are calculated with the following equations.

$$A_{eff} = A_{max} - A_{pad} - A_{wire} \qquad P_{eff} = P_{max} - P_{pad} - P_{clk}$$

Area consumed by I/O pads (A_pad) is determined as a percentage of total available area (A_max). A 0.8 µm output pad area is used as a baseline, and the appropriate percentage reduction for future technology is applied. Area consumed by internal wiring (A_wire) is also calculated as a percentage reduction, with the 0.8 µm implementation as a baseline.

Power dissipated in I/O pads (P_pad) is determined as a percentage of total available power (P_max). H-SPICE simulations are used to determine power for a 0.8 µm output pad through a 64-pin PGA. This value is used as a baseline to calculate P_pad for future technologies.

Power dissipated in distributing the clock (P_clk) can be a large portion of the power budget. For the SIMPil system, an H-Tree clock distribution scheme [15] is used, as illustrated in Figure 4. The H-Tree provides a well-balanced signal propagation scheme for clock distribution. The signal paths to the next H-Tree level are equal in length. Line drivers are scaled for each level of the H-Tree proportionally to the signal path length. Total output capacitance is calculated including wire and output loads.

Figure 4: H-Tree clock distribution scheme for a multi-node SIMPil system. Scaled line drivers and wire lengths are calculated in terms of PE dimensions and system size.

The following equations illustrate the calculation of total capacitance and power dissipation for the entire H-Tree. The total capacitance, C_Htree, is the aggregate capacitance over all H-Tree levels:

$$C_{Htree} = \sum_{i=0}^{\log_4 N_{PE} - 1} \left( C_{wire,i} + C_{driver,i} \right)$$

The number of H-Tree levels is given by $\log_4 N_{PE} - 1$. For each H-Tree level i, a capacitance is calculated as the sum of two terms. The first term, C_wire,i, depends on the signal wire length and is expressed through ε_r, ε_o, W_clk, and L_pe. The second term, C_driver,i, is the total output capacitance from the line drivers and is expressed through C_o and N_PE. Power for clock distribution will therefore increase substantially with advancing technology as the number of processing elements (N_PE) increases. P_clk is given by:

$$P_{clk} = C_{Htree}\, V^2 f_c$$
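Because the clock power depends on the clock frequency while the power-limited frequency depends on the power left over after clock distribution, the two relations can be solved together. The Python sketch below does this under two stated assumptions: P_eff = P_max − P_pad − P_clk with P_clk = C_Htree · V² · f_c, and a power-limited frequency equal to P_eff divided by the energy all PEs consume per cycle. The closed form is a consequence of those assumptions rather than a formula given in the paper, and all numeric inputs are hypothetical.

```python
import math


def clock_power(c_htree, vdd, f_c):
    """P_clk = C_Htree * V^2 * f_c."""
    return c_htree * vdd ** 2 * f_c


def effective_power(p_max, p_pad, p_clk):
    """P_eff = P_max - P_pad - P_clk."""
    return p_max - p_pad - p_clk


def self_consistent_clock(p_max, p_pad, c_htree, vdd, n_pe, energy_per_cycle):
    """Solve f = (P_max - P_pad - C*V^2*f) / (N_PE * E) for f."""
    return (p_max - p_pad) / (n_pe * energy_per_cycle + c_htree * vdd ** 2)


# Hypothetical inputs for illustration only (assumptions, not paper data).
P_MAX, P_PAD = 50.0, 5.0          # watts
C_HTREE, VDD = 1.0e-9, 1.0        # farads, volts
N_PE, E_CYCLE = 52_900, 1.0e-12   # PEs, joules per PE per cycle

f_c = self_consistent_clock(P_MAX, P_PAD, C_HTREE, VDD, N_PE, E_CYCLE)
p_clk = clock_power(C_HTREE, VDD, f_c)
p_eff = effective_power(P_MAX, P_PAD, p_clk)
print(f"f_c={f_c:.3e} Hz, P_clk={p_clk:.3f} W, P_eff={p_eff:.3f} W")
```

Substituting the resulting P_eff back into the power-limited frequency reproduces the same f_c, which is a quick consistency check on the algebra.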

Capacitance Scaling and Energy Consumption Models

A SIMPil processing element is divided into the following functional units: ALU, multiply accumulate unit (MACC), barrel shifter, register file, on-chip memory, communication unit, instruction decoder, sleep unit, and bus drivers. For each unit, the load capacitance (C_o) and wire capacitance (C_w) are extracted from the implemented 0.8 µm design using the MAGIC VLSI layout tool kit and H-SPICE. TeSA adopts two different scaling methodologies for transistor load capacitance and wires to account for the different scaling properties of wire interconnect and transistor drain/gate capacitances. The following equations describe the load and wire capacitance scaling.

$$C_w' = C_w\,\frac{\varepsilon_r'}{\varepsilon_r}\,\frac{1}{S} \qquad C_o' = C_o\,\frac{1}{S}$$

In the above equations, the tick marks indicate values in future technology. The wire capacitance scales with the improvements in permittivity as well as with the reduced wire length at smaller feature sizes. Because a SIMPil processing element communicates only with its neighbors through a near-neighbor interconnection network, global communication wires are ignored. Output load capacitance scales with the feature-size scaling factor (S) [1]. Effective energy consumption during transistor switching (E_i) is calculated with the following equation.

$$E_i = \alpha \left( C_w + C_o \right) V^2$$

A transistor activity factor (α) is assumed for every functional unit. The application workload utilization (W_i) is used to determine the activity workload of each functional unit. Groups of functional units that are active during different instruction types are formed. Active functional units contribute to energy consumption during the operating cycle. For example, an ALU operation requires the ALU, register file, and bus drivers to be active. In comparison, a LOAD operation requires the memory, register file, and bus drivers to be active. In each instruction group, the energy terms E_i of each functional unit are summed, each in proportion to the activity of that unit. These sums are used to determine the operating clock frequency (f_c) described earlier.

Pixel Resolution and System Size

TeSA calculates the number of processing elements directly from a given technology's transistor density. From the effective die area (A_eff), the total number of transistors per monolithic chip is determined. This total transistor count is divided by the transistor count per processing element to determine the number of processing elements per chip (N_PE). This approach can provide a better approximation of system size than area scaling because area scaling for future technology may violate transistor density. The transistor density represents the maximum number of transistors in any given silicon area. Wiring area is considered by reducing the effective die area (A_eff) before calculation with transistor density. Pixel resolution (Res) for a future SIMPil system is calculated with a pixel to processor ratio, PPE. The following equations illustrate the models used to calculate pixel resolution and system size.

$$Res = N_{PE} \cdot PPE \qquad N_{PE} = \frac{\rho_{tran}\, A_{eff}}{N_{tranpe}}$$
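A minimal Python sketch of the scaling, energy, and system-size models is given below. It assumes the scaling factor S is the ratio of old to new feature size and that the PE count is rounded down to a whole number; both are assumptions of this example. The transistor count per PE (38,590) and the pixel to processor ratio (a 4 × 4 sensor array per PE) come from the text, while the capacitances, permittivities, transistor density, and die area are hypothetical placeholders.

```python
import math


def scale_wire_cap(c_w, eps_r_old, eps_r_new, s):
    """C_w' = C_w * (eps_r' / eps_r) / S."""
    return c_w * (eps_r_new / eps_r_old) / s


def scale_load_cap(c_o, s):
    """C_o' = C_o / S."""
    return c_o / s


def switching_energy(alpha, c_w, c_o, vdd):
    """E_i = alpha * (C_w + C_o) * V^2."""
    return alpha * (c_w + c_o) * vdd ** 2


def system_size(rho_tran, a_eff, n_tran_pe, ppe):
    """Return (N_PE, Res); N_PE is rounded down (assumption)."""
    n_pe = int((rho_tran * a_eff) // n_tran_pe)
    return n_pe, n_pe * ppe


# Hypothetical inputs for illustration only (assumptions, not paper data).
S = 800 / 50                      # 0.8 um scaled to 50 nm
c_w_new = scale_wire_cap(1.0e-12, eps_r_old=4.0, eps_r_new=2.0, s=S)
c_o_new = scale_load_cap(1.0e-12, s=S)
e_i = switching_energy(alpha=0.5, c_w=c_w_new, c_o=c_o_new, vdd=1.0)

# 38,590 transistors per PE and PPE = 16 are taken from the text.
n_pe, res = system_size(rho_tran=1.0e6, a_eff=100.0, n_tran_pe=38_590, ppe=16)
print(c_w_new, c_o_new, e_i, n_pe, res)
```

Deriving the PE count from transistor density rather than from scaled area keeps the result within the density limit of the target technology, which is the motivation given in the text.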

6.0 Results

This section presents modeling results and evaluation of the SIMPil system in future technology. Workload characterization and architecture models are combined with technology parameters to perform detailed projections of system performance and efficiency metrics. An analysis of the design space under power density limitation is also presented.

System Performance

Important metrics to evaluate the SIMPil system in future technologies include system image resolution, clock frequency, power consumption, and instruction throughput. System image resolution describes the increase in the number of processing elements due to the increasing transistor density. For a constant PPE ratio, integrating more processing elements in a single chip results in larger image resolution. Clock frequency and power consumption are interrelated and offer some insights on performance and resource utilization. Average system instruction throughput illustrates the overall performance in executing image-processing applications.

Figure 5 shows current and projected system performance metrics. The system clock rate for the SIMPil system can increase from 50 MHz at 2 Watts power consumption (800 nm) to a projected 1.8 GHz at 50 Watts power consumption (50 nm). The projected performance parameters are subject to the chosen modeling constraint, which limits clock frequency to the power density limit, in order to ensure implementation feasibility for a given technology scenario. While power consumption does not grow linearly, the increase can be quantified as a rate of 3.2 Watts per year, a rate much less than the current microprocessor power consumption growth rate of 10 Watts per year [12].

Projected image resolutions and instruction throughputs suggest an increasing trend with future technology. Projected image resolution increases from 529 pixels (23 x 23) to 850K pixels (920 x 920). Instruction throughput grows from 1.18 Gops/s to more than 70 Tops/s. While the performance growth is not linear, the increase is roughly a doubling every year. This upward trend demonstrates the suitability of the SIMPil architecture for GSI technology because, while power increases within the limits of technology, both performance and image resolution increase to handle larger computation sizes. Technology limits, such as interconnect wiring and power density, do not hinder the performance because SIMPil is an application-specific architecture with short wire interconnects. Image processing applications and algorithms map well to the SIMPil architecture and its SIMD execution model. The SIMPil design sustains higher instruction throughput with more processing elements instead of a more complex, uniprocessor system design.
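The projected 50 nm figures can be cross-checked with a short Python calculation. The sketch assumes IPC = 1 (the architecture section states that all instructions execute in a single cycle) and a pixel to processor ratio of 16 (one PE per 4 × 4 sensor array); the utilization, PE count, and clock rate are the rounded values quoted in the text.

```python
import math

U = 0.7161       # average system utilization (Table 4)
IPC = 1          # assumption: single-cycle instructions
N_PE = 52_900    # processing elements at 50 nm
PPE = 16         # assumption: 4 x 4 sensors per PE
F_C = 1.8e9      # projected clock rate in Hz

pe_side = math.isqrt(N_PE)          # PEs along one edge of a square array
pixel_side = pe_side * 4            # pixels along one edge
resolution = N_PE * PPE             # total pixels
throughput = U * IPC * N_PE * F_C   # sustained operations per second

print(f"{pe_side} x {pe_side} PEs, {pixel_side} x {pixel_side} pixels")
print(f"resolution = {resolution} pixels")
print(f"throughput = {throughput / 1e12:.1f} Tops/s")
```

The resolution comes out at 846,400 pixels, matching the 920 x 920 figure. With these rounded inputs the throughput evaluates to about 68 Tops/s, the same order as the quoted projection of more than 70 Tops/s, which presumably rests on unrounded model values.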

Figure 5: SIMPil system performance in GSI technology. System capability in image resolution (pixels), clock frequency (MHz), power consumption (W), and instruction throughput (Gop/s) is plotted against feature size (nm).

System Efficiency

Power and area efficiency metrics delineate the tradeoffs between instruction throughput and resource utilization. Increasing power efficiency suggests more capability and parallelism in the system. Increasing area efficiency implies better component utilization for given system capabilities. Higher ratings in power and area efficiency metrics are coveted for future image processing systems because of technology limitations such as wire interconnects and power density. Small processing elements with modest capability are desired. Power consumption must be contained to maintain portability, battery life, and effectiveness of the system.

Figure 6 illustrates the area and power efficiency ratings of the SIMPil system for current and future technologies. Projected ratings indicate an increasing trend for future technologies. This trend suggests the suitability of the architecture for GSI technology. The positive slope of the trend line shows improvements in both metrics for the SIMPil system. A poor system design that sacrifices area efficiency for power efficiency would have a negatively sloped trend line. This visualization method can be extended to other system implementations to determine the relationships between system efficiency metrics.

Figure 6: SIMPil system efficiency in current and future technologies. Area efficiency (Mop/s·mm²) is plotted against power efficiency (Mop/Joule) for feature sizes from 800 nm down to the projected GSI technologies.

Power Density Analysis

The previous analyses assume the same SIMPil system design for technology projections. This section presents the frequency limit imposed by power density and the impact on system design. The system clock frequency (f_c,sys) is determined by a design factor (design) and gate delay (τ_gate).

$$f_{c,sys} = \frac{\mathit{design}}{\tau_{gate}}$$

The design factor incorporates design techniques that govern the effective number of gates in the critical path. τ_gate changes with technology and is dependent on transistor feature size. With a known f_c,sys, the system power dissipation can be determined. It is therefore interesting to determine the power density limitations posed by GSI technology. The power density is expressed in terms of a maximum operating frequency (f_c,power). Beyond this frequency, the power dissipated exceeds the maximum power extractable from the chip. As a design choice for SIMPil, the design factor can be varied to determine the maximum value before f_c,sys exceeds f_c,power.

In Figure 7, the frequency at maximum power density (f_c,power) is plotted for different technologies. The shaded area is the region where the system operates within the allowed power density. The power density values that determine the shaded region are obtained from the semiconductor roadmap [11]. A family of f_c,sys clock frequency curves is plotted versus feature size for different design factors. For any given technology, a larger design factor increases f_c,sys, which raises the system power dissipation. Larger design factor values indicate more aggressive design implementations. For the power density limit to be observed at any given technology, the design factor must be chosen within the shaded area. The optimal design, for each technology, is found at the intersection between f_c,sys and f_c,power.
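The intersection condition can be sketched in Python. The sketch assumes the relation f_c,sys = design / τ_gate, so the largest admissible design factor is the one for which f_c,sys equals f_c,power. The function names and all numeric values are hypothetical and serve only to illustrate the relation; they are not taken from the paper or the roadmap.

```python
import math


def system_clock(design, tau_gate):
    """f_c,sys = design / tau_gate (assumed form of the relation)."""
    return design / tau_gate


def max_design_factor(f_c_power, tau_gate):
    """Design factor at which f_c,sys meets the power-limited frequency."""
    return f_c_power * tau_gate


def within_power_limit(design, tau_gate, f_c_power):
    """True if the resulting clock stays inside the power density limit."""
    return system_clock(design, tau_gate) <= f_c_power


# Hypothetical inputs for illustration only (assumptions, not paper data).
TAU_GATE = 10e-12      # seconds per gate delay
F_C_POWER = 2.0e9      # power-limited frequency in Hz

limit = max_design_factor(F_C_POWER, TAU_GATE)
print(f"largest admissible design factor: {limit:.3f}")
print(within_power_limit(0.01, TAU_GATE, F_C_POWER))   # conservative design
print(within_power_limit(0.05, TAU_GATE, F_C_POWER))   # too aggressive
```

A design factor below the computed limit corresponds to a point inside the shaded region of Figure 7, and a larger one to a point above it.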
In 50 nm technology, the SIMPil design can be optimized by increasing the design factor beyond the value used in the current implementation. This increase in the design factor will extend power consumption to the limit set by the semiconductor roadmap. For the SIMPil system, lower frequency and lower power consumption are desired, and the current design does not need to change. The system instruction throughput in excess of 70 Tops/s shown in Figure 5 suggests sufficient processing capability for the image processing workloads at real-time frame rates (30 frames per second). The projected power consumption of 50 Watts for a 52,900 processor system remains below the power density limits in 50 nm technology.

Figure 7: System clock frequencies and power density limit for SIMPil. Clock frequency (MHz) is plotted against feature size (nm) for a family of design factor values, together with the power limit. The area above the shaded region indicates a region of operation that exceeds projected power density limits.

7.0 Conclusions

The SIMD Pixel processor (SIMPil) has been evaluated under a realistic image processing workload, characterized by high concurrency (>70%) and well-balanced resource utilization. A single die in 50 nm technology provides a total image resolution of 850K pixels (920 x 920), with a sustained system throughput in excess of 70 Tops/s. System power consumption is contained below 50 Watts for a 52,900 processor system. Moreover, SIMPil design choices are explored, and a more aggressive design is feasible before being limited by power density in 50 nm technology. These projected performance parameters demonstrate the suitability of the SIMPil architecture for GSI technology because performance and image resolution both increase while power consumption remains within technology limits. Future research will include detailed models of wire interconnect and pad resource consumption to offer a more accurate projection.

8.0 Acknowledgements

The work was supported by the Defense Advanced Research Projects Agency (Low Power Electronics Contract: FY ), the National Science Foundation/Georgia Tech Packaging Research Center (Contract: EEC ), AFOSR and ARL. The authors extend thanks to the PICA research group, especially to Mr. Huy H. Cat, Dr. Abelardo López-Lagunas, and Mr. William H. Robinson III. The authors acknowledge the application development activity performed by Dr. José Luis Cruz-Rivera and his research group at the University of Puerto Rico in Mayagüez.

9.0 References

[1] G. Baccarani, et al., Generalized Scaling Theory, IEEE Trans. on Electron Devices, pp , April 1984
[2] K. E. Batcher, Design of the Massively Parallel Processor, IEEE Trans. on Computer, C9, v.9, pp , 1980
[3] H. H. Cat, et al., SIMPil: An OE Integrated SIMD Architecture for Focal Plane Processing Applications, Massively Parallel Processing using Optical Interconnection (MPPOI-96), pp. 44-52, 1996
[4] A. P. Chandrakasan, et al., Low-power CMOS digital design, IEEE Journal on Solid-State Circuits, 27, pp
[5] K. Diefendorff and R. Dubey, How Multimedia Workloads Will Change Processor Design, IEEE Computer, Vol. 30, No. 9, September 1997, pp
[6] E. Fossum, Digital Camera System on a Chip, IEEE Micro, pp. 8-15, May
[7] A. Gentile, et al., Real-Time Image Processing on a Focal Plane SIMD Array, to appear in Proceedings of the Seventh International Workshop on Parallel and Distributed Real-Time Systems, San Juan, Puerto Rico,
[8] W. D. Hillis, The Connection Machine, The MIT Press, 1985
[9] J. D. Meindl, Low Power Microelectronics: Retrospect and Prospect, Proceedings IEEE, Vol. 83, No. 4, pp , April
[10] J. R. Nickolls, The Design of the MasPar MP-1: A Cost-Effective Massively Parallel Computer, IEEE Digest of Papers - Compcon, pp. 25-28, 1990
[11] The National Technology Roadmap for Semiconductors, Semiconductor Industry Association,
[12] V. G. Oklobdzija, Architectural Tradeoffs for Low Power, Intl. Symp. on Computer Architecture, June
[13] S. Palacharla, et al., Complexity-Effective Superscalar Processors, Intl. Symp. on Computer Architecture, 1997, pp
[14] SIMPil Home Page,
[15] N. H. E. Weste, K. Eshraghian, Principles of CMOS VLSI System Design: A System Perspective, Addison-Wesley, Reading, Massachusetts,
[16] D. S. Wills, et al., Processing Architectures for Smart Pixel Systems, IEEE Journal of Selected Topics in Quantum Electronics, v.2 n.1, April 1996, pp
