IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 39, NO. 7, JULY 2004

A Low-Power 3-D Rendering Engine With Two Texture Units and 29-Mb Embedded DRAM for 3G Multimedia Terminals

Ramchan Woo, Student Member, IEEE, Sungdae Choi, Student Member, IEEE, Ju-Ho Sohn, Student Member, IEEE, Seong-Jun Song, Student Member, IEEE, Young-Don Bae, Student Member, IEEE, and Hoi-Jun Yoo, Member, IEEE

Abstract: A low-power three-dimensional (3-D) rendering engine with two texture units and 29-Mb embedded DRAM is designed and integrated into an LSI for mobile third-generation (3G) multimedia terminals. Bilinear MIPMAP texture-mapped 3-D graphics are realized with the help of a low-power pipeline structure, datapath optimization, extensive clock gating, texture address alignment, and distributed activation of the embedded DRAM. The scalable performance reaches up to 100 Mpixels/s and 400 Mtexels/s at 50 MHz. The chip is implemented with m pure DRAM process to reduce the fabrication cost of the embedded-DRAM chip. The logic with DRAM occupies 46 mm² and consumes 140 mW at 33-MHz operation. The 3-D graphics images are successfully demonstrated using the fabricated chip on a prototype PDA board.

Index Terms: Embedded DRAM, low power, mobile application, PDA, portable, texture mapping, 3-D graphics rendering.

TABLE I. PIPELINE DESCRIPTION

I. INTRODUCTION

AS THE MOBILE electronics market matures, third-generation (3G) multimedia terminals such as PDAs and smart cellphones are gaining popularity. Their applications are already migrating to real-time multimedia, and even to three-dimensional (3-D) gaming [1]. Much research on hardware accelerators [2]-[4] and software-only solutions [1], [5] has therefore tried to bring 3-D graphics rendering to handheld devices. However, these solutions still fall short of market requirements: they offer only limited shading operations and lack texture mapping, which is mandatory for 3-D gaming applications.
In order to draw texture-mapped 3-D graphics on mobile terminals, huge memory bandwidth and capacity must be provided to store the frame, depth, and texture images. The embedded memory logic (EML) process is therefore one of the most promising solutions, since it integrates both DRAM and logic on a single die. However, EML technology is costly because the logic must be built with transistors different from those of the DRAM [11], so it has seldom been used on low-cost mobile platforms.

In this work, we designed and implemented a 3-D rendering engine using pure DRAM technology to reduce the fabrication cost while maintaining the huge memory bandwidth. Using the DRAM process also lets us further reduce the power consumption, because off-chip loading to the rendering memory is completely eliminated. We optimize the circuits and architectures so that a rendering engine with two texture units and 29-Mb embedded DRAM is realized while satisfying the requirements of long battery lifetime and the physical dimensions of mobile terminals. We also designed the rendering engine as a scalable IP core that meets the performance requirements of various mobile platforms within the allowed power budget, since the target applications range from simple avatars, user interfaces, and commercials on QCIF displays to real-time 3-D games on QVGA displays.

This paper is organized as follows. The system architecture is discussed in Section II, and the design of the low-power rendering pipeline is covered in Section III. The energy-efficient texture unit and the embedded DRAM architecture follow in Sections IV and V, respectively. After discussing the implementation results in Section VI, we summarize the work in Section VII.

Manuscript received October 28, 2003; revised January 15. The authors are with the Semiconductor System Laboratory, Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea (ural@eeinfo.kaist.ac.kr; hjyoo@ee.kaist.ac.kr). Digital Object Identifier /JSSC

Fig. 1. Example of target 3G system.
Fig. 2. Rendering engine architecture.

II. SYSTEM ARCHITECTURE

Fig. 1 shows the target 3G system, which contains a baseband modem for communication, an application processor dedicated to multimedia processing, and memories. Once 3-D objects and texture data are downloaded over the air channel, they are stored in the system memory and the graphics DRAM, respectively. Then, the rendering engine starts drawing 3-D image pixels onto the LCD screen.

The system architecture of the proposed rendering engine [6] is shown in Fig. 2. It consists of a main pixel pipeline, a post-processing unit, and 12 rendering DRAMs. The main pixel pipeline performs shading and texturing with two pixel processors, each of which contains a high-performance texture unit. After a pixel has been processed in the main pipeline, the post-processing unit recalculates the pixel data for real-time special rendering effects such as antialiasing, motion blur, and fog [7]. The 29-Mb rendering DRAMs contain frame buffers, depth buffers, and texture memories. Twelve independently controlled DRAMs reduce the power consumption, since only the necessary memories are activated.

III. LOW-POWER RENDERING PIPELINE

Fig. 3 shows the main rendering pipeline with its attached graphics memories, and Table I describes its operation. It is composed of 14 multipipelined stages so that power is saved by activating only the necessary stages. The graphics memories are accessed through distributed pipeline stages: the depth buffer at the PI stage, the texture memory at the TP2 stage, and the frame buffer at the PB stage.
Since each pipeline stage is designed as a module with its own controller, additional rendering features can easily be inserted in the next revision without modifying the entire pipeline. After fetching the instructions, the rendering engine shapes the triangle and varies the operation cycles of the following stages according to the size (HOLD#1) and the shape (HOLD#2) of the triangle by pausing the previous pipeline stages.

Shaping the triangle is accelerated in the TS stage, which performs horizontal-order (scanline-based) rasterization as in Fig. 4. Although this rasterization simplifies memory addressing and pipeline control, rendering performance can degrade when a triangle falls across DRAM pages in a conventional DRAM architecture [8], [13]. Therefore, instead of prefetching data from standard SDRAM, we redefined the timing of the graphics DRAM and assigned the frame and depth buffers in a vertical stripe pattern. Since the row of the proposed DRAM can be changed without any latency at 50-MHz random row cycle
ns and each memory (A or B in Fig. 4) has its own read/write ports, the graphics DRAM can continuously provide the bandwidth required to access two pixels together. This rasterization order also reduces the power consumption, since only the memories corresponding to the necessary pixels are activated.

Fig. 3. Main rendering pipeline.
Fig. 4. Rasterization order and frame/depth buffer assignment.

To render triangles with the modified Bresenham's incremental line-drawing algorithm [15], the positions of the input vertices must be identified, and the increments of colors and coordinates must be calculated early in the rendering pipeline, in the TS stage. The total calculation time from the register to the final multiplexer (MUX) in the TS stage is less than 20 ns, and it sets the maximum operating frequency of the rendering engine at 50 MHz.

To develop mobile 3-D graphics applications quickly, model data may be shrunk from the PC platform, where triangles are optimized for large screen resolutions, to mobile platforms, which have much smaller screen resolutions. The average number of pixels inside a triangle can therefore be smaller in mobile 3-D, which means that the setup time may become the bottleneck of pixel throughput. The setup engine is designed so that the triangle-setup cycle is always shorter than the pixel-filling cycle, even for a single-pixel triangle: one-cycle triangle setup without latency. Here, optimizing the datapath width is important in order to implement the TS stage with a small number of gates while preserving the necessary precision. In this implementation, we use 11-bit floating-point (mantissa and exponent) SIMD dividers for the datapath.
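To make the role of the dividers concrete, the following minimal Python sketch shows the kind of per-span increment setup that an incremental rasterizer needs. It is an illustration only, not the TS-stage datapath: the function names `span_increment` and `walk_span` are hypothetical, integer endpoints are assumed, and exact rational arithmetic stands in for the chip's reduced-precision floating-point format.

```python
from fractions import Fraction


def span_increment(v_start, v_end, x_start, x_end):
    """Per-pixel increment of one attribute (color, depth, texture coordinate).

    The division performed here is the operation the setup hardware
    has to provide once per attribute.
    """
    width = x_end - x_start
    if width == 0:
        return Fraction(0)  # single-pixel span: nothing to step
    return Fraction(v_end - v_start, width)


def walk_span(v_start, v_end, x_start, x_end):
    """Yield (x, value) for every pixel of the span using additions only."""
    step = span_increment(v_start, v_end, x_start, x_end)
    value = Fraction(v_start)
    for x in range(x_start, x_end + 1):
        yield x, value
        value += step
```

Once the increment is known, each further pixel costs one addition per attribute, which is why the divisions are concentrated in the setup stage and why small triangles make setup cost relatively more important.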
Although the shifters at the last stage of the floating-point look-up table (LUT) division increase the gate count by 14%, the total area of the SIMD dividers is 40% smaller than that of a 16-bit fixed-point LUT divider, since the area of the multiplier is much reduced.

In order to execute the rendering programs and to control the datapath, bit-encoded instructions are defined. Since transferring the vertices takes most of the rendering cycles, the instructions are optimized for this operation. As shown in Fig. 5, the instruction length is fixed at 128 bits so that the whole vertex information is transferred in every single rendering cycle. Colors, screen coordinates, screen depth, and homogeneous texture coordinates are therefore transferred together with the command information. These 128-bit instructions require
additional glue logic to adapt to the 32-bit geometry engine in the graphics LSI [9]. However, this also means that the rendering engine can be attached to any other geometry engine by changing only the glue logic, without touching the rendering core. The number of instructions is chosen to support a subset of OpenGL rendering operations, discarding high-level functions and buffers that are rarely used in mobile gaming applications. Additional instructions are defined to support real-time special rendering effects, to control the embedded DRAMs, and to manage the standby power.

Fig. 5. Instruction set format.
Fig. 6. Extensive clock gating.
Fig. 7. Bilinear texture filtering.

Since the rendering engine contains two pixel processors (PPs) and each PP has its own texture unit fetching 4 texels/cycle, the pixel fill rate and the texel rate are up to 100 Mpixels/s and 400 Mtexels/s at 50 MHz, respectively. The two pixel processors are simply assigned to render horizontally adjacent pixels. This makes it easy to gather texture addresses, which the energy-efficient texture unit covered in the next section exploits.

In order to eliminate the power consumption of unused blocks as much as possible, we applied extensive clock gating to the pipeline latches, as shown in Fig. 6. The rendering engine suspends the following pipeline stages by gating off the clocks in each pixel processor according to the result of the depth comparison in the PI stage. For this reason, we place the depth-compare unit in an earlier pixel stage, unlike high-performance PC graphics chipsets. This violates the OpenGL semantics, which do not allow the depth buffer to be updated until after texture mapping because textured pixels may be completely transparent. The violation can be resolved by removing those triangles in software prior to the rendering operation.
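The saving from testing depth before texturing can be illustrated with a small Python sketch. It is a behavioral model under stated assumptions, not the chip's pipeline: the names `render_pixels` and `fetch_texel` are hypothetical, a smaller depth value is assumed to be nearer, and every surviving pixel is assumed to be opaque (the case that the software pre-pass described above has to guarantee).

```python
def render_pixels(fragments, depth_buffer, fetch_texel):
    """Process (x, y, z, u, v) fragments with the depth test placed first.

    Returns (frame_writes, texture_fetches). Fragments that fail the depth
    comparison never reach the texturing step, which models gating off the
    later pipeline stages for hidden pixels.
    """
    frame_writes = {}
    texture_fetches = 0
    for x, y, z, u, v in fragments:
        # Assumption: smaller z is nearer; an empty entry means "infinitely far".
        if z >= depth_buffer.get((x, y), float("inf")):
            continue  # hidden pixel: no texture fetch, no frame-buffer write
        depth_buffer[(x, y)] = z  # depth is updated before texturing
        frame_writes[(x, y)] = fetch_texel(u, v)
        texture_fetches += 1
    return frame_writes, texture_fetches
```

In this model the number of texture fetches equals the number of visible fragments, so scenes with higher depth complexity skip a larger share of the texturing work.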
Also, the pipeline latches of the shading and texturing units can be enabled or disabled independently to avoid unnecessary datapath transitions.
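The peak rates quoted in this section follow directly from the pipeline organization. The short Python check below assumes that each pixel processor completes one pixel per clock, which the quoted figures imply but the text does not state outright; the function name `peak_rates` is hypothetical.

```python
def peak_rates(clock_hz, pixel_processors=2, texels_per_pp_per_cycle=4):
    """Return (pixels per second, texels per second) for a rendering clock.

    Assumes one pixel per pixel processor per clock cycle.
    """
    pixels_per_s = clock_hz * pixel_processors
    texels_per_s = pixels_per_s * texels_per_pp_per_cycle
    return pixels_per_s, texels_per_s


# 50 MHz gives 100 Mpixels/s and 400 Mtexels/s, matching the quoted peaks.
fill_rate, texel_rate = peak_rates(50_000_000)
```

Because both rates are proportional to the clock, lowering the clock scales them down linearly.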

Fig. 8. Address alignment logic.

IV. ENERGY-EFFICIENT TEXTURE UNIT

The texture images are mapped from the texel space to the screen space as shown in Fig. 7. During this operation, bilinear MIPMAP filtering is performed to further improve the pixel quality [12]. This filtering generates as many as eight texture memory requests when two pixels are processed together, since four texels are needed to calculate one pixel, and fetching 8 texels directly from eight texture memories may result in huge power consumption. Therefore, we propose address alignment logic (AAL) to combine the texel requests and reduce them in real time.

Fig. 8 shows the block diagram of the AAL. After the texture addresses are calculated at the TA1 stage, four bilinear addresses are generated from each pixel processor. Then, the spatial aligner (TA2_SPATIAL_ALIGN) compares the texture addresses of PP0 (PP0UV0 to PP0UV3) with those of PP1 (PP1UV0 to PP1UV3), setting the overlapped position flag (OPF) on SA0 to SA3. Next, the temporal aligner (TA2_TEMPORAL_ALIGN) compares the current texture requests (PP0UV0 to PP0UV3, PP1UV0 to PP1UV3) with the previous ones, which are stored in registers, setting the OPF on TA0 to TA7. The mask generation block (TA2_MASK_GEN) finally merges the OPF from the spatial and temporal aligners
and generates the bit-masks (SPmask, TMmask), which indicate the texel positions to be newly fetched from the texture memories. The simulation results show that the average numbers of mask bits are 5 for SPmask and 2.3 for TMmask. At the same time, the texture addresses are translated into physical addresses, which cover a 24-Mb memory space, at the TA2_ADDR_TRANSLATION block.

Fig. 9. AAL analysis results. (a) Test vectors. (b) Remaining requests after AAL. (c) Number of texture requests. (d) Number of cycles.

Although the average number of texture memories activated per cycle is reduced to 2.3 through the operation of the spatial and temporal aligners, the maximum number is still eight. Since the rendering engine is attached to four texture memories in this implementation, the bank access is scheduled by TP1_BANK_AGGREGATION in a round-robin manner. We chose four texture memories for the texturing unit because the cumulative probability of the remaining requests after AAL is about 90% within this number. The AAL also allows the number of texture memories to be reduced further for cheaper platforms if some performance can be sacrificed. When the same texture bank is accessed more than once, TP1_BANK_AGGREGATION sets TP1_MULTI to 1, extending the operation cycles. The TP2 and TP3 stages then redistribute the texel data from the four texture memories to the eight corresponding positions, feeding 4 texels per PP for bilinear texture filtering. The number of texture prefetch stages (TP1, TP2, and TP3) is optimized to 3 for this implementation, in which the latency of the texture DRAM is 1, but it can easily be scaled up for multilatency DRAM such as off-chip texture memory by inserting more pipeline latches at TP2.

Fig. 9 shows the AAL analysis results. We simulated the performance of the AAL while running the test vectors in Fig. 9(a). Fig. 9(b) shows the probability of the number of texture requests remaining after AAL.
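The effect of spatial and temporal alignment can be sketched in a few lines of Python. This is a simplified model under stated assumptions rather than the AAL circuit: texel requests are represented as (u, v) address tuples, the function names `bilinear_footprint` and `align_requests` are hypothetical, and the actual SPmask/TMmask encoding and bank scheduling are not modeled.

```python
def bilinear_footprint(u, v):
    """Four neighboring texel addresses used to filter one pixel."""
    return [(u, v), (u + 1, v), (u, v + 1), (u + 1, v + 1)]


def align_requests(pp0, pp1, previous):
    """Return the texel addresses that still need a texture-memory fetch.

    pp0, pp1: the four bilinear texel addresses of each pixel processor.
    previous: addresses requested in the previous cycle (held in registers).
    """
    # 'seen' starts with last cycle's texels (temporal reuse) and grows with
    # texels already requested in this cycle (spatial reuse between PPs).
    seen = set(previous)
    remaining = []
    for addr in list(pp0) + list(pp1):
        if addr in seen:
            continue  # already available: no new memory activation
        seen.add(addr)
        remaining.append(addr)
    return remaining
```

For two horizontally adjacent pixels whose footprints are one texel apart, `align_requests(bilinear_footprint(4, 4), bilinear_footprint(5, 4), [])` leaves six of the eight requests. If the next pixel pair continues along the same row, reuse of the previous cycle's texels removes further requests.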
Fig. 9(c) shows the number of texture requests needed to draw two pixels together. The spatial and temporal aligners reduce the requests to 2.3 on average. The number of cycles to draw two pixels is illustrated in Fig. 9(d). Although two pixels are processed together, the number of cycles increases by only 10%. Therefore, this rendering engine can draw two pixels while requiring fewer activations
of texture memories, with little cycle overhead compared with a single-PP architecture, which means that the rendering engine needs less energy to finish drawing a scene.

Fig. 10. Power/energy reduction in embedded DRAM. (a) Power consumption. (b) Energy consumption.

V. EMBEDDED DRAM ARCHITECTURE

To save the power consumption of the embedded DRAMs as well as to make optimal use of their bandwidth, we designed three different DRAM types: frame buffer, depth buffer, and texture memory. In order to satisfy the cycle and latency requirements of the rendering logic, we completely redesigned the DRAMs, without using any SRAM caches, which would consume extra power. To cover the screen resolution of most current cell phones, four frame buffers and four depth buffers with zero latency are used in the chip. In addition, four texture memories amount to 24 Mb and store MIPMAP texture images for 3-D gaming applications. These embedded DRAMs can operate at a scalable clock frequency ranging from 5 to 50 MHz to match the speed of the rendering logic, providing up to 2.4 GB/s of bandwidth over a 416-bit-wide bus. The twelve distributed DRAMs also save run-time power, since only the necessary memories are activated out of the twelve.

In this architecture, the overall power of the rendering memories per two pixels can be written in terms of three factors: the PP1 utilization, the depth-gated ratio, and the texture-access ratio. The PP1 utilization depends on the size and shape of the triangle, and it tends to decrease as the triangle gets smaller. The depth-gated ratio depends on the depth complexity, and it can be reduced by the extensive clock gating driven by the depth-comparison results. The texture-access ratio is reduced by the AAL. Based on the actual power consumption of each DRAM, measured at 33 MHz, the overall power can be illustrated as in Fig. 10(a).
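As a side check on the bandwidth figure quoted earlier in this section, the short Python sketch below recomputes the peak from the bus width and clock. Two assumptions are not stated in the text and are made only for this illustration: the full 416-bit bus transfers once per clock, and "GB" denotes 2^30 bytes. The function name `bus_bandwidth_bytes_per_s` is hypothetical.

```python
def bus_bandwidth_bytes_per_s(bus_width_bits, clock_hz):
    """Peak bytes per second for one full-width transfer per clock."""
    return bus_width_bits * clock_hz / 8


peak = bus_bandwidth_bytes_per_s(416, 50_000_000)  # 2.6e9 bytes/s
peak_gib = peak / 2**30                            # about 2.42 with binary gigabytes
```

Under these assumptions the result is consistent with the quoted 2.4 GB/s; with decimal gigabytes the same product would read 2.6 GB/s. Since the clock is scalable from 5 to 50 MHz, the available bandwidth scales by the same factor.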
More power can be saved as triangles get smaller and scenes get more complex, which can happen in gaming applications on the small LCD screens of mobile devices. For the given values of the three ratios, the power can be reduced by 65% compared with a unified memory architecture in which all memories are activated together.

Fig. 10(b) shows the normalized energy consumption until the drawing job is finished. The time required to finish the drawing follows from the total number of pixels to be drawn, and the energy consumption follows from that time and from the power consumption of the frame buffer, depth buffer, and texture memory. The distributed memory system saves more energy as 3-D applications get more complex, with a 63% reduction for the case evaluated.

The memories can also be selectively refreshed for data retention in standby modes by the power-control instructions shown in Fig. 11: PHLD (Hold), PIDL (Idle), PSLP (Sleep), and POFF (Off). PHLD can be used to hold the datapath and memory temporarily during normal rendering operations, while waiting for a geometry operation. All memories are refreshed in this mode. PIDL turns off the rendering clock but refreshes all graphics memories. In PSLP mode, only the texture memory is refreshed to hold the texture images, since they are possibly
downloaded from the wireless network. Finally, POFF turns off all operations.

TABLE II. RENDERING ENGINE FEATURES
Fig. 11. Standby power models.
Fig. 12. Die photograph.
Fig. 13. Prototype PDA board.

VI. IMPLEMENTATION

The 3-D rendering engine with embedded DRAM is integrated into the Graphics LSI, which also contains a 32-bit RISC processor and a power management unit [9], [10]. The chip is fabricated using m 256-Mb-compatible DRAM process to implement both the logic and the memory on a single chip at low fabrication cost. Fig. 12 shows the die photograph and Table II summarizes its features. It can draw 24-bit texture-mapped pixels with a maximum drawing speed of 100 Mpixels/s and 400 Mtexels/s at 50 MHz. The use of AAL with four texture memories reduces the sustained texturing performance by only about 10%. The peak drawing speed is about 50 times higher than the minimum performance requirement (2 Mpixels/s for avatar animation at 15 f/s) of PDAs and cellphones with QVGA-resolution LCD screens. Therefore, the clock speed of this rendering engine can be chosen to scale the performance and the power consumption down together, depending on the target applications and platforms. The first silicon is working successfully, and real-time 3-D graphics images are demonstrated on the prototype PDA board as shown in Fig. 13.

VII. CONCLUSION

A low-power 3-D rendering engine for 3G multimedia terminals is designed and implemented. Integrating the embedded DRAM and applying various low-power techniques such as extensive clock gating, address alignment, and distributed memories reduce its power consumption to less than 140 mW during continuous drawing of texture-mapped 3-D scenes. This scalable core with 29-Mb DRAM can operate at various frequencies up to 50 MHz to satisfy the performance and power requirements of different application processors.
The rendering engine is integrated into the Graphics LSI, fabricated with m DRAM process, and 3-D animations are successfully demonstrated on the prototype system.

REFERENCES

[1] Khronos Group, "Bringing 3-D gaming to cell phones," presented at the Game Developers Conf.
[2] R. Woo et al., "A 120-mW 3-D rendering engine with 6-Mb embedded DRAM and 3.2-Gbyte/s runtime reconfigurable bus for PDA chip," IEEE J. Solid-State Circuits, vol. 37, Oct.
[3] C.-W. Yoon et al., "A 80/20-MHz 160-mW multimedia processor integrated with embedded DRAM, MPEG-4 and 3-D rendering engine for mobile applications," IEEE J. Solid-State Circuits, vol. 36, Nov.
[4] Y.-H. Park et al., "A 7.1-GB/s low-power rendering engine in 2-D array-embedded memory logic CMOS for portable multimedia system," IEEE J. Solid-State Circuits, vol. 36, pp. 944-955, June 2001.
[5] G. K. Kolli, "3-D graphics optimizations for ARM architecture," presented at the Game Developers Conf.
[6] R. Woo et al., "A low power 3-D rendering engine with two texture units and 29 Mb embedded DRAM for 3G multimedia terminals," in Proc. Eur. Solid-State Circuits Conf. (ESSCIRC), 2003, pp. 53-56.
[7] T. Akenine-Moller et al., Real-Time Rendering, 2nd ed. Wellesley, MA: A. K. Peters.
[8] M. F. Deering et al., "FBRAM: A new form of memory optimized for 3-D graphics," in Proc. ACM SIGGRAPH, 1994, pp. 167-173.
[9] R. Woo et al., "A 210 mW graphics LSI implementing full 3-D pipeline with 264 Mtexels/s texturing for mobile multimedia applications," in IEEE ISSCC Dig. Tech. Papers, Feb. 2003.
[10] R. Woo et al., "A low-power and high-performance 2D/3D graphics accelerator for mobile multimedia applications," presented at the Hot Chips Conf.
[11] D. D. Buss, "Technology in the Internet age," in IEEE ISSCC Dig. Tech. Papers, Feb. 2002, pp. 18-21.
[12] L. Williams, "Pyramidal parametrics," in Proc. ACM SIGGRAPH, 1983, pp. 1-11.
[13] J. Montrun and H. Moreton, "nvidia GeForce4," presented at the Hot Chips Conf.
[14] OpenGL (2003). [Online]. Available:
[15] O. Lathrop and D. Kirk et al., "Accurate rendering by subpixel addressing," IEEE Comput. Graphics Applicat., pp. 45-52, Sept.

Ramchan Woo (S'00) received the B.S. (summa cum laude) and M.S. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST) in 1999 and 2001, respectively. He is currently working toward the Ph.D.
degree in electrical engineering at KAIST and is expected to graduate in Aug. As a Chief Researcher at the Semiconductor System Laboratory at KAIST, he developed the full 3-D graphics LSI for handheld devices. His research interests include low-power design of mobile multimedia systems, with specific interest in mobile 3-D computer graphics architecture and its implementation with merged-DRAM technology. He is also working on mobile graphics libraries.

Sungdae Choi (S'01) was born on March 17, 1978, in Korea. He received the B.S. and M.S. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, in 2001 and 2003, respectively, where he is currently working toward the Ph.D. degree. In 2001, he joined the Semiconductor System Laboratory (SSL) at KAIST as a Research Assistant. His research activities are related to application-specific embedded memory architecture and content-addressable memories.

Ju-Ho Sohn (S'01) was born on July 7, 1979, in Korea. He received the B.S. (summa cum laude) and M.S. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, in 2001 and 2003, respectively. He is currently working toward the Ph.D. degree in electrical engineering in the same department. His research activities are related to real-time 3-D graphics for portable systems and its implementation, especially high-performance portable multimedia processor design for 3-D vertex operations.

Seong-Jun Song (S'01) was born in Seoul, Korea. He received the B.S. degree in electrical engineering and computer science in 2001 from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, where he is currently working toward the M.S. degree. Since 2001, he has been a Research Assistant at KAIST.
His research interests include high-speed optical interface integrated circuits using submicron CMOS technology, phase-locked loops, clock and data recovery circuits for high-speed data communications, and radio-frequency CMOS integrated circuits for wireless communication applications.

Young-Don Bae (S'01) received the B.S. and M.S. degrees in electronics engineering from Chungnam National University, Daejeon, Korea, in 1997 and 1999, respectively. He is currently working toward the Ph.D. degree in the Department of Electrical Engineering and Computer Science at the Korea Advanced Institute of Science and Technology (KAIST), Daejeon. His research interests include system-on-a-chip design methodology and high-performance and low-power microprocessor design.

Hoi-Jun Yoo (M'95) graduated from the Electronic Department of Seoul National University, Seoul, Korea, in 1983 and received the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, in 1985 and 1988, respectively. His Ph.D. work concerned the fabrication process for GaAs vertical optoelectronic integrated circuits. From 1988 to 1990, he was with Bell Communications Research, Red Bank, NJ, where he invented the two-dimensional phase-locked VCSEL array, the front-surface-emitting laser, and the high-speed lateral HBT. In 1991, he became Manager of a DRAM design group at Hyundai Electronics and designed a family of fast-1 M DRAMs and synchronous DRAMs, including 256 M SDRAM. From 1995 to 1997, he was a faculty member with Kangwon National University. In 1998, he joined the faculty of the Department of Electrical Engineering at KAIST, and he currently leads a project team on RAM Processors (RAMP). In 2001, he founded a national research center, the System Integration and IP Authoring Research Center (SIPAC), funded by the Korean government to promote worldwide IP authoring and its SOC application.
Currently, he is the Project Manager for SoC in the Korea Ministry of Information and Communication. His current interests are SOC design, IP authoring, high-speed and low-power memory circuits and architectures, design of embedded memory logic, optoelectronic integrated circuits, and novel devices and circuits. He is the author of the books DRAM Design (Seoul, Korea: Hongleung, 1996; in Korean) and High Performance DRAM (Seoul, Korea: Sigma, 1999; in Korean). Dr. Yoo received the Electronic Industrial Association of Korea Award for his contribution to DRAM technology in 1994 and the Korea Semiconductor Industry Association Award in 2002.
