Configuration Management in Multi-Context Reconfigurable Systems for Simultaneous Performance and Power Optimizations*


Rafael Maestre, Milagros Fernandez
Departamento de Arquitectura de Computadores y Automática, Universidad Complutense de Madrid, SPAIN
http://www.dacya.ucm.es/maestre

Fadi J. Kurdahi, Nader Bagherzadeh, Hartej Singh
Department of Electrical and Computer Engineering, University of California, Irvine, USA

Abstract

In this paper, we present a novel solution to the problem of configuration management for multi-context reconfigurable systems targeting DSP applications, its goal being to minimize both configuration latency and power consumption. We assume that this technique is applied within a larger compilation framework, which provides a scheduled task sequence of the considered application. Reconfiguration latency reduction is the first criterion to consider, and we prove that the optimal solution can be obtained in all cases. Secondly, power is optimized without affecting performance. The assumptions of the method are supported by the analysis of a mathematical model, and its effectiveness is demonstrated by some experiments.

1. Introduction

Reconfigurable computing combines a reconfigurable hardware processing unit with a programmable processor. The processor orchestrates the whole system and may also be used for computations so as to contribute to the overall throughput. The reconfigurable hardware implements the core of the computations. This kind of computing is consolidating as a viable design alternative to implement a wide range of computationally intensive applications. Its main advantage is clearly its versatility. These systems can be adapted to almost any target application, achieving an acceptable degree of performance in many cases. Most reconfigurable systems exploit their whole processing potential when executing applications with a lot of inherent parallelism. DSP and multimedia applications are some of those applications. This is why our work is focused on them.
Additionally, this kind of application has an internal structure with some features, such as regularity, that may be exploited in order to simplify the problem formulation.

The execution of such complex applications requires implementing different configurations consecutively during computation, which is called run-time reconfiguration. The term dynamic reconfiguration restricts this concept to the cases in which the reconfiguration overhead does not imply an essential computation stall. As technology has produced devices with shorter reconfiguration times, the idea of dynamic reconfiguration has become a viable possibility. We could say that dynamic reconfiguration permits the consideration of a new dimension in the design space: time multiplexing of hardware resources. Instead of increasing the number of available resources, the configurations are swapped in and out of the actual hardware as they are needed.

Dynamic reconfiguration may be implemented in two different configuration memory styles. One of them allows selective access to the reconfiguration memory in a particular region of the reconfigurable device. At the same time the rest of the device remains fully operational. The Xilinx XC6200 family is one example of this choice. In this kind of device the portion of the hardware that is being reconfigured cannot be used. Recently, multi-context architectures have appeared to facilitate and accelerate the configuration switch []. They can store a set of different contexts (configurations) in a context memory. When a new configuration is needed, it is loaded from the context memory if available. As the context memory is on-chip, this operation is much faster than the reconfiguration from an external memory. For example, the MorphoSys architecture [] has a context memory that can store 32 different contexts, and a full change of context only takes one clock cycle, compared to one single context transfer from the external memory, which takes many cycles. However, the context memory cannot store all the configurations of a specific application.
Therefore, it is obvious that reconfiguration time may be dramatically reduced, but only if the context memory is carefully managed. We address the problem for multi-context architectures with an undefined number of contexts.

* This work has been supported by Spanish Government Grant CICYT TIC 99-474.

Besides performance, the minimization of power consumption is another very important issue that should be considered, at least for portable systems. Power in CMOS integrated circuits is mainly caused by the switching activity of internal capacitances. This means that power consumption can be reduced if we find a way to decrease the switching activity as much as possible in the whole chip. Although this kind of optimization can be performed at different compilation steps, we address this problem from the configuration scheduling and allocation point of view; therefore optimizations are physically restricted to the context (configuration) memory. As typical applications are usually so demanding, performance is the first optimization factor to consider so that timing constraints are met. Additionally, the latency-optimal configuration solution minimizes the number of configuration swaps, which simultaneously produces a positive effect on power consumption. Then, the solutions with the same latency are explored to achieve further power improvements.

This work assumes that the different tasks that compose the target application have already been scheduled. This task scheduling can be obtained through the approach to scheduling in reconfigurable computing presented in [2]. It is the first step of a whole design development framework [3]. This scheduling technique provides the best task execution order, although the effect of the following tasks is only estimated. The current work is a part of the second step of this framework, and its goal is to accomplish the degree of performance that the scheduling task assumed, while optimizing power consumption.

Reconfigurable computing is an emerging area of computing that has opened many different research fields. For example, scheduling is an area that, although it has been widely investigated in other fields, needs to incorporate the characteristics of this kind of system.
As a matter of fact, most approaches are versions of existing high-level synthesis techniques, extended to consider specific features of reconfigurable systems, such as the reconfiguration time [4]-[7]. More specifically, context scheduling is a relatively new problem that has received little attention. To the best of our knowledge, [3] and [8] are the only works that explicitly tackle this issue. [8] addresses reconfiguration overhead reduction through configuration prefetch. However, its applicability is restricted to single-context architectures. In [3] a heuristic solution to the current problem for multi-context architectures is presented. Other scheduling methodologies in reconfigurable computing, like [,], do not explicitly address this issue. Finally, we will mention that we have not found any previous work about power optimizations for multi-context systems.

The paper is organized into 6 sections. Section 2 introduces the relevant issues, and provides a first approach to the problem. Section 3 analyzes the minimization of context loading, which improves both performance and power consumption. Section 4 is devoted to the allocation of memory replacements, whose goal is to achieve the lowest power consumption for the optimal performance scheduling. Then, section 5 summarizes the experimental results that have been obtained. Finally, section 6 concludes the paper.

2. Problem Overview

A typical DSP or multimedia application is composed of a sequence of macro-tasks that are repeatedly executed as a loop. We use the term kernel to refer to one of those well-defined macro-tasks []. Some examples of those applications are MPEG (video compression and decompression), JPEG (image compression) and ATR (Automatic Target Recognition), whereas DCT (Discrete Cosine Transform) or ME (Motion Estimation) may exemplify some of these kernels. Hence, in this paper we will assume that any kernel sequence is periodic. Moreover, as the kernels of those applications can be scheduled through the algorithm presented in [2], the resulting kernel scheduling constitutes the input to our problem.
In order to provide an intuitive approach to the problem, we will consider the examples shown in Fig. 1. The figure represents the status of the CM (Context Memory) by means of some CM snapshots taken just before the beginning of a kernel execution. We have represented one single inner iteration of the periodic sequence. A whole kernel execution implies the implementation of a specific sequence of contexts, which will be assumed to be in the CM during its corresponding kernel execution. As a kernel always uses the same contexts, it may be possible to keep some of them in the memory, and reuse them in the next kernel execution. We will use the term static to refer to those contexts. On the other hand, the rest of the contexts will have to be loaded before the beginning of the next execution, and they will be stored in the remaining part of the CM. These contexts will be called dynamic contexts. Thus, dynamic contexts imply loading time, whereas static contexts save loading time, and hence we should maximize the static part of the CM. For example, Fig. 1 illustrates two solutions with different static sizes, and $L_T$ (total number of context loadings) is bigger for the example with fewer static contexts (Fig. 1.b). In subsequent sections we will prove that the static part is maximized, and thus $L_T$ is minimized, when the numbers of loadings for all kernels (all $L_i$) are as equal as possible, but first we will provide a brief idea here.

Imagine one of those solutions, like the one presented in Fig. 1.a. If $L_i$ were bigger for at least one kernel, this would reduce the size of the static part, and therefore some static contexts would become dynamic and should be later reloaded. This fact would certainly increase the total number of loadings. Fig. 1.b represents a solution such that $L_i$ is substantially different for every kernel. As shown, the solution is worse. In this example the criterion to build this solution has been that the number of loadings for a kernel is proportional to its context size. Similarly, if context loadings are accomplished too soon, they will occupy CM positions, which will reduce the number of static positions. Therefore, context loadings have to be performed as late as possible (ALAP).

The minimization of context loadings optimizes reconfiguration latency, but at the same time it potentially improves the power consumption, since the total number of transitions from 0 to 1 of the internal capacitance associated to individual CM cells is kept within low values. Once the solution with the lowest latency has been generated, power can be further optimized. So far we have not considered context allocation, which will directly influence power. The number of bit changes implied by the consecutive loading of two configurations in the same CM position determines the actual power consumption, and there are many different configuration replacements for a given schedule. For example, in Fig. 1.a $K_1$ contexts can replace any of the $K_3$ or $K_4$ contexts. The exploration of the alternatives will lead to different power numbers.

Fig. 1. CM snapshots for two possible context loading schedules. Sequence of kernels: {$K_1$, $K_2$, $K_3$, $K_4$}. Sizes of contexts: {$C_1 = 14$, $C_2 = 19$, $C_3 = 23$, $C_4 = 6$}. (a) Optimal solution, with context loadings {$L_1 = 12$, $L_2 = 12$, $L_3 = 12$, $L_4 = 6$}, so $L_T = 42$. (b) Non-optimal solution, with context loadings {$L_1 = 11$, $L_2 = 15$, $L_3 = 17$, $L_4 = 4$}, so $L_T = 47$. The snapshots have been taken just at the beginning of each kernel execution, and show the content of the CM positions (static and dynamic parts). $K_i$ stands for kernel $i$. $L_i$ represents the number of loadings regarding kernel $K_i$, and $L_T$ is the total number of context loadings. $C_i$ is the number of contexts for kernel $K_i$. The triangles depict the loaded contexts.

All the previous ideas are analyzed in detail in the following sections.

3. Analysis of Context Loading Minimization

In this section we present the mathematical analysis that leads us to the optimal solution in terms of number of context loadings, which optimizes performance, besides reducing power consumption. The reasoning is based on the fact that the size of the CM ($SCM$) is a physical constraint that has to be fulfilled in all cases. It is obvious that the total number of dynamic and static contexts cannot exceed $SCM$. We suggest that the reader use Fig. 1 in order to visualize the meaning of the expressions. The dynamic part of the CM ($Dyn$) is given by the maximum number of context loadings ($L_i$):

$$Dyn = \mathrm{MAX}(L_i) \qquad (1)$$

On the other hand, the static part of the CM ($Stc$) is composed of the contexts that are not loaded every iteration ($C_i - L_i$). Thus, if $C_i$ is the number of contexts for kernel $K_i$,

$$Stc = \sum_i (C_i - L_i) \qquad (2)$$

Therefore, the total size of the context memory is:

$$SCM \ge Dyn + Stc = \mathrm{MAX}(L_i) + \sum_i (C_i - L_i) \qquad (3)$$

From (3) we can express the total number of loadings as:

$$L_T = \sum_i L_i \ge \mathrm{MAX}(L_i) + \sum_i C_i - SCM \qquad (4)$$

This expression proves the idea introduced in the previous section. If $L_i$ is bigger for a particular kernel, $K_i$, than $L_j$ for the others, this one will determine the lowest feasible value of $L_T$. Hence, if $L_i$ takes the same value for all kernels and expression (4) is fulfilled, this will be the optimal solution in terms of reconfiguration latency. It should be noted that, as the number of context loadings for a kernel cannot exceed its number of contexts (that is, $C_i \ge L_i$), all $L_i$ have to be at least as equal as possible. For example, in Fig. 1.a, as $C_4 \ge L_4$, the maximum value for $L_4$ is $C_4 = 6$. However, the rest of the kernels can fulfill the requirements, and thus the total number of loadings, $L_T$, is minimized. Thus, for the example in Fig. 1.a:

$$L_T = 42 \ge \mathrm{MAX}(L_i) + \sum_i C_i - SCM = \mathrm{MAX}(12, 12, 12, 6) + 62 - 32 = 42 \qquad (5)$$

If any $L_i$ is increased, $L_T$ will also be increased, and hence the example in Fig. 1.a is the optimal solution.

In order to obtain the best distribution, we will make the assumption that all the kernels may have equal values of $L_i$, $L_A = L_i$, which means that $\mathrm{MAX}(L_i) = L_A$ and $L_T = N \cdot L_A$, where $N$ is the total number of kernels. Consequently, from (4):

$$L_A \ge \frac{\sum_i C_i - SCM}{N - 1} \qquad (6)$$

If some kernels cannot satisfy this expression, we will choose the one with the lowest $C_i$, and its $L_i$ will take the highest possible value, as in the example of Fig. 1.a. Then, this kernel will not be considered in the next computation of expression (6). This process is repeated until a feasible solution is obtained. As the reader may imagine, the complexity of this algorithm is very low, and it can easily handle sets with a large number of kernels, generating the optimal solution in very short times.

Fig. 2. Snapshots of the CM dynamic part for a possible scheduling, showing some hypothetical allocation of contexts of 4 bits. The dotted arrow indicates the number of bit changes for a context allocation.
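The iterative procedure of this section can be illustrated with a short sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names `schedule_loadings` and `cm_usage` are hypothetical, the common value $L_A$ is rounded up to an integer, every kernel is assumed to fit in the CM ($C_i \le SCM$), and any slack left by the rounding is not redistributed among kernels. The example values (four kernels with 14, 19, 23 and 6 contexts and a CM of 32 contexts) are chosen for illustration.

```python
def schedule_loadings(contexts, scm):
    """Per-kernel loadings L_i obtained by equalising them as in expression (6).

    contexts: list of C_i (contexts per kernel); scm: size of the context memory.
    Assumption of this sketch: every C_i <= scm.
    """
    if any(c > scm for c in contexts):
        raise ValueError("a kernel needs more contexts than the CM can hold")
    if sum(contexts) <= scm:
        return [0] * len(contexts)          # everything can stay static

    loadings = [0] * len(contexts)
    free = set(range(len(contexts)))        # kernels still sharing the value L_A
    while True:
        # Expression (6), restricted to the kernels that are still equalised.
        need = sum(contexts[i] for i in free) - scm
        l_a = -(-need // (len(free) - 1))   # integer ceiling
        too_small = [i for i in free if contexts[i] < l_a]
        if not too_small:
            break
        # The kernel with the lowest C_i reloads all of its contexts
        # and is left out of the next computation.
        k = min(too_small, key=lambda i: contexts[i])
        loadings[k] = contexts[k]
        free.remove(k)
    for i in free:
        loadings[i] = l_a
    return loadings


def cm_usage(contexts, loadings):
    """Return (Dyn, Stc) as defined by expressions (1) and (2)."""
    dyn = max(loadings)
    stc = sum(c - l for c, l in zip(contexts, loadings))
    return dyn, stc


contexts = [14, 19, 23, 6]
loadings = schedule_loadings(contexts, scm=32)
print(loadings, sum(loadings), cm_usage(contexts, loadings))
# prints [12, 12, 12, 6] 42 (12, 20)
```

In this run the first evaluation of (6) gives $L_A = 10$, which the kernel with 6 contexts cannot reach, so it is fixed at 6 loadings; the second evaluation gives $L_A = 12$ for the remaining three kernels, and the dynamic and static parts together fill the 32 positions.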
4. Allocation for Power Optimization

The solution presented in section 3 minimizes the total number of loadings, and consequently not only optimizes reconfiguration latency, but also reduces the number of context replacements, which improves power consumption. At the level of abstraction we are working on, power can only be reduced by minimizing the number of bit-level transitions of the CM cells. Thus, we have to explore context allocation alternatives in order to find the configuration replacements with the lowest number of bit changes. Fig. 2 illustrates the allocation of a bit specification of dynamic contexts for a small example of three kernels and a CM with contexts of 4 bits. The only way to evaluate the power consumption that a specific replacement implies is to count the number of bit changes between contexts.

At first sight, it may seem that the number of combinations is too high, since we should obtain the bit changes between all contexts. In a solution of the type proposed in section 3, most kernels have the same number of context loadings, and these contexts fill the whole dynamic part of the CM. Therefore, it is very common that a context can only replace the contexts that belong to the previously executed kernel, which reduces the number of potential combinations. For example, in Fig. 2, $K_2$ contexts can only replace $K_1$ contexts. However, for some kernels there may be some additional possibilities. Thus, in Fig. 2 the number of $K_3$ context loadings is lower than the size of the dynamic part. This implies that the contexts of the following kernel, $K_1$, can replace either $K_2$ or $K_3$ contexts, which introduces additional choices. To generalize this idea, we could say that a given context can replace any previously executed one only if the dynamic part of the CM is not completely used between the corresponding kernel executions, like between kernels $K_2$ and $K_1$. In this case we will say that there is a direct path between contexts (which can also be applied to consecutive kernels).

Fig. 3. Evaluation of the possible CM replacements of Fig. 2 between one $K_1$ context and the contexts with a direct path, annotated with the number of bit changes between contexts.

The procedure that we propose requires computing the number of bit changes between all the contexts with a direct path. Fig. 3 shows the possible direct paths, and their bit changes, to one $K_1$ context. The number of bit changes characterizes each replacement, and can be used as a guideline to build the final allocation. In this way, if the replacements with the lowest number of bit changes are consecutively allocated, we will obtain a solution with a low total number of bit changes. This will be optimal if it is composed of all the replacements with the lowest number of bit changes. However, sometimes the optimal set of replacements cannot be obtained in this manner, because the selection of some particular replacements might prevent some future selections that may produce an overall reduction. In order to check this kind of solution we could explore some replacement allocations that produce an increase in bit changes. Then, if no improvement is obtained later, these replacements will be discarded.

Fig. 4 illustrates the exploration process. First, the replacements are numbered in ascending order of their number of bit changes. The replacements with the lowest number (that can be allocated) are evaluated up to a certain depth. For example, in Fig. 4 the first branch of the exploration tree is explored to a depth of 3, and it is composed of the replacements {1,2,5}. Note that it is assumed that the allocation of replacements 1 and 2 prevents the allocation of 3 and 4, which forces the allocation of 5. Then, we check for some solutions that might improve the result. Thus, in Fig. 4 we now consider the branch {1,3,4}, which in some cases may improve the result of the first branch. It depends on the actual numbers of bit changes. It should be noted that it is not necessary to explore any branch that begins with replacement 2, since the best possible branch would be {2,3,4}, which is certainly equal to or worse than any of the two explored.
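As an illustration of how the number of bit changes can guide the allocation, the following sketch counts the bits that differ between two contexts and allocates the cheapest replacements first. It is a simplified greedy pass without the look-ahead exploration described in this section, and it rests on stated assumptions: contexts are modelled as Python integers, and the names `bit_changes` and `greedy_allocation`, the slot numbers and the context labels are hypothetical.

```python
def bit_changes(old, new):
    """Number of CM bits that flip when context `new` overwrites context `old`."""
    return bin(old ^ new).count("1")


def greedy_allocation(resident, incoming):
    """Allocate incoming contexts to CM slots, cheapest replacement first.

    resident: dict slot -> bits of the context currently stored in that slot
              (only slots with a direct path to the incoming contexts).
    incoming: dict label -> bits of a context that has to be loaded.
    Returns (assignment label -> slot, total number of bit changes).
    """
    # Number the possible replacements in ascending order of bit changes.
    replacements = sorted(
        (bit_changes(old, new), label, slot)
        for slot, old in resident.items()
        for label, new in incoming.items()
    )
    assignment, used_slots, total = {}, set(), 0
    for delta, label, slot in replacements:
        if label in assignment or slot in used_slots:
            continue                      # this replacement can no longer be allocated
        assignment[label] = slot
        used_slots.add(slot)
        total += delta
    return assignment, total


# Two resident 4-bit contexts and two incoming ones (illustrative values).
resident = {0: 0b1010, 1: 0b0111}
incoming = {"a": 0b1011, "b": 0b0000}
print(greedy_allocation(resident, incoming))
# prints ({'a': 0, 'b': 1}, 4)
```

A pass of this kind always takes the cheapest replacement that is still available; the exploration tree of Fig. 4 adds the possibility of accepting a locally more expensive replacement when that unlocks cheaper ones later.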
All the elements of {2,3,4} are equal to or worse than those of {1,3,4}. This consideration dramatically reduces the exploration time, since the exploration of all the branches may imply a combinatorial explosion even if carried out up to a small depth. After that, the best replacement (1 in the figure) is allocated and the exploration continues in the same way.

As the exploration depth grows, the probability of finding a better solution increases. In the limit, when the depth equals the number of possible replacements, the optimal solution is always found. However, experience shows that the exploration time grows exponentially as the depth increases, which makes it unfeasible to perform the optimal exploration in medium and big size problems. The highest depth that can be handled is typically around 5. It should be noted that if the tree is visited from the top to any leaf, it is not possible to find a descending edge sequence. This fact guarantees that every solution is only generated once. For example, the replacement sequence {4,1,3} will never be generated, as it is equivalent to the sequence {1,3,4}.

Fig. 4. Exploration tree example. The exploration depth is assumed to be 3. The replacements have been numbered in ascending order of their number of bit changes, and the edge labels stand for the explored replacements.

5. Experimental Results

We have carried out a series of experiments in order to demonstrate the effectiveness of the proposed techniques. We have chosen MorphoSys [-] as the target multi-context system in order to use real system parameters. MorphoSys has a configuration memory of 32 contexts, and each one is composed of 8 words of 32 bits. Therefore, each context has a total of 256 bits. This data has been used in order to generate the experiment information as explained below.

The experiments are formed of three real applications, as well as a set of randomly generated ones. MPEG is a standard for video compression, while ATR stands for Automatic Target Recognition, and we consider its two main tasks, SLD (Second Level Detection) and FI (Final Identification). Both tasks, SLD and FI, have the same number of kernels and contexts, and therefore we present the results for only one experiment.

Table 1. Experimental results. Columns: experimental data (number of kernels and context sizes $C_i$ of each experiment); $L_T$ and number of bit changes for the proposed solution (all optimizations); exploration time for two depth values; number of bit changes with only $L_T$ optimizations; $L_T$ and number of bit changes with only allocation optimizations; number of bit changes without optimizations. $L_T$ = total number of context loadings. RI = relative increment of the number of bit changes with respect to the proposed solution when some of the optimizations are not performed.

As shown in Table 1, the proposed methodology generates similar results for both real and synthetic experiments. The numbers of contexts in the real applications (for example {8,4,,6,6,,4} for MPEG) correspond to real data; however, all the configuration patterns have been randomly generated, since we do not know their actual form. In a first approach all the bits were independently generated. This led us to a completely random structure such that any replacement implied almost the same number of bit changes. As this is not the real case, we decided to change the configuration generation procedure. In a system like MorphoSys each 32-bit context word configures the functionality of one single reconfigurable cell.
The structure of a context word is not completely random, and keeps within certain ranges. Moreover, a particular functionality of a cell may be reused by a later implemented context. The replacement of these two cell configurations would produce no power consumption. Consequently, first we randomly generated a big number of 32-bit context words within the architectural constraints, and then configurations were randomly composed of the generated context words. This enables us to generate experiments with similar features to real applications.

In order to demonstrate that the proposed scheduling technique always generates the optimal solution for context scheduling, we implemented an exhaustive search algorithm, which can explore the whole search space for medium and small size examples (lower than 7). The results for both methods match in all those cases; however, the exploration time is orders of magnitude smaller for our algorithm.

Table 1 and Fig. 5 summarize the experimental results we have obtained. In order to compare the configuration latency and power consumption improvements, we present the number of context loadings ($L_T$) and bit changes for the proposed technique, as well as for three greedy approaches that do not perform some of the proposed optimizations.

1. The first algorithm only performs context loading ($L_T$) optimizations (column 6 in Table 1 and white bars in Fig. 5). The number of bit changes is obtained when allocation is not considered. The relative increments are within -3% worse than the proposed approach.
2. The results presented in columns 7-8 and the gray bars in Fig. 5 show the number of bit changes for only allocation optimizations. In this case we assume that contexts are loaded when necessary. Now increments range from 5% to 74%. Similarly, $L_T$ increments range from 5 to 8%.

3. Finally, we present the results for the first generated solution when no optimization is considered (column 9 in Table 1 and black bars in Fig. 5). The increments in bit changes are now really big: -%. $L_T$ is the same as in point 2.

In all cases the performance and power improvements are clearly significant. We have carried out all experiments for a range of depths, and the impact on the final result is negligible. The reason is that most contexts replace just executed contexts, which does not prevent the allocation of other replacements. The replacement with the lowest number of bit changes can almost always be allocated, leading to a quasi-optimal solution. In the table we show the exploration time for two different values of depth. The algorithm is carried out in a really short exploration time for a low depth. We could say that the proposed technique provides substantial performance and power improvements in very short exploration times.

Fig. 5. Relative increment of the number of bit changes when the proposed optimizations are not performed. Bars: only loading optimization; only allocation optimizations; no optimization. Experiments on the horizontal axis: Ex1, Ex2, ATR, Ex3, Ex4, Ex5, MPEG, Ex6, Ex7, Ex8, Ex9.

6. Conclusions and Future Work

In this paper we have presented a new approach to the problem of context scheduling and allocation for multi-context reconfigurable systems targeting DSP and multimedia applications. The goals of this approach are to minimize configuration latency and power consumption. The analysis of a mathematical model enabled us to deduce an algorithm for context scheduling that generates the optimal solution in all cases. We also propose a technique to efficiently explore context allocation possibilities, which leads to an overall power improvement without affecting the achieved system performance. The effectiveness of the proposed work is analyzed in the experimental results section. The results for real applications and synthetic experiments validate our assumptions, and show the quality of the technique. Future work will explore the power improvements that can be obtained when we additionally consider configuration latency increases.

References

[1] G. Lu, H. Singh, M. Lee, N. Bagherzadeh, F. J. Kurdahi, E. M. C. Filho, "MorphoSys: An Integrated Re-configurable Architecture", Proceedings of the 5th International Euro-Par Conference, Toulouse, France, August/September 1999, pp. 77-734.
[2] R. Maestre, F. J. Kurdahi, N. Bagherzadeh, H. Singh, R. Hermida, M. Fernandez, "Kernel Scheduling in Reconfigurable Computing", DATE Proceedings, pp. 9-96, 1999.
[3] R. Maestre, M. Fernandez, R. Hermida, N. Bagherzadeh, "A Framework for Scheduling and Context Allocation in Reconfigurable Computing", International Symposium on System Synthesis Proceedings, San Jose, California, USA, pp. 34-4, November 1999.
[4] I. Ouaiss, S. Govindarajan, V. Srinivasan, M. Kaul and R. Vemuri, "An Integrated Partitioning and Synthesis System for Dynamically Reconfigurable Multi-FPGA Architectures", 5th Reconfigurable Architectures Workshop, 1998 (RAW'98).
[5] M. Vasilko and D. Ait-Boudaoud, "Architectural Synthesis Techniques for Dynamically Reconfigurable Logic", 6th International Workshop on Field-Programmable Logic and Applications, FPL '96 Proceedings, pp. 9-96.
[6] M. Vasilko and D. Ait-Boudaoud, "Scheduling for Dynamically Reconfigurable FPGAs", in Proceedings of International Workshop on Logic and Architecture Synthesis, IFIP TC WG.5, Grenoble, France, Dec. 8-9, 1995, pp. 38-336.
[7] K. M. GajjalaPurna, D. Bhatia, "Temporal Partitioning and Scheduling for Reconfigurable Computing", Proceedings of IEEE Symposium on FPGAs for Custom Computing Machines, 1998, pp. .39-33.
[8] S. Hauck, "Configuration Prefetch for Single Context Reconfigurable Coprocessors", ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 65-74, 1998.
[9] K. M. GajjalaPurna, D. Bhatia, "Temporal Partitioning and Scheduling for Reconfigurable Computing", Proceedings of IEEE Symposium on FPGAs for Custom Computing Machines, 1998, pp. .39-33.
[10] M. Kaul and R. Vemuri, "Temporal Partitioning Combined with Design Space Exploration for Latency Minimization of Run-Time Reconfigured Designs", Design, Automation and Test in Europe Proceedings, 1999 (DATE 99), pp. -9.
[11] H. Singh, M. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, T. Lang, R. Heaton and E. M. C. Filho, "MorphoSys: An Integrated Re-configurable Architecture", Proceedings of the NATO Symposium on System Concepts and Integration, Monterey, CA, April 1998.
[12] H. Singh, N. Bagherzadeh, F. J. Kurdahi, G. Lu, M. Lee, E. Chaves, R. Maestre, "MorphoSys: Case Study of a Reconfigurable Computing System Targeting Multimedia Applications", Design Automation Conference Proceedings, DAC-.