Frantz LOHIER 1,2, Lionel LACASSAGNE 1,2, Pr. Patrick GARDA 2

Size: px

Start display at page:

Download "Frantz LOHIER 1,2, Lionel LACASSAGNE 1,2, Pr. Patrick GARDA 2"

Paul Walker
5 years ago
Views:

1 A New Mehodology o Opimize DMA Daa Caching: Applicaion owards he Real-ime Execuion of an MRF-based Moion Deecion Algorihm on a muli-processor DSP Franz LOHER 1,2, Lionel LACASSAGNE 1,2, Pr. Parick GARDA 2 franz@lohier.com, lionel@lis.jussieu.fr, garda@lis.jussieu.fr 1 Elecronique, nformaique, Applicaions (EA) Burospace, Ba. 4, Roue de Gisy, Bièvres Cedex, France ABSTRACT A novel approach o opimise daa caching based on a muli-dimensional DMA is used o implemen a robus moion deecion algorihm. Qualiaively, we demonsrae he adequacy of his generic approach owards complex low-level image algorihms under real-ime consrains. Quaniaively, he C80 implemenaion improves by 4 former published performance for his algorihm. 1. nroducion 1.1 A Novel approach o opimize 2D /O sreams While analyzing he recen archiecural evoluions, i appears ha on he one hand : RSC and DSP cores are converging owards combined VLW/SMD archiecures ; DMA co-processors now sand as he only major archiecural difference beween he RSC and DSP world. They offer more predicabiliy a he cos of a more difficul programming sage (hey feaure selfmodifying parameers and muli-dimensions suppor as in [4]). Curren programming mehodological rends, on he oher hand, are sill limied : 2 Laboraoire des nsrumens e Sysèmes (LS) Universié Pierre e Marie Curie (UPMC), 4 place Jussieu B.C. 252, Paris Cedex 05, France DSP developmen ools are lacking whereas implemenaion is raising in complexiy (SMD/sofware pipelining, C8x s VLW operaions, advanced DMAs); DSP flexible simulaion/implemenaion plaform (such as MaLab/Simulink, Polemy or Hypercepion) are no suied for 2D processing. For hese applicaions, he 1D synchronous daa flow (SDF) represenaion domain seems he mos maure ; The code generaed from hese plaforms isn opimal and adding new processing kernels is ofen lacking a flexible re-usable framework. To ackle hese issues, we recenly inroduced a SDF oriened programming mehodology [2][3]. seeks he enhancemen of daa localiy by chaining he execuion of processing operaors (i.e nodes) hence minimising he global amoun of parallel DMA ransfers. This approach grounds on emplaes which se up a generic framework o expand node s libraries and compose complex processing chains dynamically. Templaes derive from he srucuring elemen required/produced by algorihmic operaors in he sense ha hey also gaher node s implemenaion consrains. These addiional mulipliciy consrains relae o opimizaion echniques such as he use of SMD operaion, loop

2 unrolling or sofware pipelining [1][3]. These echniques maximise he usage of processor s resources hrough he execuion of several loop ieraions in parallel. n a muli-processor conex, he ulimae goal of our approach is o se up all he synchronous DMA requess from a generic chain s descripion, enhancing daa localiy while mainaining insrucion caching low. Based on his mehodology we have implemened a complee low-level image processing library (more han 60 nodes) on he C80. This library feaures dynamic parsing of chains descripion (nodes emplae, image sizes, daa cache sizes, number of processors), auomaic DMA requess generaion and synchronisaion wih processing. The design space isn auomaically explored owards an opimal use of heerogeneous compuaional resources and communicaion channels as wih plaforms such as he Syndex generic approach [7]. nsead, pariioning on homogenous resource is user-guided bu reconfigurable a run-ime and since i relies on opimized 2D /O sreams (in erms of daa localiy enhancemen combined wih advanced opimizing implemenaion echniques), i improves he parallelizaion efficiency. 1.2 The moion deecion algorihm To illusrae he use of our library, we describe he implemenaion of he robus Markov Random Fields (MRF) based moion deecion algorihm proposed in [6]. Saring from an image difference, he CM (eraed Condiional Mode algorihm) is used o compue he resuling binary label picure in an ieraive manner. acs as a low-pass moion filer o locally minimize a spaioemporal energy model. eraions are performed a pixel level ( sie recursive ) and he suggesed number of ieraions is 4. npu(+1) AbsDiff O(+1) npu() x x 2 (Thres.) σ 2 (+1) Buffering (delay) 8 bis daa sream 1 bi daa sream Single in/ou value E(+1) sep1 (4PPs: 6ms) O() Alhough his algorihm is sub-opimal, i nowadays involves a challenging compuaional load when seeking real-ime (RT) processing on large images. Also, for his algorihm, he RT performance sand as an imporan crierion owards a good deecion efficiency (especially when fas moion consiue he sequence). A complee descripion of he approach is deailed in [2]. As a synopsis of i, a pre-possessing phase merges he absolue difference of 2 images, he variance of i and he binary hresholding of he absolue daa (he iniial labels E +1 ). This phase is followed by he CM regularizing algorihm as shown in figure The archiecural mapping The C8x has 1 general purpose RSC processor (MP) and 4 advanced SMD/VLW DSPs (or PPs for Parallel Processors) [4]. This suies he processorfarm model where he MP synchronizes he E() σ 2 () E(-1) C M (Num. er. ) E() sep2 (4PPs, 4 passes: 74ms) FG. 1 : Synopsis of he MRF algorihm 8 bis binrarised resul

3 PPs represening he bulk processing power. The pre-processing and he CM passes require jus 2 image scans. For each scan, we use an SPMD pariioning scheme. The image is spli among he 4 PPs and since each sub-region doesn fi in he PP s general purpose inernal daa memory, he daa are brough using muliple DMA requess while double buffering. 2.1 mplemening he pre-processing The pre-processing sage is done in a single pass hanks o a chain connecing several nodes hrough heir corresponding emplaes. Templaes' geomery are mosly described by 4 parameers. w h corresponds o he minimum amoun of daa ha mus be presen in inernal memory. This quaniy ofen exceeds he surface of he operaors' srucuring elemen because we inegrae size requiremens ha raise while implemening generic opimizaion echniques owards an efficien algorihmical mapping ono hardware resources. Here, he goal is o lower he developmen burden and he granulariy of each node by assuming we have a minimum amoun of daa ha maches he number of parallel ieraions processed in one occurrence of he nodes' loop kernel. Following he same goals, we also impose mulipliciy consrains on he number of addiional daa ha can be presen in inernal memory for he horizonal and verical direcion and described hanks o parameers w s and h s respecively. These emplaes allow he composing of complex processing chains which, ogeher wih he reduced granulariy of nodes, allows o enhance daa localiy and reduce parallel DMA ransfer as well as insrucion caching owards improved realime performance. Each node has a synchronized inpu and oupu emplaes whose parameers are deailed on figure 2. npu(+1) npu() Exernal buffer nernal processing chain (1) AbsDiff 2 1,1,1 6 1,2,1 16 2,8,1 Templae parameers: w h, w s, h s min. size nernal (inpu) emplae geomery sep h s (2) x (3) x 2 (4) (Thres.) nernal (oupu) emplae geomery 2,1,2,1 O(+1) (8 bis daa) E(+1) (1 bi daa) Processing node Templae geomery Nex and since he processed images do no fi in inernal memory, we need o esimae and minimise he number of all inpu and oupu parallel DMA requess beween each synchronized execuion of he chain. This is based on he exac surface of daa we can process and depends on he overall size consrains ha are propagaed along each pah of he chain ha connecs o an exernal buffer. Propagaion is based on emplae parameers and he solving of simple and independen diophanine equaions (for he ver. and horz. direcion). These equaions appear as we ry o synchronize he amoun of daa produced by a node wih he number of daa consumed by he nex node in he chain. This synchronizaion mechanism is depiced wih figure 3. Recursively, he gained synchronizaion consans (which appear has 2 congruences: β ο φ(φ) e γ ο τ( Γ)) are propagaed beween he oupu and inpu of each node unil he leading node of each pah is reached. This procedure enables he merging of he called "virualemplae" which feaures new parameers w s FG. 2 : The pre-processing chain h w

4 Synchronisaion consrains β O φ ( Φ ) 0(1) ()/(+)1 (16,8) (2,1) Abs Diff O β = β O γ = γ Thres ( w, w ) = ( w ( h, h ) = ( h s s + w + h φ, w τ, h Γ Φ ) ) γ O τ ( Γ) 1(1) β Ι/Ο, γ Ι/Ο : number of horizonal/verical emplae ha fi in he inernal caching buffer of size S-w' h' O O O w + ws β = w + ws β O O O h + hs γ = h + hs γ gahering all he size consrains (w ' h ',w s ',h s '). Then, according o he original buffer size, we esimae he combined number of horizonal (β) and verical (γ) virual emplaes we can cache in an inernal memory space of S-w 'xh ' byes. A complee descripion of his procedure can be found in [2]. As a synopsis of our approach, able 1 shows (β,γ ) couples we gain while applying his procedure on he firs sep of he moion deecion algorihm wih W=H=256 and S=2048 for all he inpu and oupu buffers. Min(β) and Min(γ) are hen used o synchronize all he nodes as well as o pariion daa among all he processors (in a SPMD fashion). The subsequen ransparen and dynamic programming of DMA requess allows o synchronize all inpu and oupu ransfers (for all he processors) owards sofware managed daa caches. TAB. 1 : Generaing & synchronizing DMA requess Pahs Virual emplae (w, w s,h,h s) AbsDiff Thres 16,8,2, AbsDiff x 16,8,1, AbsDiff x 2 16,8,1, O Thres AbsDiff 2,1,2, Mulidimensional suppor allows he DMA o address any region of ineres and is arihmeic capabiliies permis o implemen he double buffering scheme wih minimum processors involvemen. This grealy favors FG 3. Geing he virual emplae β γ insrucion cache coherency and hence, performance. Wih he combined use of our mehodology and well known generic opimizaion echniques o implemen he nodes in assembly language (sofware pipelining, SMD operaions), he preprocessing phase achieves 6 ms for a image wih 4 PPs working in parallel a 40 Mhz. 2.2 mplemening he CM node Thanks o he dynamic infrasrucure of our C80 library implemening our generic DMA caching mehodology, he sysem is reconfigured a runime o run he CM. This sep is implemened as a single-node chain. The same /O generaing/opimizing algorihm we described is used bu no deailed here. nsead, he various involved emplaes are shown in Figure 4. To lower he number of DMA requess required, some exernal buffer s daa are organized sequenially. This allows o define he CM node as diadic (feauring jus 2 inpu physical buffers). Also, here is no synchronizaion regarding he processed daa ha are shared beween 2 processors and despie he exising daa dependency inroduced by he sie recursive updaes. This simplificaion has very lile impac on he moion deecion efficiency and permis SPMD pariioning wih maximum performance. The CM node is wrien in VLW algebraic assembly language. We ook advanage of

5 Single inpu DMA reques E(-1) 8 bis binrarised resul O() 32 3,32,1 4 3,4,1 4 3,4,1 4 ier. CM 32 1,32,1 4 1,4,1 E() 4 1,4,1 E() Single oupu DMA reques FG. 4 : he CM single-node chain E(+1) he new code compacor (ppca) deailed in [5] o auomaically gaher assembly operaions ino 64 bis VLW insrucions. Regiser allocaion was also done by ppca ha efficienly compaced 313 operaions ino 199 VLW insrucions. Since he core of he node requires more regiser resources han available (each PP feaures 8 daa regisers and 2 ses of 5 address and 3 index regisers), we manually insered regiser spilling operaions based on ppca feedback logs. We inroduced he use of 3 LUTs (iniialised by he MP) o fasen calculaion whereas he sie recursive version of he CM algorihm prevened us from using he SMD capabiliies of he ALU (bu his version of he CM demonsraes faser converging). The core of he kernel requires 40 VLW insrucions per pixel which is 4 imes faser han wha PP s opimising compiler produces on a C version of he same algorihm. On images, a 40 Mhz, we measured 74 ms for he 4 CM passes which is very close o he opimisic opimum: (num. pass)/(40mhz 4(num. PP))= 65 ms. 3. Conclusions This paper inroduces wo imporan resuls. Qualiaively, we demonsrae ha our mehodology is generic enough o cope wih a complex low-level image algorihm. is also flexible enough o be dynamic and independen of he image size and number of processors. Mos imporanly, wih respec o generic opimizaion echniques, i opimises 2D /O sreams and allows good speed-up owards RT performance. Quaniaively and hanks o he framework of our general c80 image processing library, we gain a speed facor of 4 wih respec o previously published processing duraion for he same algorihm [6]. When considering he 60 Mhz version of he device, we can increase he processing rae up o 18 images per second (i/s) for images. The use of more efficien memory, like synchronous dynamic RAM (we used DRAM), would even furher increase ha rae. Pre-prop.& CM LS-UPMC/EA LS-NPG 128x128 - DSP 96002: 15 i/s Cnaps 256 PEs: 10 i/s 256x Mhz C80: 12 i/s - REFERENCES [1] F. Lohier, L. Lacassagne, Pr. P. Garda, Programming Technique for Real Time Sofware mplemenaion of Opimal Edge Deecors. Proc. of DSPWORLD 98, Spring Conference [2] F. Lohier, L. Lacassagne, Pr. P. Garda, A Generic Mehodology for he Sofware Managing of Caches in Muli-Processors DSP Archiecures. EEE n. Conference on Acousics, Speech and Signal Processing. CASSP 99. [3] Aricles [1]&[2] are online a [4] TMS320C8X: Sysem-Level Synopsis, Texas nsrumens SPRU113b. [5] Jihong Kim, Graham Shor, Performance Evaluaion of Regiser Allocaor for he Advanced DSP of TMS320C80. Proc. of CASSP 98 [6] A. Caplier, F. Luhon, C. Dumonier, Real ime mplemenaions of an MRF-based Moion Deecion Algorihm, Journal of Real Time maging [7] Y. Sorel, Massively Parallel Compuing Sysems wih RT Consrains : The Algorihm Archiecure Adequaion Mehodology, Proc. Massively Parallel Compuing Sysems schia aly, Mai 1994.

Y. Tsiatouhas. VLSI Systems and Computer Architecture Lab

Y. Tsiatouhas. VLSI Systems and Computer Architecture Lab CMOS INEGRAED CIRCUI DESIGN ECHNIQUES Universiy of Ioannina Clocking Schemes Dep. of Compuer Science and Engineering Y. siaouhas CMOS Inegraed Circui Design echniques Overview 1. Jier Skew hroughpu Laency