A Reconfigurable Accelerator for 32-Bit Arithmetic

Size: px
Start display at page:

Download "A Reconfigurable Accelerator for 32-Bit Arithmetic"

Transcription

1 A Reconfigurable Accelerator for 32-Bit Arithetic Reiner W. Hartenstein, Rainer Kress, Helut Reinig University of Kaiserslautern Erwin-Schrödinger-Straße, D Kaiserslautern, Gerny Fax: , eil: Abstract This paper introduces the MoM-3 (Map oriented Machine 3) architecture as an accelerator for applications with a large aount of 32-bit arithetic. The data nipulations in the inner loop of the application can be configured as a coplex hardware operator into special fieldprograble devices. Several coplex address generators are prograed to ipleent the control structures of the algorith, so that data is accessed copletely under hardware control, without requiring eory cycles for address coputations. The MoM-3 supports pipelining on stateent and expression level. The copiler takes C as input language and produces the configuration code for the field-prograble devices and the address generators without further user interaction. Acceleration factors in the range of 12 to 62 have been obtained, copared to a high-end Sun SPARC workstation. 1. Introduction Many coputation-intensive algoriths take too uch execution tie, even on a well-equipped odern workstation. This opens a rket for hardware accelerators of all kinds. If the rket share is big enough, ASIC-based accelerators are designed. JPEG and MPEG ige and video copression can be based on full custo circuits fro C-Cube [3] or Zoran [12], for exaple. But such ASICs are expensive and suffer fro a long developent tie. If a user doesn t require the highest perfornce possible for a specific algorith, but a speed-up for several different algoriths, he ight be better off with a kind of custo coputing chine. A lot of coercial and noncoercial solutions have already been presented, like Splash-2 [1], DEC Perle-1 [2], Virtual Coputers [4], and CHS2x4 Custo Coputer [10]. All of these are based on coercially available FPGA technology, especially the SRAM-based XILINX devices [11]. All of the currently available FPGAs are configurable at bit-level. Therefore they provide only odest perfornce and capacity when configuring 32-bit datapaths which are standard for current icrocoputers. As a consequence, ost custo coputing chines do not provide operators which tch the word length of standard high level language data types. Especially arithetic operators of a significant word length consue too uch area and suffer fro large delays, king it difficult to accelerate coputation-intensive algoriths on ost custo coputing chines. The MoM-3 design ais at the following goals: 1. progrability fro standard high level languages like C, without requiring hardware expertise, 2. acceleration of scientific algoriths that process a large aount of data. Since ny of these algoriths do a large aount of arithetic, the MoM-3 needs a coputational device that supports arithetic operations on standard data types like 8- bit, 16-bit and 32-bit integers. The device ust be reconfigurable, because different algoriths shall be accelerated by the sae MoM-3 hardware. For lack of coercially available devices that fulfil these requireents, we developed our own, the so-called rdpa [6]. The MoM-3 kes use of another characteristic of scientific algoriths, to achieve further speed-up. Since ny of these algoriths are organized in loops, which copute indices to large data arrays, the MoM-3 contains several address generators to copute the appropriate data address sequences copletely under hardware control. In the following section, this paper introduces the hardware structure of the MoM-3. Afterwards, the features of the C copiler for the MoM-3 are outlined. The fourth section explains the way algoriths are run on the MoM-3 by eans of an exaple. The final sections provide a perfornce coparison and conclude the paper. 2. The MoM-3 Hardware The MoM-3 architecture is based on the Xputer chine paradig [7]. MoM is an acrony for Map oriented Machine, because the data eory is organized in 1 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

2 Data Sequencer GAG GAG GAG ctrl ralu SW M3C eory addresses Data Meory Figure 1. Block Diagra of an Xputer data two diensions like a p. Instead of the von Neunnlike tight coupling of the instruction set to the data nipulations perfored, an Xputer shows only a loose coupling between the sequencing echanis and the ALU. That s why an Xputer efficiently supports a reconfigurable ALU (ralu). The ralu contains coplex operators which produce a nuber of results fro a nuber of input data (figure 1). All input and output data to the coplex operators is stored in so-called scan windows (SW). A scan window is a prograing odel of a sliding window, which oves across data eory under control of a data address generator. All data in the scan windows can be accessed in parallel by the ralu operator. The ralu operators are activated every tie a new set of input data is available in the scan window. This so-called data sequencing echanis is deterinistic, because the input data is addressed by Generic Address Generators (GAG). They copute a deterinistic sequence of data addresses fro a set of algorith-dependent paraeters. An Xputer Data Sequencer contains several Generic Address Generators running in parallel, to be able to efficiently cope with ultiple data sources and destinations for one set of coplex operators. In the MoM-3, the Data Sequencer is distributed across several coputational odules (C-Modules), as can be seen in figure 2. Each C-Module consists of a Generic Address Generator (GAG), an ralu subnet and at least two egabytes of local eory. All C-Modules can operate in parallel when each Generic Address Generator accesses data in its local eory only. The MoM-3 includes up to seven C-Modules. Apart fro the local eory access, two eans of global counication are available. First, the ralu subnets can exchange internal results with their neighbours by use of the ralu interconnect, without disturbing parallel activity. Second, the Generic Address Generators can access data eory and ralu subnets on other C-Modules using the global MoMbus. This can be done only sequentially, of course. The global MoMbus is used by the MoM-3 controller (M3C) as well, to reconfigure the address generators and the ralu whenever necessary. The MoM-3 controller is the coordinating instance of the Data Sequencer, which ensures a well-defined parallel activity of the Generic Address Generators by selecting the appropriate paraeter Host CPU VMEbus MoM-3 MoMbus <-> VMEbus Interface e. MoMbus Figure 2. The MoM-3 Hardware sets for configuration. Via the MoMbus-VMEbus interface, the host CPU has access to all eory areas of the MoM-3 and vice versa. The Generic Address Generators have DMA capability to the host s in eory, reducing the tie to download a part of an application for execution on the MoM-3. The host CPU is responsible for all disk I/O, user interaction and eory allocation, so that the MoM-3 copletely uses the functionality of the host s operating syste. The printed circuit boards with the MoM-3 controller and the C-Modules reside in their own 19 inch rack with the MoMbus backplane. The only connection to the host coputer is through the bus interface. That way, the MoM-3 could be interfaced to a different host with inil effort. The part of the bus interface that is injected into the host s backplane would have to be redesigned according to the host s bus specifications. All other parts could rein the sae. 2.1 The Data Sequencer GAG C-Module 1 A D e. ralu ralu interconnect ralu interconnect ralu C-Module interconnect 7 A D GAG e. ralu The MoM-3 Data Sequencer consists of up to seven Generic Address Generators and the MoM-3 controller. The Generic Address Generators operate in a two-stage pipeline. The first stage coputes handle positions for the scan windows (Handle Position Generator). Each Handle Position Generator controls one scan window. A handle position consists of two 16-bit values for the two diensions of the data eory. The sequence of handle positions describes how the corresponding scan window is oved across the data eory (figure 3). Such a sequence of handle positions is called scan pattern. A Handle Position Generator can produce a scan pattern corresponding to four nested loops at the ost. It is prograed by specifying a set of paraeters, such like starting position, 2 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

3 Generic Address Generator Handle Position Generator Control Logic handle position sequence Meory Address Generator eory address sequence = handle position read 1,1 read -1,0 read 0,1 write 0,0 write 0,1 = no access -1,0: r Figure 3. Generic Address Generator 0,1: r/w 0,0: w 1,1: r increent value, and end position of a loop, each both for the x and y diension of the data eory. The second pipeline stage coputes a sequence of offsets to the handle positions, to obtain the effective eory addresses for the data transfers between the scan window / ralu and the data eory. Therefore this stage is called Meory Address Generator. The range of offsets y be 32 to +31 in both diensions of the data eory. The sequence of offsets y be prograed arbitrarily, but at ost 250 references to the data eory can be de fro one handle position. The offset sequence y be varied at the beginning and at the end of the loops of the Handle Address Generator. This allows to perfor extra eory accesses to fill a pipeline in the ralu at the beginning of a loop, as well as additional write operations to flush a pipeline at the end. The address parts of the two diensions y cobined to a real linear eory address in four ways: one row of two-diensional eory y consue an address space of 10, 12, 14, or 16 bits. This allows to adjust the size of the data eory to the size of the processed data, to reduce wastage of address space. The software environent contains a eory nageent odule, which kes use of the unused area, fro the end of a data row to the next appropriate power-of-two boundary. If the algorith uses another array of data, which fits into the reining space, such two data arrays are placed in eory, one after the other. After the concatenation of the two address parts to a linear address, a 32-bit base address is added, to obtain the effective eory address. The base address typically is the starting address of the data array referenced by this Generic Address Generator, as deterined by the eory nageent software. The Meory Address Generator itself is a three stage pipeline. The first stage perfors the offset calculations and the cobination of address parts to a linear address. The second stage adds the 32-bit base address, and the third stage handles the bus protocol for data transfers. The Generic Address Generators run in parallel. They synchronize their scan patterns through the activations of the ralu. All scan patterns proceed until either they detect an explicit synchronization point in the offset sequence of the Meory Address Generator, or they are blocked in a write operation to eory, waiting for the ralu to copute the desired result. The hardware controlled generation of long address sequences allows to access data in eory every 120 ns, using 70 ns eory devices and a conventional, noninterleaved eory organization. A further speed-up of eory accesses could be obtained by interleaved eory banks and by the introduction of static RAM caches, like in conventional coputers. With caches, it would ke sense to have a prograble cache controller in the MoM-3. Since the hardware generated address sequences are deterinistic, a copiler could copute a cache update strategy at copile tie, which would result in the iniu nuber of cache isses. This should provide better perfornce than the probabilistic cache update strategies usually applied. 2.2 Reconfigurable ALU The coputations in each ralu subnet of the MoM-3 are perfored by an rdpa array (reconfigurable datapath architecture). The rdpa array consists of eight rdpa chips in a two by four organization. Each rdpa chip contains an array of sixteen (four by four) datapath units (DPU). Each datapath unit can ipleent any unary or binary operator fro C language on integer or fixed-point data types up to 32 bits length. Multiplication and division are perfored by eans of a icroprogra, whereas all other operators can be executed directly. Additionally, operators like ultiplication with accuulation are available in the library of configurable operators. Each datapath unit has two input registers, one at the north and one at the west, and two output registers, one at the south and one at the east (figure 4). All registers can be used for interediate results throughout the coputation. The registers in the DPUs in fact are a distributed ipleentation of the scan window of the Xputer paradig. The datapath units can read 32-bit interediate results fro their neighbours to the west and to the north and pass 32-bit results to their neighbours in south and/or east direction. A global I/O bus allows input and output data to be written directly to a DPU without passing the fro neighbour to neighbour. The DPUs operate data-driven. The operator is applied each tie, new input data is availa- 3 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

4 west I/O bus register ALU register north the ralu controller, supports pipelining on operator level (e.g. floating point operations are ipleented as a pipeline across several datapath units), pipeline chaining [8], and pipelining of loop bodies, as shown in the exaple in section 4.2. That way, the coplex operators of subsequent loop iterations are coputed in parallel in a pipeline. Each of the loop iterations is finished to a different degree, depending on its progress in the pipeline. The whole rdpa array is configured across one serial link of the DPU in the upper left corner of the array. Each configuration word is preceeded by a two-diensional DPU address in relative fort, and the address for the icroprogra storage. As long as both DPU address coponents are different fro zero, the configuration word is passed on to a neighbouring DPU, with the appropriate DPU address coponent decreented by one. If the local interconnect to the east is busy, a configuration word y be routed to the south first (if the DPU address coponent in vertical direction is different fro zero), which results in an autotic load balancing for the slow serial links. The relative address fort allows to use the sae configuration data for rdpa arrays of arbitrary size. Configuration data words are rked by a special configuration bit set by the MoM-3 controller. That way the rdpa array y be reconfigured partially during runtie, while other areas of the array are still busy with coputations. Config- a) b) op icrocontrol prograble FIFO global I/O bus status configuration MoMbus to GAGs, Host, Main Meory rdpa address generation unit register file rdpa control unit rdpa register register op op op op op op south east Figure 4. Datapath unit: a) sybol; b) block diagra ble either fro the bus or fro a neighbouring DPU. This decentralized control siplifies pipelining of operations, when each operation takes a different tie to execute. The array of DPUs expands across rdpa chip boundaries in a way that is copletely transparent to the copiler software. To overcoe pinout restrictions, the neighbour to neighbour connections are reduced to serial links with appropriate converters at the chip boundaries (figure 5). Although pipelining drastically reduces the effect of the conversion fro 32 bits to 2 bits, this still y turn out a bottle neck in the current rdpa prototype. With state of the art IC technology and packages larger than 144 pins, we could build an rdpa with 66 MHz serial links (instead of 33 MHz) and four bits wide counications channels. This fourfold speed-up would overcoe the shortfalls of the current prototype. In addition to the rdpa array, a MoM-3 ralu subnet (figure 6) contains an ralu controller circuit which interfaces to the MoMbus. It provides a register file as a kind of cache for frequently used data, and controls the data transfers on the global I/O bus. The rdpa, in cobination with parallel to serial converter I/O bus chip 1 serial op op op link serial to parallel converter PCB chip 2 Figure 5. Expansion of nearest neighbour interconnect across chip boundaries op op op op op op op op op op op op op op op op op op Figure 6. One subnet of the reconfigurable datadriven ALU 4 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

5 uration data is passed along with higher priority than coputational data, each tie before a datapath unit starts with the coputations for the next application of its operator. That way, both configuration and coputation are slowed down, if part of the rdpa array is busy, but both are sure to continue without deadlocks. For additional details on the rdpa see [6]. Although this publication refers to an older version of the rdpa, the in structure still is the sae, only the ipleentation has iproved. 3. MoM-3 C Copiler The C copiler for the MoM-3 takes an alost coplete subset of ANSI C as input. Only constructs, which would require a dynaic eory nageent to be run on the MoM-3 are excluded. These are pointers, operating syste calls and recursive functions. Since the host s operating syste takes care of eory nageent and I/O, the software parts written for execution on the MoM-3 do not need such constructs. There are no extensions to C language or copiler directives required to produce configuration code for the MoM-3. The copiler coputes the paraeter sets for the Generic Address Generators, the configuration code for the rdpa arrays, and the reconfiguration instructions for the MoM-3 controller, without further user interaction. First, the copiler perfors a data and control flow analysis. The data structure obtained allows restructurations to perfor parallelizations like those done by copilers for supercoputers. These include vectorization, loop unrolling, and parallelizations on expression level. The next step perfors a re-ordering of data accesses to obtain access sequences, which can be pped well to the paraeters of the Generic Address Generators. Therefore, the copiler generates a so-called data p, which describes the way the input data has to be distributed in the data eory to obtain optiized hardware generated address sequences. After a final autotic partitioning, data nipulations are translated to a ralu description, and the control structures of the algorith are transfored to Data Sequencer code. An assebler for the Data Sequencer translates the Data Sequencer code to paraeter sets for the Generic Address Generators and a reconfiguration schee for the MoM-3 controller. The ralu description is parsed by the ALE-X assebler (arithetic and logic expression language for Xputers). It generates a pping of operators to DPUs, erges DPU operators where possible, and coputes a conflictfree I/O schedule, which tches the operators speed, to keep the DPUs as busy as possible. The separation of the ralu code generation fro the rest of the copiler allows to use other devices than the rdpa to build a MoM-3 ralu, without having to rewrite the C copiler with all its optiizations. For a new ralu, only an assebler generating ralu configuration code fro ALE-X specifications has to be written. This is not too difficult, because ALE-X describes only arithetic and logic expressions on the scan windows contents. A ore detailed description of the MoM-3 prograing environent can be found in [13]. The ost iportant benefit of the MoM-3 C copiler copared to developent environents of other custo coputing chines is the fully autotic code generation. The prograer neither has to be a hardware expert, to guide hardware synthesis with appropriate constraints or copiler directives, nor has he to interfere with different stages of the copilation to get reasonable results. 4. Exaple: two diensional FIR filter The operation of the MoM-3 and the way an algorith is adapted to the non-von Neunn architecture of the MoM-3 is illustrated with a two-diensional FIR filter. The two diensional FIR (finite ipulse response) filter is one whose ipulse response processes only a finite nuber of nonzero saples. The equation for a general two diensional FIR filter is y n = k ij x n i, j. (1) i j 4.1 Straightforward Ipleentation In this exaple a two diensional filtering of second order is shown, that is indices i and j have a range of zero to two. Given the order of a filter and its weights k ij, the ost efficient ipleentation on a von Neunn coputer is to unroll the two loops of the filter kernel, to create a large expression coputing the new value of y n (see figure 7). The weights k ij now are constants, so that the copiler can optiize the ultiplications by replacing the with a tching sequence of additions and shift operations. The sae source code produces an efficient ipleentation on the MoM-3 as well. The C copiler vectorizes the expression and perfors loop unrolling to the extend of the capacity of the rdpa array. With ultiplication and accuulation being a valid operator, the expression to copute y[n][] takes three by four DPUs. The ultiplications and sution of the rows are done in a three by three array of ultipliers and ultiplier-accuulators. A colun of three routers/adders cobines the sus of rows to the coplete result (figure 8). Taking the chip boundaries into account (with the costly parallel-to-serial converters) eight of these expressions can be coputed in parallel, each in its own rdpa chip. The configuration for the whole rdpa array siply 5 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

6 For (n = 0; n < xn; n++) { For ( = 0; < x; ++) { y[n][] = k00*x[n][] + k01*x[n][+1] + k02*x[n][+2] + k10*x[n+1][] + k11*x[n+1][+1] + k12*x[n+1][+2] + k20*x[n+2][] + k21*x[n+2][+1] + k22*x[n+2][+2]; } } Figure 7. Source code of a 3 * 3 two-diensional FIR filter next handle position scan window rdpa array Figure 9. Scan window and ralu configuration of a vectorized 3 * 3 2-D FIR filter is a concatenation of eight of these operators in a two by four array, leaving always a spare row to adjust to the chip boundaries. The scan window de of the DPU registers now consists of ten words in x direction and three words in y direction, instead of the three by three words of a single filter operator. During vectorization, the copiler perfors a partial unrolling of the inner loop, so that eight consecutive loop iterations are executed in parallel, and the loop index x is increented by eight. The nuber of vectorizable loop iterations is taken fro the capacity of the rdpa array, which can ipleent eight 3 * 3 FIR filter operators in parallel. Furtherore, the C copiler knows fro the hardware configuration file the nuber of available C- Modules and splits the input array into a corresponding nuber of stripes. The required overlap of two can be coputed fro the data dependency analysis. That way all C-Modules y operate in parallel to speed-up coputation by a factor of seven, at the ost. The ALE-X assebler detects, that ost of the input values to the eight parallel expressions are used several ties in different ultiplications, and stores these values in the register file for quick access. Furtherore, the two input values in x direction, that overlap with the next loop iteration are read fro the register file as long as the pipeline is full throughout the inner loop. The copiler generates a Meory Address Generator configuration that first fills the pipeline by reading ten coluns of three input values. Within the inner loop, only eight coluns have to be read fro eory. The overlapping two coluns are taken fro the register file. The only rearrangeents to the data are the distribution of the stripes onto the C-Modules. Therefore the data p is quite straightforward. The input values are stored row by row, each strip of rows in a separate C-Module eory. All Generic Address Generators copute the sae address sequence (figure 10). Scanning the eory with a three by ten window row by row, at each scan window position ten input values are read fro eory and eight values are written as results. All values required at overlapping scan window positions are fetched fro the regis- GAG 0: 148 * 160 * (24 r + 8 w) A D ralu 0: 8 * 3 by 3 2-D FIR a a a = ultiply by k ij = ultiply & accuulate = accuulate sus of rows GAG 6: 148 * 160 * (24 r + 8 w) A D ralu 6: 8 * 3 by 3 2-D FIR Figure 8. ralu operator for a single 3 * 3 2-D FIR filter operation Figure 10.Copiler generated scan pattern for a 3 * 3 2-D FIR filter 6 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

7 ter file for subsequent accesses. The row by row scan pattern for the scan windows exactly tches the two nested for-loops of the algorith in figure 7, with the loop index of the outer loop being increented by eight. Knowing this structure of the copiler output, a perfornce estition is quite siple. Because eight filter operations are perfored in parallel, the coputation tie for the ultiplication is not the bottleneck, but the eory and register file access ties to provide the large nuber of input data. For each group of eight result values, 3 * 8 = 24 inputs have to be fetched fro eory, each eory access taking 120 ns. Two coluns of inputs are reused in the next loop iteration and can be fetched fro the register file in 60 ns each. Two of the ten input coluns are used in the coputation of two results, requiring 2 * 3 = 6 additional register file accesses. Six input coluns are used, each in three result coputations, totalling in another 6 * 3 * 2 = 36 register file accesses. The total tie to copute eight result values and store the in eory is 24 * 120 ns + 8 * 120 ns + 6 * 60 ns + 6 * 60 ns + 36 * 60 ns = 6720 ns. A single row includes = 1278 positions. Eight positions are coputed in parallel, resulting in 1278 / 8 = 160 iterations of the inner loop. One stripe takes ( * 2) / 7 = 148 rows, which is equal to the nuber of iterations of the outer loop. Using seven C-Modules in parallel, it takes 148 * 160 * 6720 ns = s to filter a 1280 by 1024 pixel ige. The filter weights y be chosen arbitrarily, because the execution tie is bound by eory I/O speed. On the MoM-3, the result would be coputed in the sae tie, regardless whether a pixel is 8 bits wide or 32 bits, because even 32- bit ultiplications would be faster than eory access tie in this exaple. The corresponding perfornce figures for von Neunn coputers in table 1 were based on the sae C source. All variables were declared as registers, the ones with the highest reference count first. Using GNU gcc copiler, all optiizations were turned on, including the exact specification of the icroprocessor type. 4.2 Manually Optiized Ipleentation bit bit Figure 11.Data p for an optiized 2-D FIR filter Fro the algorith description in figure 7, the MoM-3 C copiler cannot ke use of all optiization possibilities. If the input data is only 8 bits wide (a typical case for ost grayscale iges), four values could be transferred in a single 32 bit eory access, reducing I/O tie by a factor of four. The data p for a packed representation of a 1280 by 1024 pixel ige, using eight bits per pixel, can be seen in figure 11. Instead of using the register file, values which are required in subsequent loop iterations could be passed to the next DPU using the neighbour to neighbour interconnect. Since these transfers can be done in parallel, they are uch faster than register file accesses. Furtherore, if the eight FIR filter operators would copute the results for eight rows in parallel, instead of eight results in a single row, the neighbour to neighbour connections could be utilized to a far larger extend. Instead of six overlapping positions in the next loop iteration, twenty positions would overlap, so that fewer inputs would have to be read fro eory and ore inputs would be passed on fro DPU to DPU. The straightforward copiler ipleentation doesn t ke use of the fact, that the next row of results requires to read again ost of the inputs, which have been read during the coputation of the previous row of results. Coputing eight rows of results in parallel, the reused inputs can be fetched fro the register file instead of the eory. The ralu configuration and the scan window corresponding to these nual optiizations of the filtering algorith can be found in figure 12. In order to achieve such optiizations, nual changes to the input source are required. The passing of the input values to the next DPU for the following loop iteration has to be specified explicitly. Since the copiler perfors a vectorization, proceeding fro the inner loops to the outer loops, the parallelization of the row coputations has to 7 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

8 next handle position GAG 0: 19 * 320 * (10 r + 8 w) A D ralu 0: 8 * 3 by 3 2-D FIR GAG 6: 19 * 320 * (10 r + 8 w) A D ralu 6: 8 * 3 by 3 2-D FIR scan window rdpa array Figure 12.Manually optiized scan window and ralu configuration be done explicitly. Packed transfers of four 8-bit inputs in one 32-bit eory access cannot be specified solely in the C source. The unpacking of 32 bit values can be done in a single DPU, which accepts a 32 bit value and four ties passes an eight bit value to the next DPU for coputation. But this has to be done in the ALE-X assebler source. A description of that operation sequence in C would not pass unchanged through the copiler s transfortion and optiizations steps, so that the ALE-X assebler would not recognize an unpack operator in the resulting description and erge it to a single DPU. An ralu operator (figure 13) to copute a single result value consists of three unpacking DPUs ( u ), a three by three array of ultiply-accuulate DPUs ( and ), and three DPUs to accuulate the sus of rows ( a ). The last of these DPUs perfors the packing of four result values to a 32 bit word as well ( ap ). The partial sus of rows are transferred fro west to east to the next DPU. The operation is pipelined on expression u u u Figure 13.rALU operator for a 3 by 3 filter operation (nually optiized) a ap u a ap = unpack 32 bit = ultiply by k ij = ultiply & accuulate = accuulate sus of rows = accuulate & pack Figure 14.Scan Pattern for a 2-D FIR Filter level, because each ultiply-accuulate DPU accepts the next input operand (of the subsequent loop iteration) after it has calculated its partial su and passed the result to the next DPU to the east. That way, the whole rdpa array can be kept busy, with filter operators of consecutive loop iterations finished to different degrees in the pipeline. One filter operator fits into a single rdpa chip, so that the parallel-to-serial converters do not ipose a delay on the operation. Eight of these operators fit into the rdpa array of a single C-Module. The scan window oves one position to the right in the inner loop, but eight positions downwards in the outer loop, because eight rows are processed in parallel (figure 14). That way the overlapping inputs in each row can be transferred in parallel to the next DPU to the south, where they are required as inputs for the following iteration. The overlapping inputs of the adjacent rows of the eight operators are fetched fro eory only once and then taken fro the register file in half of the tie. One colun of eight result values requires the following tie for I/O: ten values have to be read fro eory, and eight written back. Furtherore, two input values are fetched a second tie fro the register file and six values are fetched twice fro the register file for reuse in another filter operator. The total I/O tie is 3000 / 4 = 750 ns. The division by four stes fro the fact, that every eory or register access transfers the values for four iterations. One pipeline stage of the ultiplication and accuulation takes 420 ns in the average case, because the ultiplications with constants are resolved to shift and add operations. The parallel-to-serial converters, taking 600 ns, do not account, because each filter operator is local to one rdpa chip. Although optiized, this ipleentation is still I/O bound. The optiized algorith takes 148 / 8 = 19 iterations of the outer loop. The inner loop contains = 1278 iterations, each producing a colun of eight results. 8 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

9 For (n = 0; n < xn; n++) { x00=x[n][0]; x01=x[n][1]; x10=x[n+1][0]; x11=x[n+1][1]; x20=x[n+2][0]; x21=x[n+2][1]; For ( = 0; < x; += 3) { x02 = x[n][+2]; x12 = x[n+1][+2]; x22 = x[n+2][+2]; y[n][] = k00*x00+k01*x01+k02*x02 + k10*x10+k11*x11+k12*x12 + k20*x20+k21*x21+k22*x22; x00 = x[n][+3]; x10 = x[n+1][+3]; x20 = x[n+2][+3]; y[n][+1] = k00*x01+k01*x02+k02*x00 + k10*x11+k11*x12+k12*x10 + k20*x21+k21*x22+k22*x20; x01 = x[n][+4]; x11 = x[n+1][+4]; x21 = x[n+2][+4]; y[n][+2] = k00*x02+k01*x00+k02*x01 + k10*x12+k11*x10+k12*x11 + k20*x22+k21*x20+k22*x21; } } Figure 15.Manually Optiized C Source Code The total tie to filter a 1280 by bit grayscale ige is 19 * 1278 * 750 ns = 0,018 s. For a fair coparison, the C source for the von-neunn coputers is nually optiized as well. Instead of copying the input values to the next position, three iterations of the inner loop are unrolled and a cyclic buffering schee is applied. This eliinates the copy tie for the icroprocessors, because they cannot rely on a register to register interconnect to do the copy operations in parallel. The resulting source code can be seen in figure 15. The x array is declared as unsigned character of course, to allow the copiler to do the sae optiizations on ultiplications as on the MoM-3. The xnm variables and the loop counters are declared as register variables, where the order of declarations takes into account, how often each variable is referenced. In this source code, there is no explicit unrolling of the outer loop like in the optiized MoM-3 source, because a standard icroprocessor cannot process ultiple FIR filter operators in parallel. Table 1 reveals, that the nual optiizations iprove perfornce on the conventional coputers by ore than a factor of two. 5. Perfornce Evaluation The perfornce figures for soe exaple algoriths are given in table 1. The first colun lists the algoriths. MergeSort perfors a sorting of 256k = bit integer nubers. On the MoM-3, initially six C-Modules operate in parallel, but the last erging steps are done in the eory of a single C-Module. The CPU ties for the conventional coputers are easured only for the tie to sort the nubers in eory. The two-diensional FIR filter algoriths are easured for different kernel sizes. For all ipleentations, the weights of the kernel are constants copiled into the code, allowing the copilers to replace ultiplications with optiized shift and add sequences. Input to the filter is a 1280 by 1024 pixel grayscale ige using 8 bits per pixel. Only the tie to filter the ige in eory is counted, excluding disk I/O operations. The first group of perfornce figures are easured on the output of the copilers, with all optiizations turned on. The second group of figures are easured on nually optiized code, that takes into account that the sae values are ultiplied to different weights in successing steps of the filter algorith (see section 4.2). This Algoriths 68020, 16 MHz [s] Sparc 10, 50 MHz [s] MoM-3, 33 MHz [s] MergeSort a x3 2-D FIR b x5 2-D FIR b x7 2-D FIR b x3 2-D FIR c x5 2-D FIR c x7 2-D FIR c JPEG d Table 1. CPU tie required (in seconds) a. sort Bit integers in eory b * 1024 grayscale ige, 8 bits per pixel, in eory, copiler optiizations only c * 1024 grayscale ige, 8 bits per pixel, in eory, nually optiized d. IJG cjpeg copression, 704 * 576 RGB ige, 24 bits per pixel, excluding disk I/O 9 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

10 allows to reduce eory cycles by storing these values internally. Multiplications and accuulations are done in the sae DPU. The additional cost of the addition is weighed out far by the reduction of the required rdpa array size. Since all DPUs fit into a single chip in the 3 * 3 case, the coparably high cost (in execution tie) of the serial links can be avoided. The MoM-3 uses all seven C- Modules in parallel on overlapping strips of the input ige to speed up the filter operation. The JPEG copression algorith is based on the IJG progra cjpeg [9], inserting tie easuring code around the copression Algoriths MoM-3 vs , 16 MHz MoM-3 vs. Sparc 10/51 odern MoM-3 vs. Sparc 10/51 a MergeSort b 3x3 2-D FIR c d 5x5 2-D FIR c d 7x7 2-D FIR c d 3x3 2-D FIR e d 5x5 2-D FIR e f 7x7 2-D FIR e f JPEG Table 2. Speed-up Factors 23.6 d a. Sparc 10/51 vs. MoM-3 in Sparc s technology b. critical factor is eory access tie: 60 ns vs 120 ns DRAM access, 30 ns vs. 60 ns SRAM (register file) access, in the first stage of the algorith, the critical factor is the speed of the serial links, which changes to eory access tie for new technology, therefore ore than a factor of two additional speed-up c. copiler optiizations only d. critical factor is solely eory access tie: additional speed-up of two e. nually optiized f. critical factor is speed of serial links: 4 bit vs 2 bit with larger packages, and 66 MHz vs 33 MHz serial link clock, but only an additional speed-up of approxitely three, because with new technology critical factor is eory access tie kernel, after the input ige is read. To exclude disk output, output is redirected to /dev/null. A 704 by 576 RGB colour ige using 24 bits per pixel is used as input for the tie easureents. The JPEG copression algorith is adapted to the MoM-3 so that seven C-Modules operate in parallel. A ore detailed description of the MoM-3 ipleentation of this algorith can be found in [5]. The second colun of table 1 gives the CPU ties in seconds for the ELTEC host, which runs an MC68020 at 16 MHz. The third colun describes the perfornce of a SPARCstation 10/51 running at 50 MHz. The CPU ties of the conventional coputers were all based on copilations (GNU gcc) with all optiizations turned on, producing output specially adapted to exactly the kind of processor inside the coputer. The last colun indicates the tie to run the sae algorith on the MoM-3 prototype (33 MHz version). Table 2 illustrates the speed-up factors obtained copared to the hosting ELTEC workstation and a odern SUN Sparc 10/51. The last colun takes into account, that our prototype is built with an inferior technology copared to a odern Sparcstation. The notes at the botto of the table explain how the extra speed-up factors would be achieved. The rather inor speed-up for the MergeSort application stes fro the fact, that the sorting is not really coputation intensive and can only be parallelized and pipelined at the first stages. Most of the execution tie is spent in the erging process, where ost of the rdpa hardware is idle. The whole algorith is doinated by eory access tie, a reason why the Sparc 10 provides coparably sll speed-ups over the MC68020, as well. 6. Conclusions The MoM-3, a configurable accelerator has been presented, which provides acceleration factors of three orders of gnitude for its outdated host coputer, and speedups in the range of 12 to 62 when copared to a state of the art workstation. High acceleration factors can still be intained, when working fro a high level input language. The copilation is done fro an alost coplete subset of standard C language. Good results are achieved, without requiring user interaction or special copiler directives to generate code for field-prograble devices and the controlling hardware of the MoM-3. The custo designed rdpa circuit provides a coarse grain field-prograble hardware, especially suited for 32-bit datapaths for pipelined arithetic. A new sequencing paradig supports ultiple address generators and a loose coupling between sequencing echanis and ALU. The address generation under hardware control leaves all eory accesses free for data transfers, providing de facto a higher 10 (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

11 eory bandwidth fro the sae eory hardware. The loose coupling of the data sequencing paradig allows to fully exploit the benefits of reconfigurable coputing devices. The cobination of hardwired address generators and configurable devices is the key to speed up both data nipulations and data accesses - and still intain a prograing environent failiar to a conventional software developer. The MoM-3 prototype is currently being built. The address generators, the MoM-3 controller and the ralu control chip have just returned fro fabrication and are now under test. All three are integrated circuits based on 1.0 CMOS standard cells, a technology de available by the EUROCHIP project of the CEC. The rdpa circuits will be subitted to fabrication in a 0.7 CMOS standard cell process soon. All MoM-3 perfornce figures were obtained fro siulations of the copleted circuit designs, which were integrated into a siulation environent, describing the whole MoM-3 at functional level. [12] A. Razavi, I. Shenberg, D. Seltz, D. Fronczak: A High Perfornce JPEG Ige Copression Chip Set for Multiedia Applications; Proceedings of the SPIE - The International Society for Optical Engineering, vol.1903, p , 1993 [13] K. Schidt: A Restructuring Copiler for Xputers; Ph. D. Thesis, University of Kaiserslautern, 1994 References [1] J. M. Arnold, D. A. Buell, E. G. Davis: Splash-2; 4th Annual ACM Syposiu on Parallel Algoriths and Architectures (SPPA 92), pp , ACM Press, San Diego, CA, June/July 1992 [2] P. Bertin, D. Roncin, J. Vuillein: Introduction to Prograble Active Meories; DIGITAL PRL Research Report 3, DIGITAL Paris Research Laboratory, June 1989 [3] D. Bursky: Codec copresses iges in real tie; Electronic Design, vol. 41, no. 20, pp , Oct [4] S. Casseln: Virtual Coputing and The Virtual Coputer; IEEE Workshop on FPGAs for Custo Coputing Machines, FCCM'93, IEEE Coputer Society Press, Napa, CA, pp , April 1993 [5] R. W. Hartenstein, J. Becker, R. Kress, H. Reinig, K. Schidt: A Reconfigurable Machine for Applications in Ige and Video Copression; European Syposiu on Advanced Services and Networks / Conference on Copression Techniques and Standards for Ige and Video Counications, Asterda, March 1995 [6] R. W. Hartenstein, R. Kress, H. Reinig: A Reconfigurable Data-Driven ALU for Xputers; Proceedings of the IEEE Workshop on FPGAs for Custo Coputing Machines, FCCM'94, Napa, CA, April 1994 [7] R. W. Hartenstein, A. G. Hirschbiel, K. Schidt, M. Weber: A Novel Paradig of Parallel Coputation and its Use to Ipleent Siple High-Perfornce Hardware; Future Generation Systes 7 (1991/92), p , Elsevier Science Publishers, North-Holland, 1992 [8] Jaap Hollenberg: The CRAY-2 Coputer Syste; Supercoputer 8/9, pp , July/Septeber 1985 [9] T. G. Lane: cjpeg software, release 5; Independent JPEG Group (IJG), Sept [10] N.N.: The CHS2x4: The World s First Custo Coputer; Algotronix Ltd., 1991 [11] N.N.: The XC4000 Data Book; Xilinx, Inc., (of 11) and technical work on a noncoercial basis. Copyright and all rights therein are intained by the authors or by other copyright infortion will adhere to the ters and constraints invoked by each author's copyright. These works y not be reposted without

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE Reiner W. Hartenstein, Rainer Kress, Helmut Reinig University of Kaiserslautern Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany

More information

R.W. Hartenstein, et al.: A Reconfigurable Arithmetic Datapath Architecture; GI/ITG-Workshop, Schloß Dagstuhl, Bericht 303, pp.

R.W. Hartenstein, et al.: A Reconfigurable Arithmetic Datapath Architecture; GI/ITG-Workshop, Schloß Dagstuhl, Bericht 303, pp. # Algorithms Operations # of DPUs Time Steps per Operation Performance 1 1024 Fast Fourier Transformation *,, - 10 16. 10240 20 ms 2 FIR filter, n th order *, 2(n1) 15 1800 ns/data word 3 FIR filter, n

More information

Using the KressArray for Reconfigurable Computing

Using the KressArray for Reconfigurable Computing Using the KressArray for econfigurable Coputing einer Hartenstein, Michael Herz, Thoas Hoffann, Ulrich Nageldinger University of Kaiserslautern, Erwin-Schroedinger-Strasse, D-67663 Kaiserslautern, Gerany

More information

A Two-Level Hardware/Software Co-Design Framework for Automatic Accelerator Generation

A Two-Level Hardware/Software Co-Design Framework for Automatic Accelerator Generation A Two-Level Hardware/Software Co-Design Framework for Automatic Accelerator Generation Reiner W. Hartenstein, Jürgen Becker, Rainer Kress, Helmut Reinig, Karin Schmidt University of Kaiserslautern Erwin-Schrödinger-Straße,

More information

FPL Implementation of a SIMD RISC RNS-Enabled DSP

FPL Implementation of a SIMD RISC RNS-Enabled DSP FPL Ipleentation of a SIMD RISC RNS-Enabled DSP J. RAMÍREZ (1), A. GARCÍA (1), P. G. FERNÁNDEZ (2), A. LLORIS (1) (1) Dpt. of Electronics and Coputer Technology (2) Dpt. of Electrical Engineering University

More information

RECONFIGURABLE AND MODULAR BASED SYNTHESIS OF CYCLIC DSP DATA FLOW GRAPHS

RECONFIGURABLE AND MODULAR BASED SYNTHESIS OF CYCLIC DSP DATA FLOW GRAPHS RECONFIGURABLE AND MODULAR BASED SYNTHESIS OF CYCLIC DSP DATA FLOW GRAPHS AWNI ITRADAT Assistant Professor, Departent of Coputer Engineering, Faculty of Engineering, Hasheite University, P.O. Box 15459,

More information

Investigation of The Time-Offset-Based QoS Support with Optical Burst Switching in WDM Networks

Investigation of The Time-Offset-Based QoS Support with Optical Burst Switching in WDM Networks Investigation of The Tie-Offset-Based QoS Support with Optical Burst Switching in WDM Networks Pingyi Fan, Chongxi Feng,Yichao Wang, Ning Ge State Key Laboratory on Microwave and Digital Counications,

More information

Performance Analysis of RAID in Different Workload

Performance Analysis of RAID in Different Workload Send Orders for Reprints to reprints@benthascience.ae 324 The Open Cybernetics & Systeics Journal, 2015, 9, 324-328 Perforance Analysis of RAID in Different Workload Open Access Zhang Dule *, Ji Xiaoyun,

More information

Shortest Path Determination in a Wireless Packet Switch Network System in University of Calabar Using a Modified Dijkstra s Algorithm

Shortest Path Determination in a Wireless Packet Switch Network System in University of Calabar Using a Modified Dijkstra s Algorithm International Journal of Engineering and Technical Research (IJETR) ISSN: 31-869 (O) 454-4698 (P), Volue-5, Issue-1, May 16 Shortest Path Deterination in a Wireless Packet Switch Network Syste in University

More information

EE 364B Convex Optimization An ADMM Solution to the Sparse Coding Problem. Sonia Bhaskar, Will Zou Final Project Spring 2011

EE 364B Convex Optimization An ADMM Solution to the Sparse Coding Problem. Sonia Bhaskar, Will Zou Final Project Spring 2011 EE 364B Convex Optiization An ADMM Solution to the Sparse Coding Proble Sonia Bhaskar, Will Zou Final Project Spring 20 I. INTRODUCTION For our project, we apply the ethod of the alternating direction

More information

Evaluation of a multi-frame blind deconvolution algorithm using Cramér-Rao bounds

Evaluation of a multi-frame blind deconvolution algorithm using Cramér-Rao bounds Evaluation of a ulti-frae blind deconvolution algorith using Craér-Rao bounds Charles C. Beckner, Jr. Air Force Research Laboratory, 3550 Aberdeen Ave SE, Kirtland AFB, New Mexico, USA 87117-5776 Charles

More information

MAPPING THE DATA FLOW MODEL OF COMPUTATION INTO AN ENHANCED VON NEUMANN PROCESSOR * Peter M. Maurer

MAPPING THE DATA FLOW MODEL OF COMPUTATION INTO AN ENHANCED VON NEUMANN PROCESSOR * Peter M. Maurer MAPPING THE DATA FLOW MODEL OF COMPUTATION INTO AN ENHANCED VON NEUMANN PROCESSOR * Peter M. Maurer Departent of Coputer Science and Engineering University of South Florida Tapa, FL 33620 Abstract -- The

More information

Clustering. Cluster Analysis of Microarray Data. Microarray Data for Clustering. Data for Clustering

Clustering. Cluster Analysis of Microarray Data. Microarray Data for Clustering. Data for Clustering Clustering Cluster Analysis of Microarray Data 4/3/009 Copyright 009 Dan Nettleton Group obects that are siilar to one another together in a cluster. Separate obects that are dissiilar fro each other into

More information

PERFORMANCE MEASURES FOR INTERNET SERVER BY USING M/M/m QUEUEING MODEL

PERFORMANCE MEASURES FOR INTERNET SERVER BY USING M/M/m QUEUEING MODEL IJRET: International Journal of Research in Engineering and Technology ISSN: 239-63 PERFORMANCE MEASURES FOR INTERNET SERVER BY USING M/M/ QUEUEING MODEL Raghunath Y. T. N. V, A. S. Sravani 2 Assistant

More information

EXAMPLES. Example_OC Ex_OC.asm Example_RPM_1 Ex_RPM_1.as m Example_RPM_2 Ex_RPM_2.as m

EXAMPLES. Example_OC Ex_OC.asm Example_RPM_1 Ex_RPM_1.as m Example_RPM_2 Ex_RPM_2.as m EMCH 367 Fundaentals of Microcontrollers Exaples EXAMPLES Exaples are ainly intended for the students to get started quickly with the basic concepts, prograing language, and hex/binary conventions. The

More information

A Low-cost Memory Architecture with NAND XIP for Mobile Embedded Systems

A Low-cost Memory Architecture with NAND XIP for Mobile Embedded Systems A Low-cost Meory Architecture with XIP for Mobile Ebedded Systes Chanik Park, Jaeyu Seo, Sunghwan Bae, Hyojun Ki, Shinhan Ki and Busoo Ki Software Center, SAMSUNG Electronics, Co., Ltd. Seoul 135-893,

More information

Ascending order sort Descending order sort

Ascending order sort Descending order sort Scalable Binary Sorting Architecture Based on Rank Ordering With Linear Area-Tie Coplexity. Hatrnaz and Y. Leblebici Departent of Electrical and Coputer Engineering Worcester Polytechnic Institute Abstract

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN International Journal of Scientific & Engineering Research, Volue 4, Issue 0, October-203 483 Design an Encoding Technique Using Forbidden Transition free Algorith to Reduce Cross-Talk for On-Chip VLSI

More information

Weeks 1 3 Weeks 4 6 Unit/Topic Number and Operations in Base 10

Weeks 1 3 Weeks 4 6 Unit/Topic Number and Operations in Base 10 Weeks 1 3 Weeks 4 6 Unit/Topic Nuber and Operations in Base 10 FLOYD COUNTY SCHOOLS CURRICULUM RESOURCES Building a Better Future for Every Child - Every Day! Suer 2013 Subject Content: Math Grade 3rd

More information

Trees. Linear vs. Branching CSE 143. Branching Structures in CS. What s in a Node? A Tree. [Chapter 10]

Trees. Linear vs. Branching CSE 143. Branching Structures in CS. What s in a Node? A Tree. [Chapter 10] CSE 143 Trees [Chapter 10] Linear vs. Branching Our data structures so far are linear Have a beginning and an end Everything falls in order between the ends Arrays, lined lists, queues, stacs, priority

More information

OPTIMAL COMPLEX SERVICES COMPOSITION IN SOA SYSTEMS

OPTIMAL COMPLEX SERVICES COMPOSITION IN SOA SYSTEMS Key words SOA, optial, coplex service, coposition, Quality of Service Piotr RYGIELSKI*, Paweł ŚWIĄTEK* OPTIMAL COMPLEX SERVICES COMPOSITION IN SOA SYSTEMS One of the ost iportant tasks in service oriented

More information

1 P a g e. F x,x...,x,.,.' written as F D, is the same.

1 P a g e. F x,x...,x,.,.' written as F D, is the same. 11. The security syste at an IT office is coposed of 10 coputers of which exactly four are working. To check whether the syste is functional, the officials inspect four of the coputers picked at rando

More information

Energy-Efficient Disk Replacement and File Placement Techniques for Mobile Systems with Hard Disks

Energy-Efficient Disk Replacement and File Placement Techniques for Mobile Systems with Hard Disks Energy-Efficient Disk Replaceent and File Placeent Techniques for Mobile Systes with Hard Disks Young-Jin Ki School of Coputer Science & Engineering Seoul National University Seoul 151-742, KOREA youngjk@davinci.snu.ac.kr

More information

I-0 Introduction. I-1 Introduction. Objectives: Quote:

I-0 Introduction. I-1 Introduction. Objectives: Quote: I-0 Introduction Objectives: Explain necessity of parallel/ultithreaded algoriths Describe different fors of parallel processing Present coonly used architectures Introduce a few basic ters Coents: Try

More information

A Generic Architecture for Programmable Trac. Shaper for High Speed Networks. Krishnan K. Kailas y Ashok K. Agrawala z. fkrish,

A Generic Architecture for Programmable Trac. Shaper for High Speed Networks. Krishnan K. Kailas y Ashok K. Agrawala z. fkrish, A Generic Architecture for Prograable Trac Shaper for High Speed Networks Krishnan K. Kailas y Ashok K. Agrawala z fkrish, agrawalag@cs.ud.edu y Departent of Electrical Engineering z Departent of Coputer

More information

Oblivious Routing for Fat-Tree Based System Area Networks with Uncertain Traffic Demands

Oblivious Routing for Fat-Tree Based System Area Networks with Uncertain Traffic Demands Oblivious Routing for Fat-Tree Based Syste Area Networks with Uncertain Traffic Deands Xin Yuan Wickus Nienaber Zhenhai Duan Departent of Coputer Science Florida State University Tallahassee, FL 3306 {xyuan,nienaber,duan}@cs.fsu.edu

More information

Carving Differential Unit Test Cases from System Test Cases

Carving Differential Unit Test Cases from System Test Cases Carving Differential Unit Test Cases fro Syste Test Cases Sebastian Elbau, Hui Nee Chin, Matthew B. Dwyer, Jonathan Dokulil Departent of Coputer Science and Engineering University of Nebraska - Lincoln

More information

Geo-activity Recommendations by using Improved Feature Combination

Geo-activity Recommendations by using Improved Feature Combination Geo-activity Recoendations by using Iproved Feature Cobination Masoud Sattari Middle East Technical University Ankara, Turkey e76326@ceng.etu.edu.tr Murat Manguoglu Middle East Technical University Ankara,

More information

Scheduling Parallel Real-Time Recurrent Tasks on Multicore Platforms

Scheduling Parallel Real-Time Recurrent Tasks on Multicore Platforms IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL., NO., NOV 27 Scheduling Parallel Real-Tie Recurrent Tasks on Multicore Platfors Risat Pathan, Petros Voudouris, and Per Stenströ Abstract We

More information

A Novel Fast Constructive Algorithm for Neural Classifier

A Novel Fast Constructive Algorithm for Neural Classifier A Novel Fast Constructive Algorith for Neural Classifier Xudong Jiang Centre for Signal Processing, School of Electrical and Electronic Engineering Nanyang Technological University Nanyang Avenue, Singapore

More information

A Trajectory Splitting Model for Efficient Spatio-Temporal Indexing

A Trajectory Splitting Model for Efficient Spatio-Temporal Indexing A Trajectory Splitting Model for Efficient Spatio-Teporal Indexing Slobodan Rasetic Jörg Sander Jaes Elding Mario A. Nasciento Departent of Coputing Science University of Alberta Edonton, Alberta, Canada

More information

Design Optimization of Mixed Time/Event-Triggered Distributed Embedded Systems

Design Optimization of Mixed Time/Event-Triggered Distributed Embedded Systems Design Optiization of Mixed Tie/Event-Triggered Distributed Ebedded Systes Traian Pop, Petru Eles, Zebo Peng Dept. of Coputer and Inforation Science, Linköping University {trapo, petel, zebpe}@ida.liu.se

More information

QUERY ROUTING OPTIMIZATION IN SENSOR COMMUNICATION NETWORKS

QUERY ROUTING OPTIMIZATION IN SENSOR COMMUNICATION NETWORKS QUERY ROUTING OPTIMIZATION IN SENSOR COMMUNICATION NETWORKS Guofei Jiang and George Cybenko Institute for Security Technology Studies and Thayer School of Engineering Dartouth College, Hanover NH 03755

More information

A High-Speed VLSI Fuzzy Inference Processor for Trapezoid-Shaped Membership Functions *

A High-Speed VLSI Fuzzy Inference Processor for Trapezoid-Shaped Membership Functions * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 21, 607-626 (2005) A High-Speed VLSI Fuzzy Inference Processor for Trapezoid-Shaped Mebership Functions * SHIH-HSU HUANG AND JIAN-YUAN LAI + Departent of

More information

MATRIX CALCULATION BACKWARD CHAINING IN RULE BASED EXPERT SYSTEM

MATRIX CALCULATION BACKWARD CHAINING IN RULE BASED EXPERT SYSTEM 4 th Research/Expert Conference with International Participation QUAITY 5, Fojnica, B&H, Noveber 9 -, 5 ATRIX CACUATION BACKWARD CHAINING IN RUE BASED EXPERT SYSTE sc aro Hell Universit of Split, Facult

More information

Computer Aided Drafting, Design and Manufacturing Volume 26, Number 2, June 2016, Page 13

Computer Aided Drafting, Design and Manufacturing Volume 26, Number 2, June 2016, Page 13 Coputer Aided Drafting, Design and Manufacturing Volue 26, uber 2, June 2016, Page 13 CADDM 3D reconstruction of coplex curved objects fro line drawings Sun Yanling, Dong Lijun Institute of Mechanical

More information

Grading Results Total 100

Grading Results Total 100 University of California, Berkeley College of Engineering Departent of Electrical Engineering and Coputer Sciences Fall 2003 Instructor: Dave Patterson 2003-11-19 v1.9 CS 152 Exa #2 Solutions Personal

More information

Enhancing Real-Time CAN Communications by the Prioritization of Urgent Messages at the Outgoing Queue

Enhancing Real-Time CAN Communications by the Prioritization of Urgent Messages at the Outgoing Queue Enhancing Real-Tie CAN Counications by the Prioritization of Urgent Messages at the Outgoing Queue ANTÓNIO J. PIRES (1), JOÃO P. SOUSA (), FRANCISCO VASQUES (3) 1,,3 Faculdade de Engenharia da Universidade

More information

CS 361 Meeting 8 9/24/18

CS 361 Meeting 8 9/24/18 CS 36 Meeting 8 9/4/8 Announceents. Hoework 3 due Friday. Review. The closure properties of regular languages provide a way to describe regular languages by building the out of sipler regular languages

More information

IN recent years, due to the widespread use of the orthogonal. Design of an 8192-point Sequential I/O FFT Chip. Yun-Nan Chang, Member, IEEE

IN recent years, due to the widespread use of the orthogonal. Design of an 8192-point Sequential I/O FFT Chip. Yun-Nan Chang, Member, IEEE Proceedings of the World Congress on Engineering and Coputer Science 1 Vol II WCECS 1, October -6, 1, San Francisco, USA esign of an 819-point Sequential I/O FFT Chip Yun-Nan Chang, Meber, IEEE Abstract

More information

Structural Balance in Networks. An Optimizational Approach. Andrej Mrvar. Faculty of Social Sciences. University of Ljubljana. Kardeljeva pl.

Structural Balance in Networks. An Optimizational Approach. Andrej Mrvar. Faculty of Social Sciences. University of Ljubljana. Kardeljeva pl. Structural Balance in Networks An Optiizational Approach Andrej Mrvar Faculty of Social Sciences University of Ljubljana Kardeljeva pl. 5 61109 Ljubljana March 23 1994 Contents 1 Balanced and clusterable

More information

Two-level Partitioning of Image Processing Algorithms for the Parallel Map-oriented Machine

Two-level Partitioning of Image Processing Algorithms for the Parallel Map-oriented Machine Two-level Partitioning of Image Processing Algorithms for the Parallel Map-oriented Machine Reiner W. Hartenstein, Jürgen Becker, Rainer Kress University of Kaiserslautern Erwin-Schrödinger-Straße, D-67663

More information

Page 1. Embedded (Computer) System. The User s View of a Computer. Four Views of a Computer. Machine/assembly Language Programmer s View

Page 1. Embedded (Computer) System. The User s View of a Computer. Four Views of a Computer. Machine/assembly Language Programmer s View Four Views of a Coputer The User s View of a Coputer The user s view The prograer s view The architect s view The hardware designer s view The user sees for factors, software, speed, storage capacity,

More information

A Research of MapReduce with GPU Acceleration

A Research of MapReduce with GPU Acceleration A Research of MapReduce with GPU Acceleration Miao Xin 1, Hao Li 2, Joan Lu 2 1 School of Inforation Science and Engineering, Yunnan University, Kuning, Yunnan, China 2 School of Software, Yunnan University,

More information

Detection of Outliers and Reduction of their Undesirable Effects for Improving the Accuracy of K-means Clustering Algorithm

Detection of Outliers and Reduction of their Undesirable Effects for Improving the Accuracy of K-means Clustering Algorithm Detection of Outliers and Reduction of their Undesirable Effects for Iproving the Accuracy of K-eans Clustering Algorith Bahan Askari Departent of Coputer Science and Research Branch, Islaic Azad University,

More information

Four Views of a Computer. The User s View of a Computer

Four Views of a Computer. The User s View of a Computer Four Views of a Coputer The user ss view The prograer s view The architect s view The hardware designer s view The User s View of a Coputer The user sees for factors, software, speed, storage capacity,

More information

TensorFlow and Keras-based Convolutional Neural Network in CAT Image Recognition Ang LI 1,*, Yi-xiang LI 2 and Xue-hui LI 3

TensorFlow and Keras-based Convolutional Neural Network in CAT Image Recognition Ang LI 1,*, Yi-xiang LI 2 and Xue-hui LI 3 2017 2nd International Conference on Coputational Modeling, Siulation and Applied Matheatics (CMSAM 2017) ISBN: 978-1-60595-499-8 TensorFlow and Keras-based Convolutional Neural Network in CAT Iage Recognition

More information

The optimization design of microphone array layout for wideband noise sources

The optimization design of microphone array layout for wideband noise sources PROCEEDINGS of the 22 nd International Congress on Acoustics Acoustic Array Systes: Paper ICA2016-903 The optiization design of icrophone array layout for wideband noise sources Pengxiao Teng (a), Jun

More information

Defining and Surveying Wireless Link Virtualization and Wireless Network Virtualization

Defining and Surveying Wireless Link Virtualization and Wireless Network Virtualization 1 Defining and Surveying Wireless Link Virtualization and Wireless Network Virtualization Jonathan van de Belt, Haed Ahadi, and Linda E. Doyle The Centre for Future Networks and Counications - CONNECT,

More information

Predicting x86 Program Runtime for Intel Processor

Predicting x86 Program Runtime for Intel Processor Predicting x86 Progra Runtie for Intel Processor Behra Mistree, Haidreza Haki Javadi, Oid Mashayekhi Stanford University bistree,hrhaki,oid@stanford.edu 1 Introduction Progras can be equivalent: given

More information

Image Filter Using with Gaussian Curvature and Total Variation Model

Image Filter Using with Gaussian Curvature and Total Variation Model IJECT Vo l. 7, Is s u e 3, Ju l y - Se p t 016 ISSN : 30-7109 (Online) ISSN : 30-9543 (Print) Iage Using with Gaussian Curvature and Total Variation Model 1 Deepak Kuar Gour, Sanjay Kuar Shara 1, Dept.

More information

Solving the Damage Localization Problem in Structural Health Monitoring Using Techniques in Pattern Classification

Solving the Damage Localization Problem in Structural Health Monitoring Using Techniques in Pattern Classification Solving the Daage Localization Proble in Structural Health Monitoring Using Techniques in Pattern Classification CS 9 Final Project Due Dec. 4, 007 Hae Young Noh, Allen Cheung, Daxia Ge Introduction Structural

More information

A simplified approach to merging partial plane images

A simplified approach to merging partial plane images A siplified approach to erging partial plane iages Mária Kruláková 1 This paper introduces a ethod of iage recognition based on the gradual generating and analysis of data structure consisting of the 2D

More information

Module Contact: Dr Rudy Lapeer (CMP) Copyright of the University of East Anglia Version 1

Module Contact: Dr Rudy Lapeer (CMP) Copyright of the University of East Anglia Version 1 UNIVERSITY OF EAST ANGLIA School of Coputing Sciences Main Series UG Exaination 2016-17 GRAPHICS 1 CMP-5010B Tie allowed: 2 hours Answer THREE questions. Notes are not peritted in this exaination Do not

More information

Modeling Parallel Applications Performance on Heterogeneous Systems

Modeling Parallel Applications Performance on Heterogeneous Systems Modeling Parallel Applications Perforance on Heterogeneous Systes Jaeela Al-Jaroodi, Nader Mohaed, Hong Jiang and David Swanson Departent of Coputer Science and Engineering University of Nebraska Lincoln

More information

THE rounding operation is performed in almost all arithmetic

THE rounding operation is performed in almost all arithmetic This is the author's version of an article that has been published in this journal. Changes were ade to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/.9/tvlsi.5.538

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Example: Computer, SRC Simple RISC Computer

Example: Computer, SRC Simple RISC Computer 2-24 Chapter 2 Machines, Machine Languages, and Digital Logic Exaple: Coputer, SRC Siple RISC Coputer 32 general purpose registers of 32 bits 32-bit progra counter, PC, and instruction register, IR 2 32

More information

Theoretical Analysis of Local Search and Simple Evolutionary Algorithms for the Generalized Travelling Salesperson Problem

Theoretical Analysis of Local Search and Simple Evolutionary Algorithms for the Generalized Travelling Salesperson Problem Theoretical Analysis of Local Search and Siple Evolutionary Algoriths for the Generalized Travelling Salesperson Proble Mojgan Pourhassan ojgan.pourhassan@adelaide.edu.au Optiisation and Logistics, The

More information

HIGH PERFORMANCE PRE-SEGMENTATION ALGORITHM FOR SONAR IMAGES

HIGH PERFORMANCE PRE-SEGMENTATION ALGORITHM FOR SONAR IMAGES HIGH PERFORMANCE PRE-SEGMENTATION ALGORITHM FOR SONAR IMAGES Benjain Lehann*, Konstantinos Siantidis*, Dieter Kraus** *ATLAS ELEKTRONIK GbH Sebaldsbrücker Heerstraße 235 D-28309 Breen, GERMANY Eail: benjain.lehann@atlas-elektronik.co

More information

An Efficient Approach for Content Delivery in Overlay Networks

An Efficient Approach for Content Delivery in Overlay Networks An Efficient Approach for Content Delivery in Overlay Networks Mohaad Malli, Chadi Barakat, Walid Dabbous Projet Planète, INRIA-Sophia Antipolis, France E-ail:{alli, cbarakat, dabbous}@sophia.inria.fr

More information

arxiv: v1 [cs.dc] 3 Oct 2018

arxiv: v1 [cs.dc] 3 Oct 2018 Sparse Winograd Convolutional neural networks on sall-scale systolic arrays Feng Shi, Haochen Li, Yuhe Gao, Benjain Kuschner, Song-Chun Zhu University of California Los Angeles Los Angeles, USA {shi.feng,sczh}@cs.ucla.edu

More information

An Ad Hoc Adaptive Hashing Technique for Non-Uniformly Distributed IP Address Lookup in Computer Networks

An Ad Hoc Adaptive Hashing Technique for Non-Uniformly Distributed IP Address Lookup in Computer Networks An Ad Hoc Adaptive Hashing Technique for Non-Uniforly Distributed IP Address Lookup in Coputer Networks Christopher Martinez Departent of Electrical and Coputer Engineering The University of Texas at San

More information

A Fast Multi-Objective Genetic Algorithm for Hardware-Software Partitioning In Embedded System Design

A Fast Multi-Objective Genetic Algorithm for Hardware-Software Partitioning In Embedded System Design A Fast Multi-Obective Genetic Algorith for Hardware-Software Partitioning In Ebedded Syste Design 1 M.Jagadeeswari, 2 M.C.Bhuvaneswari 1 Research Scholar, P.S.G College of Technology, Coibatore, India

More information

Analysing Real-Time Communications: Controller Area Network (CAN) *

Analysing Real-Time Communications: Controller Area Network (CAN) * Analysing Real-Tie Counications: Controller Area Network (CAN) * Abstract The increasing use of counication networks in tie critical applications presents engineers with fundaental probles with the deterination

More information

Privacy-preserving String-Matching With PRAM Algorithms

Privacy-preserving String-Matching With PRAM Algorithms Privacy-preserving String-Matching With PRAM Algoriths Report in MTAT.07.022 Research Seinar in Cryptography, Fall 2014 Author: Sander Sii Supervisor: Peeter Laud Deceber 14, 2014 Abstract In this report,

More information

Scope-aware Data Cache Analysis for WCET Estimation

Scope-aware Data Cache Analysis for WCET Estimation Scope-aware Data Cache Analysis for WCET Estiation Bach Khoa Huynh National University of Singapore huynhbac@cop.nus.edu.sg Lei Ju Shandong University julei@sdu.edu.cn (Corresponding author) Abhik Roychoudhury

More information

Smarter Balanced Assessment Consortium Claims, Targets, and Standard Alignment for Math

Smarter Balanced Assessment Consortium Claims, Targets, and Standard Alignment for Math Sarter Balanced Assessent Consortiu s, s, Stard Alignent for Math The Sarter Balanced Assessent Consortiu (SBAC) has created a hierarchy coprised of clais targets that together can be used to ake stateents

More information

Zone-Based Shortest Positioning Time First Scheduling for MEMS-Based Storage Devices

Zone-Based Shortest Positioning Time First Scheduling for MEMS-Based Storage Devices Zone-Based Shortest Positioning Tie First Scheduling for MEMS-Based Storage Devices Bo Hong, Scott A. Brandt, Darrell D. E. Long, Ethan L. Miller, Karen A. Glocer and Zachary N. J. Peterson Storage Systes

More information

Multiple operand addition and multiplication

Multiple operand addition and multiplication Multiple operand addition and ultiplication by SHANKER SINGH and RONALD WAXMAN International Business Machines Corporation Poughkeepsie, N ew York INTRODUCTION Traditionally, adders used in sall- and ediu-sized

More information

Design and Implementation of an Acyclic Stable Matching Scheduler

Design and Implementation of an Acyclic Stable Matching Scheduler Design and Ipleentation of an Acyclic Stable Matching Scheduler Enyue Lu Mei Yang Yi Zhang ands.q.zheng Dept. of Coputer Science Dept. of Coputer Science Dept. of Electrical Engineering University of Texas

More information

Vision Based Mobile Robot Navigation System

Vision Based Mobile Robot Navigation System International Journal of Control Science and Engineering 2012, 2(4): 83-87 DOI: 10.5923/j.control.20120204.05 Vision Based Mobile Robot Navigation Syste M. Saifizi *, D. Hazry, Rudzuan M.Nor School of

More information

Deterministic Voting in Distributed Systems Using Error-Correcting Codes

Deterministic Voting in Distributed Systems Using Error-Correcting Codes IEEE TRASACTIOS O PARALLEL AD DISTRIBUTED SYSTEMS, VOL. 9, O. 8, AUGUST 1998 813 Deterinistic Voting in Distributed Systes Using Error-Correcting Codes Lihao Xu and Jehoshua Bruck, Senior Meber, IEEE Abstract

More information

Recursion-Driven Parallel Code Generation for Multi-Core Platforms

Recursion-Driven Parallel Code Generation for Multi-Core Platforms Recursion-Driven Parallel Generation for Multi- Platfors Rebecca L. Collins, Bharadwaj Vellore, and Luca P. Carloni Departent of Coputer Science, Colubia University, New York, NY 10027 Eail: {rlc2119,vrb2102,luca}@cs.colubia.edu

More information

ECE ECE4680

ECE ECE4680 ECE468. -4-7 The otivation for s System ECE468 Computer Organization and Architecture DRA Hierarchy System otivation Large memories (DRA) are slow Small memories (SRA) are fast ake the average access time

More information

Smarter Balanced Assessment Consortium Claims, Targets, and Standard Alignment for Math

Smarter Balanced Assessment Consortium Claims, Targets, and Standard Alignment for Math Sarter Balanced Assessent Consortiu Clais, s, Stard Alignent for Math The Sarter Balanced Assessent Consortiu (SBAC) has created a hierarchy coprised of clais targets that together can be used to ake stateents

More information

A Beam Search Method to Solve the Problem of Assignment Cells to Switches in a Cellular Mobile Network

A Beam Search Method to Solve the Problem of Assignment Cells to Switches in a Cellular Mobile Network A Bea Search Method to Solve the Proble of Assignent Cells to Switches in a Cellular Mobile Networ Cassilda Maria Ribeiro Faculdade de Engenharia de Guaratinguetá - DMA UNESP - São Paulo State University

More information

Gromov-Hausdorff Distance Between Metric Graphs

Gromov-Hausdorff Distance Between Metric Graphs Groov-Hausdorff Distance Between Metric Graphs Jiwon Choi St Mark s School January, 019 Abstract In this paper we study the Groov-Hausdorff distance between two etric graphs We copute the precise value

More information

Technical Information For the Macintosh Performa 400 series

Technical Information For the Macintosh Performa 400 series Technical Inforation For the Macintosh Perfora 400 series Specifications Processor Macintosh Perfora 410 MC68020, 32-bit architecture, 16 egahertz (MHz) clock frequency Macintosh Perfora 460, 466, and

More information

THE rapid growth and continuous change of the real

THE rapid growth and continuous change of the real IEEE TRANSACTIONS ON SERVICES COMPUTING, VOL. 8, NO. 1, JANUARY/FEBRUARY 2015 47 Designing High Perforance Web-Based Coputing Services to Proote Teleedicine Database Manageent Syste Isail Hababeh, Issa

More information

Region Segmentation Region Segmentation

Region Segmentation Region Segmentation /7/ egion Segentation Lecture-7 Chapter 3, Fundaentals of Coputer Vision Alper Yilaz,, Mubarak Shah, Fall UCF egion Segentation Alper Yilaz,, Mubarak Shah, Fall UCF /7/ Laer epresentation Applications

More information

Designing High Performance Web-Based Computing Services to Promote Telemedicine Database Management System

Designing High Performance Web-Based Computing Services to Promote Telemedicine Database Management System Designing High Perforance Web-Based Coputing Services to Proote Teleedicine Database Manageent Syste Isail Hababeh 1, Issa Khalil 2, and Abdallah Khreishah 3 1: Coputer Engineering & Inforation Technology,

More information

A Broadband Spectrum Sensing Algorithm in TDCS Based on ICoSaMP Reconstruction

A Broadband Spectrum Sensing Algorithm in TDCS Based on ICoSaMP Reconstruction MATEC Web of Conferences 73, 03073(08) https://doi.org/0.05/atecconf/087303073 SMIMA 08 A roadband Spectru Sensing Algorith in TDCS ased on I Reconstruction Liu Yang, Ren Qinghua, Xu ingzheng and Li Xiazhao

More information

Adaptive Holistic Scheduling for In-Network Sensor Query Processing

Adaptive Holistic Scheduling for In-Network Sensor Query Processing Adaptive Holistic Scheduling for In-Network Sensor Query Processing Hejun Wu and Qiong Luo Departent of Coputer Science and Engineering Hong Kong University of Science & Technology Clear Water Bay, Kowloon,

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Collaborative Web Caching Based on Proxy Affinities

Collaborative Web Caching Based on Proxy Affinities Collaborative Web Caching Based on Proxy Affinities Jiong Yang T J Watson Research Center IBM jiyang@usibco Wei Wang T J Watson Research Center IBM ww1@usibco Richard Muntz Coputer Science Departent UCLA

More information

Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors

Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors Francisco Barat, Murali Jayapala, Pieter Op de Beeck and Geert Deconinck K.U.Leuven, Belgium. {f-barat, j4murali}@ieee.org,

More information

Data & Knowledge Engineering

Data & Knowledge Engineering Data & Knowledge Engineering 7 (211) 17 187 Contents lists available at ScienceDirect Data & Knowledge Engineering journal hoepage: www.elsevier.co/locate/datak An approxiate duplicate eliination in RFID

More information

6.1 Topological relations between two simple geometric objects

6.1 Topological relations between two simple geometric objects Chapter 5 proposed a spatial odel to represent the spatial extent of objects in urban areas. The purpose of the odel, as was clarified in Chapter 3, is ultifunctional, i.e. it has to be capable of supplying

More information

Colorado School of Mines. Computer Vision. Professor William Hoff Dept of Electrical Engineering &Computer Science.

Colorado School of Mines. Computer Vision. Professor William Hoff Dept of Electrical Engineering &Computer Science. Professor Willia Hoff Dept of Electrical Engineering &Coputer Science http://inside.ines.edu/~whoff/ 1 Caera Calibration 2 Caera Calibration Needed for ost achine vision and photograetry tasks (object

More information

Design and Analysis of Online Arithmetic Operators for Streaming Data in FPGAs

Design and Analysis of Online Arithmetic Operators for Streaming Data in FPGAs Design and Analysis of Online Arithetic Operators for Streaing Data in FPGAs Georgina Binoy Joseph Centre for Excellence in VLSI Design, KCG College of Technology Chennai, India. R.Devanathan Hindustan

More information

Optimal Route Queries with Arbitrary Order Constraints

Optimal Route Queries with Arbitrary Order Constraints IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL.?, NO.?,? 20?? 1 Optial Route Queries with Arbitrary Order Constraints Jing Li, Yin Yang, Nikos Maoulis Abstract Given a set of spatial points DS,

More information

NON-RIGID OBJECT TRACKING: A PREDICTIVE VECTORIAL MODEL APPROACH

NON-RIGID OBJECT TRACKING: A PREDICTIVE VECTORIAL MODEL APPROACH NON-RIGID OBJECT TRACKING: A PREDICTIVE VECTORIAL MODEL APPROACH V. Atienza; J.M. Valiente and G. Andreu Departaento de Ingeniería de Sisteas, Coputadores y Autoática Universidad Politécnica de Valencia.

More information

PROBABILISTIC LOCALIZATION AND MAPPING OF MOBILE ROBOTS IN INDOOR ENVIRONMENTS WITH A SINGLE LASER RANGE FINDER

PROBABILISTIC LOCALIZATION AND MAPPING OF MOBILE ROBOTS IN INDOOR ENVIRONMENTS WITH A SINGLE LASER RANGE FINDER nd International Congress of Mechanical Engineering (COBEM 3) Noveber 3-7, 3, Ribeirão Preto, SP, Brazil Copyright 3 by ABCM PROBABILISTIC LOCALIZATION AND MAPPING OF MOBILE ROBOTS IN INDOOR ENVIRONMENTS

More information

AGV PATH PLANNING BASED ON SMOOTHING A* ALGORITHM

AGV PATH PLANNING BASED ON SMOOTHING A* ALGORITHM International Journal of Software Engineering & Applications (IJSEA), Vol.6, No.5, Septeber 205 AGV PATH PLANNING BASED ON SMOOTHING A* ALGORITHM Xie Yang and Cheng Wushan College of Mechanical Engineering,

More information

BP-MAC: A High Reliable Backoff Preamble MAC Protocol for Wireless Sensor Networks

BP-MAC: A High Reliable Backoff Preamble MAC Protocol for Wireless Sensor Networks BP-MAC: A High Reliable Backoff Preable MAC Protocol for Wireless Sensor Networks Alexander Klein* Jirka Klaue Josef Schalk EADS Innovation Works, Gerany *Eail: klein@inforatik.uni-wuerzburg.de ABSTRACT:

More information

Medical Biophysics 302E/335G/ st1-07 page 1

Medical Biophysics 302E/335G/ st1-07 page 1 Medical Biophysics 302E/335G/500 20070109 st1-07 page 1 STEREOLOGICAL METHODS - CONCEPTS Upon copletion of this lesson, the student should be able to: -define the ter stereology -distinguish between quantitative

More information

Mapping Data in Peer-to-Peer Systems: Semantics and Algorithmic Issues

Mapping Data in Peer-to-Peer Systems: Semantics and Algorithmic Issues Mapping Data in Peer-to-Peer Systes: Seantics and Algorithic Issues Anastasios Keentsietsidis Marcelo Arenas Renée J. Miller Departent of Coputer Science University of Toronto {tasos,arenas,iller}@cs.toronto.edu

More information

Chapter Seven. Large & Fast: Exploring Memory Hierarchy

Chapter Seven. Large & Fast: Exploring Memory Hierarchy Chapter Seven Large & Fast: Exploring Memory Hierarchy 1 Memories: Review SRAM (Static Random Access Memory): value is stored on a pair of inverting gates very fast but takes up more space than DRAM DRAM

More information

Chapter 13 Reduced Instruction Set Computers

Chapter 13 Reduced Instruction Set Computers Chapter 13 Reduced Instruction Set Computers Contents Instruction execution characteristics Use of a large register file Compiler-based register optimization Reduced instruction set architecture RISC pipelining

More information