DMA-based Prefetching for I/O-Intensive Workloads on the Cell Architecture


M. Mustafa Rafique, Ali R. Butt, and Dimitrios S. Nikolopoulos
Dept. of Computer Science, Virginia Tech, Blacksburg, Virginia, USA
{mustafa, butta,

ABSTRACT

The recent advent of asymmetric multi-core processors such as the Cell Broadband Engine (Cell/BE) has popularized the use of heterogeneous architectures. A growing body of research is exploring the use of such architectures, especially in High-End Computing, for supporting scientific applications. However, prior research has focused on the use of the available Cell/BE operating systems and runtime environments for supporting compute-intensive jobs. Data- and I/O-intensive workloads have largely been ignored in this domain. In this paper, we take the first steps in supporting I/O-intensive workloads on the Cell/BE and deriving guidelines for optimizing the execution of I/O workloads on heterogeneous architectures. We explore various performance-enhancing techniques for such workloads on an actual Cell/BE system. Among the techniques we explore, an asynchronous prefetching-based approach, which uses the PowerPC core of the Cell/BE for file prefetching and decentralized DMAs from the synergistic processing cores (SPEs), improves the performance of I/O workloads that include an encryption/decryption component by 22.2%, compared to I/O performed naïvely from the SPEs. Our evaluation shows promising results and lays the foundation for developing more efficient I/O support libraries for multi-core asymmetric architectures.
Categories and Subject Descriptors
C.1.2 [Processor Architecture]: Multiple Data Stream Architecture; D.4.4 [Operating Systems]: Input/output

General Terms
Design, Experimentation, Performance

Keywords
Cell Broadband Engine, High-Performance Computing, I/O-Intensive Workloads

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CF'08, May 5-7, 2008, Ischia, Italy. Copyright 2008 ACM /08/05...$

1. INTRODUCTION

Asymmetric multi-core processors are widely regarded as a viable path to sustaining high performance without compromising reliability. Given a fixed transistor budget, asymmetric multi-core processors invest heavily in many simple, tightly coupled, accelerator-type cores. These cores are typically designed with custom Instruction Set Architectures (ISAs) and features that enable acceleration of computational kernels operating on vector data. Researchers have collected mounting evidence on the superiority of asymmetric multi-core processors in terms of performance, scalability, and power-efficiency [5, 24, 40, 42, 43]. Application-specific asymmetric multi-core architectures have previously been used extensively in network processors [51]. The recent advent of the Cell Broadband Engine (Cell/BE) processor [45] as a high-performance computing and data processing engine [3, 7, 8, 19, 23, 41, 48] further attests to the potential of emerging asymmetric multi-core architectures. This paper explores the use of the Cell/BE, arguably a dominant asymmetric multi-core processor, in I/O-intensive applications.
With modern high-performance computing applications generating and processing exponentially increasing amounts of data, the scalable parallel processing capabilities, large on-chip data transfer bandwidth, and aggressive latency overlap mechanisms of the Cell/BE render it an attractive platform for high-performance I/O. Although several recent efforts have demonstrated the potential of the Cell/BE for high-speed computation on data staged through its accelerator cores (SPEs) [19, 23], there is little understanding of how I/O operations interact with the architecture of the Cell/BE. The implications of such Cell/BE characteristics as asymmetry, DMA latency overlap, and software management of disjoint address spaces on the design and implementation of the I/O software stack have not been explored. This paper addresses these important questions and makes the following contributions:

- A study of the I/O path in the currently available Cell/BE operating system and in the accelerator support library;
- An exploration of various alternative I/O methods that can be applied to the Cell/BE architecture;
- An investigation of the impact of data prefetching techniques on improving I/O performance for the Cell/BE architecture; and
- An evaluation and recommendation of appropriate methods for handling I/O-intensive workloads.

Our evaluation reveals that allowing individual accelerators to perform direct I/O faces the bottleneck of all I/O requests being routed through the main core. Thus, we argue that (i) if the current OS and library support is not to be extended, the most efficient technique for performing I/O is to allow the main core to pre-stage (prefetch) the data for the accelerator cores, and (ii) the performance can be improved if the accelerator support library is extended to do direct I/O, hence removing the said bottleneck. We explore several I/O optimization schemes on the Cell/BE, involving prefetching and staging of data between cores. An asynchronous prefetching scheme, which combines prefetching from the PowerPC core (PPE) with asynchronous DMAs from the synergistic processing cores (SPEs), improves the performance of I/O workloads by up to 22.2%, compared to naïve I/O from the SPEs. We also re-affirm the intuition that the Cell SPEs have significant acceleration capabilities, which can be leveraged in compute-intensive components of I/O software stacks, such as encryption/decryption and compression.

The rest of this paper is organized as follows. Section 2 provides background and motivation for the research presented in the paper. Section 3 describes the Cell/BE architecture in detail. Section 4 describes our experimental setting and workloads, followed by a presentation and evaluation of several schemes to improve I/O performance on the Cell/BE. Section 5 discusses related work. Section 6 concludes the paper.

2. MOTIVATION AND BACKGROUND

In this section, we describe the background of this work and outline the enabling technologies for this research. We are concerned with the implementation of efficient I/O schemes for data-intensive applications on asymmetric multi-core processors. We assume processors with heterogeneous cores, heterogeneous ISAs, and disjoint address spaces between cores of different technology.
This organization provides for a simplified hardware design which supports high raw computational speed and data transfer bandwidth, at the cost of increased programming complexity. Applications leverage the processor by offloading their time-consuming computational kernels to accelerator-type cores and by using the on-chip interconnection network to efficiently stage ("stream") data to the local storage space of the accelerators. Clearly, acceleration capabilities are relevant to I/O-intensive applications with significant I/O processing components such as encryption and compression. Besides acceleration of vector data processing, the design of the I/O subsystem on asymmetric multi-core processors merits further investigation. A design consideration of particular importance is the distribution of the I/O processing path between the cores of the processor. Current designs run the operating system on the conventional host cores of the processor (e.g., the PowerPC PPE of the Cell/BE) and route all I/O requests made from the accelerator cores (e.g., the SPEs of the Cell/BE) through the host cores. While this design simplifies the system software architecture, it imposes bottlenecks. In particular, parallel I/O from the accelerator-type cores, which are typically many more than the host cores, may suffer from serialization and queuing at the host cores.

[Figure 1: Cell Broadband Engine system architecture, showing the Power Processing Element (with its execution unit and L1/L2 caches), the Synergistic Processor Elements, the Memory Interface Controller, the Flex I/O Bus Interface Controller, and the Element Interconnect Bus.]

In contrast to conventional processors, asymmetric multi-core processors delegate more control of the memory hierarchy to software. The Cell/BE maps the local store of each accelerator core to the virtual address space, and enables direct transfers to and from the local stores of any core through a DMA mechanism. The DMA mechanism further enables overlap of multiple DMA requests with computation on each core.
This capability extends naturally to the I/O subsystem, which should properly stage data from the disk, to off-chip memory, to on-chip memory, so that the non-overlapped data transfer latency is minimized. The design space for data staging in the I/O system involves tuning of the unit of data transfer between layers of the memory hierarchy, synchronous and asynchronous prefetching algorithms to stage data in a timely manner in the local store of accelerator cores for further processing, and synchronization and communication mechanisms between host cores and accelerator cores to coordinate I/O requests.

3. CELL ARCHITECTURE

We now present details of the Cell/BE architecture. Given the focus of this work, we also describe the path that is taken by application I/O requests before being delivered to the disk. Figure 1 shows the high-level system components making up the Cell/BE, namely, the PowerPC Processor Element, a number of Synergistic Processor Elements, the Memory Interface Controller, two non-coherent I/O interfaces, and the Element Interconnect Bus.

3.1 PowerPC Processor Element

The execution controller of the Cell/BE is a general-purpose 64-bit PowerPC Processor Element (PPE), currently operating at 3.2 GHz [25, 29]. The PPE is a dual-issue, two-way multithreaded core. The PPE boasts a 32 KB instruction and 32 KB data Level-1 cache, and a 512 KB Level-2 cache. The PPE can theoretically execute two double-precision or eight single-precision operations per clock cycle. In current installations, the PPE runs Linux, with Cell/BE-specific extensions that provide access to the accelerator-type cores of the processor to user-space libraries. The PPE itself implements a superscalar architecture, out-of-order and SIMD (AltiVec) instruction execution, as well as non-blocking caches [45].

3.2 Synergistic Processor Elements

The intended use of the PPE on the Cell/BE is mainly execution control, running the operating system and supporting legacy applications. The major portion of the computational workload is handled by multiple Synergistic Processor Elements (SPEs) [22]. SPEs are optimized vector engines and operate at the global processor frequency of 3.2 GHz. Each SPE consists of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC). Each SPE also has an embedded software-managed SRAM referred to as the Local Store. The local store is similar to scratch-pad memory. The SPE has exclusive access to its local store, and the local store holds both the executable running on the SPE and the data needed by the executable. SPEs can access external DRAM and memory-mapped remote local stores exclusively through DMA operations. The PPE can also access the SPE local stores through DMAs. PPE accesses to an SPE local store are processed with higher priority than local loads and stores issued by the SPE. The size of the local store on current processor models is 256 KB. SPEs execute code read directly from their local stores and may issue a very limited number of system calls, including I/O calls. These calls utilize a stop-and-signal instruction and are routed to the PPE for kernel-level processing. The PPE and SPEs have different instruction sets; therefore, applications running on the Cell/BE are divided into two executables. The main executable runs on the PPE and uses a POSIX-like interface for creating and triggering threads on the SPEs. The SPE threads can utilize high-level vector processing library operations, expressed using directives, to leverage the SIMD execution units, as well as a get/put interface to execute DMAs and access main memory and/or the local stores of other SPEs. SPE thread management, vector intrinsics, and high-level primitives for DMA transfers are provided by a user-level runtime library (libspe2). The PPE and SPEs may execute threads in parallel and synchronize through either DMAs or a mailbox mechanism.
SPEs are expected to run to completion, as operating system support for preemptive time-slicing of SPEs is currently at an experimental stage.

3.3 Memory Interface Controller

The Memory Interface Controller (MIC) is responsible for providing the PPE and SPEs access to the main system memory. The MIC supports a dual-channel Rambus XIO macro that interfaces to external XDR Rambus DRAM. The XIO operates at a maximum frequency of 3.2 GHz. Each XIO channel can have eight memory banks with a total memory size of 256 MB, limiting the total memory size of a single-processor system to 512 MB. Observed peak raw memory bandwidth is stated to be 25.6 GB/s at 3.2 GHz with both XIO channels [29]; however, such estimates for the peak bandwidth assume that all the banks are fully engaged by incoming request streams, and that all the requests are made up of only reads or writes of 128 bytes. In the more typical case of blended reads and writes, the estimated effective bandwidth is 21 GB/s [45].

3.4 I/O Controller

The I/O controller is an off-chip component that provides an interface to external network, disk, and other I/O devices. The Cell/BE I/O controller, called FlexIO, is also based on Rambus. The FlexIO has twelve one-byte-wide links, five of which are point-to-point inbound paths to the Cell/BE, and the remaining seven are outbound transmit links [14]. The links are configured in two logical interfaces, referred to as Input/Output Interfaces (IOIF). Each link operates at 5 GHz, and the IOIF provides raw bandwidth of 35 GB/s outbound and 25 GB/s inbound. However, the actual data and commands are transmitted as packets, which incur an overhead due to the presence of metadata such as the command identifier, data tags, and data size [45]. As a result, the effective bandwidth that can be attained is reduced to between 50% and 80% of the raw bandwidth. The operating system running on the PPE supports transparent application access to the I/O controller. The I/O requests from SPEs are handled by the PPE operating system.
Currently, applications do not have direct access to the controller.

3.5 Element Interconnect Bus

All the components making up the Cell/BE, i.e., the PPE, the SPEs, the off-chip I/O interfaces, and the MIC, communicate through a shared Element Interconnect Bus (EIB) [28] using DMA transfers, supported by the MFCs. The EIB operates at half the system clock rate. The EIB is designed as a circular ring comprised of four 16-byte-wide unidirectional data channels. Two of the data channels run in the clockwise direction, and the other two run in the anticlockwise direction. Each channel is capable of conveying up to three concurrent transactions. Both the PPE and SPEs use the EIB to transfer data to and from the main memory. The PPE accesses main memory with normal load and store instructions through the EIB. The PPE can also issue DMA put and get commands to and from the local storage of SPEs, which can be mapped to the virtual address space. An SPE accesses both main memory and the local storage of other SPEs exclusively with DMA commands. The MFC of each SPE runs at the same frequency as the EIB, supports naturally aligned DMA transfers of 1, 2, 4, 8, or a multiple of 16 bytes, and includes a DMA list that can be used to execute up to 2048 DMA transfers with a single DMA command. The maximum supported size of a single DMA request from SPEs is 16 KB. Before sending data onto the EIB, each requesting unit sends out a small number of initial command requests. Each request on the EIB uses one command credit. The number of credits reflects the size of the command buffer of the EIB for that particular request. The EIB returns the credit back to the requesting unit when a slot becomes available in the command buffer due to a previous request moving ahead in the request pipeline. Different elements utilize the EIB by issuing a request, which is queued by a bus arbiter process. In case of contention for the bus by multiple elements, the arbiter strives for an optimal allocation of data channels to the requesters.
In order to avoid stalling any read requests, highest priority is given to the memory controller, while all other components are treated equally, with their requests served in a round-robin fashion. Furthermore, the data ring is not granted to a requester if the requested transfer would interfere with any other data transfer, or if it would have to travel more than halfway around the ring to reach its destination. Each unit can simultaneously send and receive 16 bytes of data on every bus cycle on the EIB. The maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system, which is one per bus cycle. Since each address request can transfer up to 128 bytes, the theoretical peak bandwidth on the EIB at 3.2 GHz is calculated to be 128 B x (3.2/2) GHz = 204.8 GB/s. However, the location of source and destination relative to each other, interference between existing and new transfers, the number of Cell/BE chips in the system, and whether the data transfer is to local stores or to main memory are some of the factors [45] that reduce sustained data bandwidth from its theoretical peak. Moreover, if all the data requests are in the same direction, half of the rings will be idle, thus reducing the data bandwidth on the EIB to at most half of its peak value.

[Figure 2: I/O request paths on the PPE and SPE. On the PPE, file system accesses pass through the VFS interface and buffer cache to kernel I/O (with prefetching and I/O clustering), the I/O controller, and the disk; on the SPE, file system accesses are handed over to the PPE via system calls.]

3.6 PPE/SPE I/O Path

Figure 2 shows the path that application I/O requests from the PPE and SPEs follow before being serviced by the disk. The PPE supports a full operating system, and the I/O path on the PPE is a standard one. All I/O requests through the VFS layer are first sent to the buffer cache. In case of a miss in the buffer cache, a request to read the data from the disk is issued. The kernel clustering mechanisms combine multiple requests for contiguous blocks, and the kernel prefetching algorithm detects and prefetches blocks to reduce execution stalls due to synchronous disk requests. Interested readers are referred to [9] for a detailed explanation of the standard Linux I/O path. Since the SPE does not support a native operating system, there is no kernel context on the SPE, and all system calls issued by SPEs are handled as external assisted library calls. We now discuss how external calls are supported. Note that the same process is used to service all system calls from SPEs on the PPE, not just the I/O-related calls.
The SPE uses special stop-and-signal instructions [26] to hand over control to the PPE for handling external service requests. In order to perform an external assisted library call, the SPE allocates local store memory to hold the input and output parameters of the call and copies all the input parameters (from stack or registers) into this memory. It then combines a special function opcode corresponding to the requested service with the address of this input/output memory image to form a 32-bit message. The SPE then places this message into the local store memory, immediately followed by a stop instruction. It then signals the PPE to execute the library function on behalf of the SPE. In response to the signal, the PPE reads the assisted call message from the SPE's local store, and uses the stop-and-signal type and opcode to dispatch control (the PPE context) to the specified assisted call handler. The handler on the PPE retrieves the input parameters for the assisted call from the local-store memory pointed to by the assisted call message, and executes the appropriate system call on the PPE. On completion of the system call, the return values are placed into the same local store memory, and the SPE is signaled to resume execution. Upon resumption, the library on the SPE reads the return values from the memory image and places them into the return registers, hence completing the call. Thus all I/O calls on the SPEs are routed through the PPE operating system.

[Table 1: Average time (in msec) required by major tasks (SPE context creation, program loading on SPE, thread creation on SPE, buffer allocation, file reading, buffer deallocation, total SPE execution time, and total time) while reading a 2 GB file from the disk on the PPE and an SPE.]

4. EVALUATION OF I/O IMPROVING TECHNIQUES

In this section, we present and evaluate a number of I/O-improving techniques for the Cell/BE architecture. For our evaluation, we use different I/O workloads executed on a Sony PlayStation 3 (PS3).
The PS3 is a hypervisor-controlled platform. It has 6 active SPEs with 256 KB of local storage each, 256 MB of main memory of which about 200 MB is directly accessible to the operating system (OS), and a 60 GB hard disk. Although the Cell/BE has 8 SPEs, on the PS3 one SPE is reserved for running the hypervisor and another SPE is deactivated. Accesses to storage devices, including the disk, are routed through the hypervisor with dedicated hypercalls, and their completion is communicated to the OS through virtual interrupts. Due to the proprietary nature of the PS3 hypervisor, it is not possible to assess the overhead it imposes on accesses to storage devices for the purpose of this work. In the following, we first evaluate the characteristics of our experimental platform by running simple workloads, then we explore how the SPEs can be used to handle I/O-intensive tasks such as data encryption. Finally, to account for experimental errors, the presented numbers represent averages over three different runs unless otherwise stated.

4.1 Identity Tests

In the first set of experiments, we create a workload that reads a large file of size 2 GB. We refer to this experiment as the Identity Test. The goal of this Test is to determine the maximum I/O bandwidth available on our experimental platform for the PPE and SPEs. Table 1 shows the timing breakdown for the Identity Test both on the PPE and an SPE using a block size of 16 KB. Note that context creation, loading, and thread creation are only

needed when running the Test on the SPE. It is observed that the time to read the file on the SPE is similar to that on the PPE. The table also shows that the cost of the context loading steps on the SPE is relatively insignificant. However, this cost can become crucial if SPE workloads are repeatedly loaded or if the execution time of the SPE program is small.

[Table 2: The average time (in msec) and observed throughput (in MB/s) for reading a 2 GB file from disk on the PPE and an SPE using block sizes of 4 KB and 16 KB.]

Next, we modified the block size to 4 KB, and determined the overall time it would take to perform the Identity Test both on the PPE and the SPE. Table 2 shows the results, and a comparison with the previous case. We observe that while changing block sizes does not have a significant effect on the PPE I/O throughput, the larger block size gives better throughput on the SPE. The reason for this improved throughput is that data transfers between the SPE and the memory are done via DMA, and DMA is optimized by using the maximum transfer size per DMA operation, which for the Cell/BE is 16 KB. For this reason, in our remaining experiments we set the buffer size to 16 KB.

[Table 3: Time measured (in msec) at the PPE for sending data to SPEs through DMA under buffer sizes of 1 KB, 2 KB, 4 KB, 8 KB, and 16 KB, using one and six SPEs.]

Next, we repeated the Identity Test while increasing the number of SPEs from one to six. Figure 3 shows the result. For this experiment, the PPE invokes one thread on each of the available SPEs; however, the total size of data read is the same as before, i.e., 2 GB. A different file is read at each SPE so that unique requests are issued and any caching at the I/O controller and/or memory does not come into play.

[Figure 3: Average time and observed throughput for simultaneously reading a 2 GB file using 16 KB blocks from one to six SPEs.]
As seen in the figure, the average observed throughput decreases as the number of SPEs reading the file increases. The average observed I/O throughput is reduced by 6.4% when all six available SPEs are used, compared to the case of using a single SPE for the same amount of data. This is due to increased contention for the EIB, and indicates that simply offloading I/O-intensive jobs to multiple SPEs is unlikely to yield the best use of resources.

4.2 Workload-based Tests

In this section, we present the results of running an I/O-intensive workload on the Cell/BE architecture. We first describe our workload, followed by a detailed investigation of techniques to support the workload on the PS3.

4.2.1 Workload Overview

The workload that we have chosen is a 256-bit encryption/decryption application. Our choice is dictated by the computation-intensive component of the encryption and decryption, along with the need to do large I/O transfers for reading the input and writing the output. The workload reads a file from the disk, encrypts or decrypts it, and then writes back the results. Given that the PS3 has only about 200 MB of main memory available to user programs, we chose to encrypt a 64 MB file. This allows us to keep the entire file in memory if so needed and isolate the effects of buffer caching, etc. We also vectorized the computation phase of our workload to achieve high performance on SPEs, which improved the time taken in the computation phase by 42.1%.

4.2.2 Effect of DMA Request Size

Our evaluation requires that the computation be offloaded to specific SPEs. Therefore, we first evaluate the effect of DMA buffer sizes on such offloading. Table 3 shows the time of computation offloading as we varied the buffer size used for DMA communications between the PPE and SPE. Note that these buffers are different from the file I/O block size of the previous experiments (which is fixed at 16 KB). We focused on the decryption phase of our workload for this experiment.
[Table 4: Breakdown of time spent (in msec) in different portions of the code when data is exchanged between an SPE and the PPE through DMA buffer sizes of 4 KB and 16 KB. Rows: number of times the SPE is loaded, SPE loading (excluding execution) time, SPE execution (including loading) time, CPU time used by the SPE, disk read time, disk write time, CPU time for disk read operations, CPU time for disk write operations, execution time of the program, and CPU time used by the PPE.]

In this case, all I/O is performed at the PPE, which, after reading a full buffer of data from disk, passes its address in main memory to an SPE. The SPE uses the passed address to do a DMA transfer and brings the contents of the buffer into its local store. The SPE then processes the

data in the local store, and upon completion of the computation issues another DMA to transfer the processed contents back to main memory. Finally, the PPE can write the updated buffer in main memory back to the disk. Note that the maximum size of a single channel DMA that can be sent on the EIB is 16 KB; thus the maximum DMA size in our experiments is limited to that. The whole experiment is repeated for two cases: using a single SPE, and using all six SPEs. These results show that increasing the buffer size improves the execution times of our workload.

[Table 5: Time (in msec) for reading the workload file at the PPE/SPE followed by access from the SPE/PPE.]

Timing breakdown for the cases of 4 KB and 16 KB DMA buffers. For the previously described experiment, we also performed a detailed timing analysis for 4 KB and 16 KB DMAs using a single SPE. Table 4 shows the results. This experiment was conducted to see the effect of different DMA sizes on the time spent in various parts of the program. For the same input file, when the DMA size is increased from 4 KB to 16 KB, the number of times the PPE has to invoke a thread on an SPE is reduced by a factor of 4, thus reducing SPE loading time. The number of times the SPE is loaded to perform the same task also affects the total execution time, since it cuts down the number of times initialization is required on the SPE. Table 4 shows that the total execution time for the same workload is less when the SPE and the PPE communicate with each other through DMA operations and a block size of 16 KB than when using a block size of 4 KB for the same data set. Observe that the total execution time is significantly less when using 16 KB blocks compared to 4 KB blocks.
This is due to the fact that the total time also includes the time required at the SPE to fetch the data into its local store through DMA operations, and the number of DMA operations done by the SPE for 16 KB blocks is 4 times less than that for 4 KB blocks for the same data set.

4.2.3 Impact of File Caching

As discussed in Section 3, the I/O system calls from the SPE are handed over to the PPE for handling. This implies that once a file (or a portion of a file) is accessed by the PPE, it may be in memory when subsequent accesses to the file are issued from an SPE or the PPE, and these accesses can be serviced fast. In this experiment, we aim at confirming this empirical observation. First, we flushed any file cache by reading a large file (2 GB). Then we read the 64 MB workload file on the PPE, followed by reading the same file at an SPE. Table 5 shows the result for reading a file cold first on the PPE, followed by reading at an SPE. The same experiment is repeated for first reading the file at an SPE, followed by the PPE. From the table, we conclude that the caching effect is noticeable, and can help in reducing I/O times both on the PPE and on the SPEs, by first reading a file on the PPE. We also notice that file reading on the SPE is slower due to the I/O being routed through the PPE.

[Figure 4: Timing breakdown of different tasks for all six schemes. The bars are labeled as follows: (a) file read at PPE, (b) file read at SPE, (c) DMA read at SPE, (d) encryption, (e) DMA write at SPE, (f) file write at SPE, (g) write wait at PPE, (h) file write at PPE, (i) waiting time, and (j) miscellaneous.]

Given the effectiveness of file caching, we now explore a number of schemes to improve the I/O performance of our workload. For the following experiments, we utilize the encryption phase of our workload. Figure 4 shows the results. In some schemes, tasks are executed in parallel at the PPE and SPEs.
This is shown as two side-by-side bars for a scheme, with the total execution time dictated by the higher of the two bars. The breakdown for the various steps is also shown.

Scheme 1: SPE performs all tasks. Under this scheme, we perform all the tasks of our workload, i.e., reading the input file (b), processing it (d), and writing the output file (f), on the SPE. Note, however, that we still utilize the PPE to invoke the tasks as a single program on the SPE.

Scheme 2: Synchronous File Prefetching by the PPE. In this scheme, we attempt to improve the overall performance of our workload by allowing the PPE to prefetch the input file into memory. This scheme is driven by the above observation that subsequent accesses by SPEs to a file read earlier by the PPE improve I/O times due to file caching. For this purpose, the PPE first pre-reads the entire file, causing it to be brought into memory. Then the program from Scheme 1 is executed as before. Results in Figure 4 show that the file read at the SPE (b) is much faster for this scheme, compared to Scheme 1. However, the time it takes to read the file on the PPE (a) is 81.6% longer compared to the file read at the SPE (b) in Scheme 1. We believe this is due to the PPE flooding the I/O controller queue, and the lack of overlapping opportunities between computation and I/O in a sequential read compared to the read-and-process cycle of Scheme 1. Hence, Scheme 2 shows promise in terms of improving SPE read times, but suffers from slow I/O times on the PPE. The overall workload execution time is longer in Scheme 2 than in Scheme 1.

Table 6: Timing breakdown (in msec.) of various tasks for scalability tests for Scheme 1 and Scheme 6 using six SPEs.

Task                    Scheme 1 Time   Scheme 6 Time
Read at 1 SPE                 -               -
Read at PPE                   -               -
Process at 1 SPE              -               -
Write at 1 SPE                -               -
DMA read at 1 SPE             -               -
DMA write at 1 SPE            -               -
DMA wait at 1 SPE             -               25
Write wait at PPE             -               99
Write at PPE                  -               -
Total (6 SPEs)                -               -

Scheme 3: Asynchronous Prefetching by the PPE. In the next scheme, we try to remove the file reading bottleneck of Scheme 2. For this purpose, we create a separate thread to prefetch the file into memory. Simultaneously, we offload the program of Scheme 1 to the SPE. The goal is to allow the prefetching by the PPE to overlap with computation on the SPE; thus any data accessed by the SPE will already be in memory and the overall performance of the workload will improve. Note that we do not have to worry about synchronizing the prefetching thread on the PPE with the I/O on the SPE. In case the PPE thread is ahead of the SPE, no problems would arise. However, if the SPE gets ahead of the PPE thread, the SPE's I/O request will automatically cause the data to be brought into memory, which in turn will make the PPE read the file faster, thus once again getting ahead of the SPE. The integrity of data read by the SPE will not be compromised. It is observed from the results in Figure 4 that although the I/O times (a) for individual steps increase, better I/O/computation overlapping resulted in an overall improvement of 4.7%, compared to Scheme 2. This shows that the PPE can facilitate I/O for SPEs, and doing so results in improved performance.

Scheme 4: Synchronous DMA by the SPE. So far, we have attempted to improve SPE performance by indirectly bringing the file in memory and implicitly improving the performance of the SPE workload. However, such schemes are prone to problems if the system flushes the file read by the PPE from the buffer cache before it can be read by the SPE, hence negating any advantage of a PPE-assisted prefetch.
In this scheme, we explicitly prefetch the file on the PPE and give the SPE the address of the memory where the file data is available. The SPE program is modified to not do direct I/O, but rather use the addresses provided by the PPE. Hence, the PPE will read the input file in memory, give its address to the SPE to process, the SPE will create the output in memory, and finally the PPE will write the file back to the disk. The SPE will use DMA to map portions of the mapped file to its local store and send the results back. Figure 4 shows the results. Here, we observe that the DMA read at SPE (c) takes 55.0% and 62.0% less time than File read at SPE (b) in Scheme 2 and Scheme 3, respectively. However, the synchronous reading of the file in this scheme takes long, causing the overall times to not improve as much: 4.9% and 0.2% compared to Scheme 2 and Scheme 3, respectively.

Scheme 5: Asynchronous DMA by the SPE. To mitigate the effect of the blocking read, we once again try the approach of Scheme 3 and utilize a separate thread to read the input file asynchronously. The DMA handover and processing at the SPE are similar to that of Scheme 4. Figure 4 shows the results. A caveat here is that there is no automatic syncing of the prefetch thread and the SPE process, as was the case in Scheme 3. If the SPE process gets ahead of the PPE prefetch thread, it will process junk data from memory where the input file has not been loaded. Hence, although this scheme is promising, it cannot guarantee the correctness of the operation.

Scheme 6: Asynchronous DMA by the SPE with Signaling. The main shortcoming in Scheme 5 is the lack of a signaling mechanism between the prefetching thread producing the data (reading into memory) and the SPE consuming the data. One way to address this is to use the mailbox abstraction supported by the Cell/BE. However, documentation [27] advises against using mailboxes given their slow performance. Therefore, we use DMA-based shared memory as a signaling mechanism to keep the prefetching thread synchronized with the SPEs.
The PPE starts a thread to read the input file, and simultaneously also starts the SPE process. The difference from Scheme 5 is that the prefetching thread continuously updates a status location in main memory with the offset of the file read so far, and uses this location to determine how much of the data has been produced by the SPE for writing back to the output file. Moreover, the SPE process, instead of blindly accessing memory assuming it contains valid input data, periodically uses DMA to access a pre-specified memory status location. In case the prefetching thread is lagging, the SPE process will busy-wait and recheck the status location until the required data is loaded into memory. Finally, the SPE can also use the shared location to specify the amount of processed output. This allows the PPE to simultaneously write back the output to the disk, and achieve an additional improvement over Scheme 5, where output was written back only after the entire input was processed. Thus, Scheme 6 achieves both reading of the input file and writing of the output file in parallel with the processing of the data. Figure 4 shows the results, which are quite promising. Scheme 6 achieves 22.2%, 24.1%, and 24.0% improvement in overall performance compared to Scheme 1, Scheme 3, and Scheme 4, respectively.

Scalability Test. In order to test the scalability of Scheme 6, we tested it by fully parallelizing it to the 6 available SPEs on the Cell/BE, and compared the result with the scaled version of Scheme 1, where all the I/O is managed by the SPEs. For this experiment, we made the following changes to the workloads of Scheme 1 and Scheme 6, while keeping the total input size unchanged (i.e., 64 MB). For Scheme 1, the PPE starts one thread for each of the six SPEs. Each SPE thread reads an input file of one sixth of the total size, encrypts it, and writes the resulting buffer back to the disk. The total input size across all the SPEs remains 64 MB.
For Scheme 6, we still read the 64 MB file on the PPE, but instead of giving the entire workload to a single SPE, it is evenly distributed among the six SPEs. Each SPE processes its portion of the buffered data as follows. The first 16 KB block in the input buffer is processed by one SPE, the next 16 KB block in the same buffer is processed by another SPE, and so on. Once the PPE has read the file completely in main memory, it waits for the output to be produced by the SPEs before writing it back to the disk. Table 6 shows the results of this experiment. The total time for Scheme 1 also includes the time spent by the PPE to wait for all six SPE threads to complete their execution (i.e., barrier time). Note that parallelizing read and write operations among SPEs provides considerable speedup, 44.7% and 44.0% respectively, compared to Scheme 1 in Section 4.2 where the same amount of data is read and written by a single SPE. Also note that the average time to read the file at each SPE is less than the total file reading time on the PPE because each SPE reads only a fraction of the file read by the PPE. The results show that Scheme 6 performs better (8.7%) than Scheme 1 when all available SPEs are utilized in the Cell/BE, albeit by a narrower margin compared to the case where a single SPE is used. Scheme 6 improves performance only by about 23.9% when it is scaled from one to six SPEs. This result is attributed to several reasons. First, the file reading time for the scaled version of Scheme 6 is considerably more than the original version because here the PPE also has to compute the reserved status locations based upon the block number that has just been read from the disk. File reading time also increases because of EIB contention, since now all six available SPEs along with the PPE are using the EIB to read and write data in main memory. Secondly, the DMA wait time at each SPE increases significantly in the scaled version of Scheme 6 (25 msec.) as compared to Scheme 6 using a single SPE (0.07 msec.), although each SPE in the former case is required to process 1/6 of the total data processed in the latter case. This happens also because of EIB contention and suboptimal routing of the DMA requests on the EIB rings.
4.3 Discussion

Our evaluation has shown that to achieve good I/O performance, the I/O block sizes and the DMA buffer sizes should be matched to the maximum DMA channel size of 16 KB. Further, we observed a clear benefit of prefetching a file using the PPE and then offloading it to the SPE, rather than letting the SPEs do the I/O directly. One observed bottleneck is that all I/O from SPEs is sent to the PPE for handling, which results in performance degradation. Our DMA-based approach using signaling provided the best performance for our workload. We recommend using similar techniques for I/O intensive workloads with the current OS implementation on the Cell/BE. An important observation is that by allowing SPEs to do DMA to a prefetched file in memory, the bottleneck of doing centralized I/O is removed: each SPE directly goes to the memory through DMAs rather than going through the PPE for I/O. This indicates that incorporating I/O functionality in the SPE library code rather than relying on the PPE OS can yield promising results. The trade-off lies in the fact that loading full I/O capabilities onto the SPEs reduces the space available in SPE local storage for running other compute-intensive tasks. We plan to explore this avenue by developing direct I/O functionality in libspe and investigating the aforementioned trade-off in the context of realistic I/O workloads.

5. RELATED WORK

We discuss related research on the Cell/BE and on prefetching for improving I/O performance.

Cell/BE. The Cell/BE has been the subject of several application studies, including particle transport codes [41], numerical kernels [1, 3], irregular graph algorithms [4], and algorithms for sequence alignment and phylogenetic tree construction [7, 44]. More recent studies explored the potential of the Cell/BE for accelerating the processing of large data volumes and used the Cell/BE to implement fast sorting [19], query processing [23], and data mining [8] algorithms.
Our contribution departs from earlier work by focusing on the implementation of I/O operations in the Cell/BE system software stack. The Cell/BE has also spurred several efforts for developing high-level programming models and supporting environments for simplifying code development and optimization. These efforts include Sequoia [17], Cell SuperScalar [6], CorePy [37], and PPE-SPE code generators from single-source modules [16, 49]. Our research is based on the generic Linux I/O interfaces; however, it is conceptually related to programming models that explicitly manage the memory hierarchy by staging data vertically through the machine and localizing computation to specific layers of the memory hierarchy [17].

Prefetching. A key technique for improving I/O performance of workloads is prefetching, which dates back to as early as Multics [18]. A large amount of work on I/O prefetching utilizes hints about an application's I/O behavior, e.g., programmer-inserted hints [12, 39], compiler-inferred hints [36], and hints prescribed by a binary rewriter [13]. Alternatively, dynamic prefetching has been proposed that detects applications' reference patterns at runtime, e.g., prediction using probability graphs [21, 50], and time series modeling [47]. Prefetch algorithms tailored for parallel I/O systems have also been studied [2, 30, 31]. Speculative prefetching at the level of whole files or database objects has been proposed by many works [15, 20, 34, 35, 38]. The interaction between prefetching and caching has also been identified [9, 10]. Based on these interactions, a number of works have proposed integrated caching and prefetching schemes [2, 11, 30, 31, 32, 39, 46] that simultaneously identify and handle temporal and spatial I/O access patterns. FlexiCache [33] provides a new flexible interface that allows easy modification of disk cache management decisions using OS-level modules.
In this paper, we explore how basic prefetching techniques can be employed to improve the performance of I/O intensive workloads on the Cell/BE architecture. To the best of our knowledge, this is the first exploration of such techniques in the Cell/BE setting.

6. CONCLUSION

We investigated prefetching-based techniques for supporting I/O intensive workloads involving significant computation components on the Cell/BE architecture. We observed that the current operating system facilities for performing I/O directly on accelerator cores (SPEs) are limited, and do not provide judicious use of the available resources. A particular concern is that currently, I/O on SPEs is redirected to the PPE, hence creating a central bottleneck. We have presented an asynchronous prefetching-based approach that partially breaks up this bottleneck and utilizes decentralized DMA to achieve 22.2% better performance for our workload compared to the case where all I/O is handled at the SPE. However, we argue that a fundamentally better approach would be to extend SPE support libraries with I/O functionality, thus removing the dependence on the PPE, simplifying the SPE program design for I/O intensive workloads, and improving overall performance. We are currently investigating the feasibility of such library support.

Acknowledgment

This research is supported by the NSF (grants CCF, CCF, CNS, CNS, CNS, CNS, NSF CAREER Award CCF), the DOE (grants DE-FG02-06ER25751, DE-FG02-05ER25689), and by IBM through an IBM Faculty Award (Virginia Tech Foundation grant VTF). In addition, M. Mustafa Rafique is supported by a Scholarship from the Fulbright Foreign Student Program of the U.S. Department of State, funded in part by the Government of Pakistan.

7. REFERENCES

[1] S. Alam, J. Meredith, and J. Vetter. Balancing Productivity and Performance on the Cell Broadband Engine. In Proc. IEEE CLUSTER.
[2] S. Albers and M. Büttner. Integrated prefetching and caching in single and parallel disk systems. In Proc. ACM SPAA.
[3] D. Bader and V. Agarwal. FFTC: Fastest Fourier Transform for the IBM Cell Broadband Engine. In Proc. IEEE HiPC.
[4] D. A. Bader, V. Agarwal, and K. Madduri. On the design and analysis of irregular algorithms on the Cell processor: A case study of list ranking. In Proc. IEEE IPDPS.
[5] S. Balakrishnan, R. Rajwar, M. Upton, and K. Lai. The Impact of Performance Asymmetry in Emerging Multicore Architectures. In Proc. ISCA.
[6] P. Bellens, J. M. Pérez, R. M. Badia, and J. Labarta. CellSs: a Programming Model for the Cell BE Architecture. In Proc. SC.
[7] F. Blagojevic, A. Stamatakis, C. Antonopoulos, and D. Nikolopoulos. RAxML-CELL: Parallel Phylogenetic Tree Construction on the Cell Broadband Engine.
In Proc. IEEE IPDPS.
[8] G. Buehrer and S. Parthasarathy. The Potential of the Cell Broadband Engine for Data Mining. Technical Report TR, Department of Computer Science and Engineering, Ohio State University.
[9] A. R. Butt, C. Gniady, and Y. C. Hu. The performance impact of kernel prefetching on buffer cache replacement algorithms. IEEE ToC, 56(7).
[10] P. Cao, E. W. Felten, A. R. Karlin, and K. Li. A study of integrated prefetching and caching strategies. SIGMETRICS Performance Evaluation Review, 23(1).
[11] P. Cao, E. W. Felten, A. R. Karlin, and K. Li. Implementation and performance of integrated application-controlled file caching, prefetching, and disk scheduling. ACM TOCS, 14(4).
[12] P. Cao, E. W. Felten, and K. Li. Implementation and performance of application-controlled file caching. In Proc. USENIX OSDI.
[13] F. W. Chang and G. A. Gibson. Automatic I/O hint generation through speculative execution. In Proc. USENIX OSDI.
[14] K. Chang, S. Pamarti, K. Kaviani, E. Alon, X. Shi, T. Chin, J. Shen, G. Yip, C. Madden, R. Schmitt, C. Yuan, F. Assaderaghi, and M. Horowitz. Clocking and Circuit Design for a Parallel I/O on a First-Generation Cell Processor. In Proc. IEEE ISSCC.
[15] K. M. Curewitz, P. Krishnan, and J. S. Vitter. Practical prefetching via data compression. In Proc. ACM SIGMOD.
[16] A. E. Eichenberger et al. Using Advanced Compiler Technology to Exploit the Performance of the Cell Broadband Engine Architecture. IBM Systems Journal, 45(1):59-84.
[17] K. Fatahalian et al. Sequoia: Programming the Memory Hierarchy. In Proc. SC.
[18] R. J. Feiertag and E. I. Organick. The Multics input/output system. In Proc. ACM SOSP.
[19] B. Gedik, R. Bordawekar, and P. S. Yu. CellSort: High performance sorting on the Cell processor. In Proc. VLDB.
[20] J. Griffioen and R. Appleton. Reducing file system latency using a predictive approach. In Proc. USENIX Summer Technical Conf.
[21] J. Griffioen and R. Appleton. Performance measurements of automatic prefetching. In Proc. PDSC.
[22] M. Gschwind, H. P.
Hofstee, B. Flachs, M. Hopkins, Y. Watanabe, and T. Yamazaki. Synergistic processing in Cell's multicore architecture. IEEE Micro, 26(2):10-24.
[23] S. Heman, N. Nes, M. Zukowski, and P. Boncz. Vectorized Data Processing on the Cell Broadband Engine. In Proc. DaMoN.
[24] M. Hill and M. Marty. Amdahl's Law in the Multi-core Era. Technical Report 1593, Department of Computer Sciences, University of Wisconsin-Madison.
[25] H. P. Hofstee. Power efficient processor architecture and the Cell processor. In Proc. IEEE HPCA.
[26] IBM Corp. Cell Broadband Engine Linux Reference Implementation Application Binary Interface Specification, Version 1.0.
[27] IBM Corp. Cell Broadband Engine Software Development Kit 2.1 Programmer's Guide (Version 2.1).
[28] IBM Corp. Cell Broadband Engine Architecture (Version 1.02).
[29] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell microprocessor. IBM Journal of Research and Development, 49(4/5).
[30] M. Kallahalla and P. J. Varman. Optimal prefetching and caching for parallel I/O systems. In Proc. ACM SPAA.


More information

Transient analysis of wave propagation in 3D soil by using the scaled boundary finite element method

Transient analysis of wave propagation in 3D soil by using the scaled boundary finite element method Southern Cross University epublications@scu 23r Australasian Conference on the Mechanics of Structures an Materials 214 Transient analysis of wave propagation in 3D soil by using the scale bounary finite

More information

The Pennsylvania State University. The Graduate School. College of Engineering PFFTC: AN IMPROVED FAST FOURIER TRANSFORM

The Pennsylvania State University. The Graduate School. College of Engineering PFFTC: AN IMPROVED FAST FOURIER TRANSFORM The Pennsylvania State University The Graduate School College of Engineering PFFTC: AN IMPROVED FAST FOURIER TRANSFORM FOR THE IBM CELL BROADBAND ENGINE A Thesis in Computer Science and Engineering by

More information

Reconstructing the Nonlinear Filter Function of LILI-128 Stream Cipher Based on Complexity

Reconstructing the Nonlinear Filter Function of LILI-128 Stream Cipher Based on Complexity Reconstructing the Nonlinear Filter Function of LILI-128 Stream Cipher Base on Complexity Xiangao Huang 1 Wei Huang 2 Xiaozhou Liu 3 Chao Wang 4 Zhu jing Wang 5 Tao Wang 1 1 College of Engineering, Shantou

More information

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST

Chapter 5 Memory Hierarchy Design. In-Cheol Park Dept. of EE, KAIST Chapter 5 Memory Hierarchy Design In-Cheol Park Dept. of EE, KAIST Why cache? Microprocessor performance increment: 55% per year Memory performance increment: 7% per year Principles of locality Spatial

More information

Workspace as a Service: an Online Working Environment for Private Cloud

Workspace as a Service: an Online Working Environment for Private Cloud 2017 IEEE Symposium on Service-Oriente System Engineering Workspace as a Service: an Online Working Environment for Private Clou Bo An 1, Xuong Shan 1, Zhicheng Cui 1, Chun Cao 2, Donggang Cao 1 1 Institute

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Throughput Characterization of Node-based Scheduling in Multihop Wireless Networks: A Novel Application of the Gallai-Edmonds Structure Theorem

Throughput Characterization of Node-based Scheduling in Multihop Wireless Networks: A Novel Application of the Gallai-Edmonds Structure Theorem Throughput Characterization of Noe-base Scheuling in Multihop Wireless Networks: A Novel Application of the Gallai-Emons Structure Theorem Bo Ji an Yu Sang Dept. of Computer an Information Sciences Temple

More information

Shift-map Image Registration

Shift-map Image Registration Shift-map Image Registration Svärm, Linus; Stranmark, Petter Unpublishe: 2010-01-01 Link to publication Citation for publishe version (APA): Svärm, L., & Stranmark, P. (2010). Shift-map Image Registration.

More information

Research Article REALFLOW: Reliable Real-Time Flooding-Based Routing Protocol for Industrial Wireless Sensor Networks

Research Article REALFLOW: Reliable Real-Time Flooding-Based Routing Protocol for Industrial Wireless Sensor Networks Hinawi Publishing Corporation International Journal of Distribute Sensor Networks Volume 2014, Article ID 936379, 17 pages http://x.oi.org/10.1155/2014/936379 Research Article REALFLOW: Reliable Real-Time

More information

Lab work #8. Congestion control

Lab work #8. Congestion control TEORÍA DE REDES DE TELECOMUNICACIONES Grao en Ingeniería Telemática Grao en Ingeniería en Sistemas e Telecomunicación Curso 2015-2016 Lab work #8. Congestion control (1 session) Author: Pablo Pavón Mariño

More information

Optimizing the quality of scalable video streams on P2P Networks

Optimizing the quality of scalable video streams on P2P Networks Optimizing the quality of scalable vieo streams on PP Networks Paper #7 ASTRACT The volume of multimeia ata, incluing vieo, serve through Peer-to-Peer (PP) networks is growing rapily Unfortunately, high

More information

Baring it all to Software: The Raw Machine

Baring it all to Software: The Raw Machine Baring it all to Software: The Raw Machine Elliot Waingol, Michael Taylor, Vivek Sarkar, Walter Lee, Victor Lee, Jang Kim, Matthew Frank, Peter Finch, Srikrishna Devabhaktuni, Rajeev Barua, Jonathan Babb,

More information

MODULE V. Internetworking: Concepts, Addressing, Architecture, Protocols, Datagram Processing, Transport-Layer Protocols, And End-To-End Services

MODULE V. Internetworking: Concepts, Addressing, Architecture, Protocols, Datagram Processing, Transport-Layer Protocols, And End-To-End Services MODULE V Internetworking: Concepts, Aressing, Architecture, Protocols, Datagram Processing, Transport-Layer Protocols, An En-To-En Services Computer Networks an Internets -- Moule 5 1 Spring, 2014 Copyright

More information

Coupon Recalculation for the GPS Authentication Scheme

Coupon Recalculation for the GPS Authentication Scheme Coupon Recalculation for the GPS Authentication Scheme Georg Hofferek an Johannes Wolkerstorfer Graz University of Technology, Institute for Applie Information Processing an Communications (IAIK), Inffelgasse

More information

On the Placement of Internet Taps in Wireless Neighborhood Networks

On the Placement of Internet Taps in Wireless Neighborhood Networks 1 On the Placement of Internet Taps in Wireless Neighborhoo Networks Lili Qiu, Ranveer Chanra, Kamal Jain, Mohamma Mahian Abstract Recently there has emerge a novel application of wireless technology that

More information

Software Reliability Modeling and Cost Estimation Incorporating Testing-Effort and Efficiency

Software Reliability Modeling and Cost Estimation Incorporating Testing-Effort and Efficiency Software Reliability Moeling an Cost Estimation Incorporating esting-effort an Efficiency Chin-Yu Huang, Jung-Hua Lo, Sy-Yen Kuo, an Michael R. Lyu -+ Department of Electrical Engineering Computer Science

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Inuence of Cross-Interferences on Blocked Loops: to know the precise gain brought by blocking. It is even dicult to determine for which problem

Inuence of Cross-Interferences on Blocked Loops: to know the precise gain brought by blocking. It is even dicult to determine for which problem Inuence of Cross-Interferences on Blocke Loops A Case Stuy with Matrix-Vector Multiply CHRISTINE FRICKER INRIA, France an OLIVIER TEMAM an WILLIAM JALBY University of Versailles, France State-of-the art

More information

Enabling Rollback Support in IT Change Management Systems

Enabling Rollback Support in IT Change Management Systems Enabling Rollback Support in IT Change Management Systems Guilherme Sperb Machao, Fábio Fabian Daitx, Weverton Luis a Costa Coreiro, Cristiano Bonato Both, Luciano Paschoal Gaspary, Lisanro Zambeneetti

More information

Indexing the Edges A simple and yet efficient approach to high-dimensional indexing

Indexing the Edges A simple and yet efficient approach to high-dimensional indexing Inexing the Eges A simple an yet efficient approach to high-imensional inexing Beng Chin Ooi Kian-Lee Tan Cui Yu Stephane Bressan Department of Computer Science National University of Singapore 3 Science

More information

Robust PIM-SM Multicasting using Anycast RP in Wireless Ad Hoc Networks

Robust PIM-SM Multicasting using Anycast RP in Wireless Ad Hoc Networks Robust PIM-SM Multicasting using Anycast RP in Wireless A Hoc Networks Jaewon Kang, John Sucec, Vikram Kaul, Sunil Samtani an Mariusz A. Fecko Applie Research, Telcoria Technologies One Telcoria Drive,

More information

Introduction to Computing and Systems Architecture

Introduction to Computing and Systems Architecture Introduction to Computing and Systems Architecture 1. Computability A task is computable if a sequence of instructions can be described which, when followed, will complete such a task. This says little

More information

Experion PKS R500 Migration Planning Guide

Experion PKS R500 Migration Planning Guide Experion PKS R500 Migration Planning Guie EPDOC-XX70-en-500E May 2018 Release 500 Document Release Issue Date EPDOC-XX70- en-500e 500 0 May 2018 Disclaimer This ocument contains Honeywell proprietary information.

More information

CellMR: A Framework for Supporting MapReduce on Asymmetric Cell-Based Clusters

CellMR: A Framework for Supporting MapReduce on Asymmetric Cell-Based Clusters CellMR: A Framework for Supporting MapReduce on Asymmetric Cell-Based Clusters M. Mustafa Rafique 1, Benjamin Rose 1, Ali R. Butt 1 1 Dept. of Computer Science Virginia Tech. Blacksburg, Virginia, USA

More information

Performance Evaluation of a High Precision Software-based Timestamping Solution for Network Monitoring

Performance Evaluation of a High Precision Software-based Timestamping Solution for Network Monitoring 181 Performance Evaluation of a High Precision Software-base Timestamping Solution for Network Monitoring Peter Orosz, Tamas Skopko Faculty of Informatics University of Debrecen Debrecen, Hungary e-mail:

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Adaptive Load Balancing based on IP Fast Reroute to Avoid Congestion Hot-spots

Adaptive Load Balancing based on IP Fast Reroute to Avoid Congestion Hot-spots Aaptive Loa Balancing base on IP Fast Reroute to Avoi Congestion Hot-spots Masaki Hara an Takuya Yoshihiro Faculty of Systems Engineering, Wakayama University 930 Sakaeani, Wakayama, 640-8510, Japan Email:

More information

Comparison of Methods for Increasing the Performance of a DUA Computation

Comparison of Methods for Increasing the Performance of a DUA Computation Comparison of Methos for Increasing the Performance of a DUA Computation Michael Behrisch, Daniel Krajzewicz, Peter Wagner an Yun-Pang Wang Institute of Transportation Systems, German Aerospace Center,

More information

State Indexed Policy Search by Dynamic Programming. Abstract. 1. Introduction. 2. System parameterization. Charles DuHadway

State Indexed Policy Search by Dynamic Programming. Abstract. 1. Introduction. 2. System parameterization. Charles DuHadway State Inexe Policy Search by Dynamic Programming Charles DuHaway Yi Gu 5435537 503372 December 4, 2007 Abstract We consier the reinforcement learning problem of simultaneous trajectory-following an obstacle

More information

All About the Cell Processor

All About the Cell Processor All About the Cell H. Peter Hofstee, Ph. D. IBM Systems and Technology Group SCEI/Sony Toshiba IBM Design Center Austin, Texas Acknowledgements Cell is the result of a deep partnership between SCEI/Sony,

More information

A multiple wavelength unwrapping algorithm for digital fringe profilometry based on spatial shift estimation

A multiple wavelength unwrapping algorithm for digital fringe profilometry based on spatial shift estimation University of Wollongong Research Online Faculty of Engineering an Information Sciences - Papers: Part A Faculty of Engineering an Information Sciences 214 A multiple wavelength unwrapping algorithm for

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Optimal Routing and Scheduling for Deterministic Delay Tolerant Networks

Optimal Routing and Scheduling for Deterministic Delay Tolerant Networks Optimal Routing an Scheuling for Deterministic Delay Tolerant Networks Davi Hay Dipartimento i Elettronica olitecnico i Torino, Italy Email: hay@tlc.polito.it aolo Giaccone Dipartimento i Elettronica olitecnico

More information

arxiv: v2 [cs.dc] 8 Feb 2018

arxiv: v2 [cs.dc] 8 Feb 2018 SEVENTH FRAMEWORK PROGRAMME THEME ICT-2013.3.4 Avance Computing, Embee an Control Systems arxiv:1801.08761v2 [cs.dc] 8 Feb 2018 Execution Moels for Energy-Efficient Computing Systems Project ID: 611183

More information

PART 2. Organization Of An Operating System

PART 2. Organization Of An Operating System PART 2 Organization Of An Operating System CS 503 - PART 2 1 2010 Services An OS Supplies Support for concurrent execution Facilities for process synchronization Inter-process communication mechanisms

More information

Politehnica University of Timisoara Mobile Computing, Sensors Network and Embedded Systems Laboratory. Testing Techniques

Politehnica University of Timisoara Mobile Computing, Sensors Network and Embedded Systems Laboratory. Testing Techniques Politehnica University of Timisoara Mobile Computing, Sensors Network an Embee Systems Laboratory ing Techniques What is testing? ing is the process of emonstrating that errors are not present. The purpose

More information

A Buffered-Mode MPI Implementation for the Cell BE Processor

A Buffered-Mode MPI Implementation for the Cell BE Processor A Buffered-Mode MPI Implementation for the Cell BE Processor Arun Kumar 1, Ganapathy Senthilkumar 1, Murali Krishna 1, Naresh Jayam 1, Pallav K Baruah 1, Raghunath Sharma 1, Ashok Srinivasan 2, Shakti

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1 Memory technology & Hierarchy Caching and Virtual Memory Parallel System Architectures Andy D Pimentel Caches and their design cf Henessy & Patterson, Chap 5 Caching - summary Caches are small fast memories

More information

Automation of Bird Front Half Deboning Procedure: Design and Analysis

Automation of Bird Front Half Deboning Procedure: Design and Analysis Automation of Bir Front Half Deboning Proceure: Design an Analysis Debao Zhou, Jonathan Holmes, Wiley Holcombe, Kok-Meng Lee * an Gary McMurray Foo Processing echnology Division, AAS Laboratory, Georgia

More information

Improving Spatial Reuse of IEEE Based Ad Hoc Networks

Improving Spatial Reuse of IEEE Based Ad Hoc Networks mproving Spatial Reuse of EEE 82.11 Base A Hoc Networks Fengji Ye, Su Yi an Biplab Sikar ECSE Department, Rensselaer Polytechnic nstitute Troy, NY 1218 Abstract n this paper, we evaluate an suggest methos

More information

Probabilistic Medium Access Control for. Full-Duplex Networks with Half-Duplex Clients

Probabilistic Medium Access Control for. Full-Duplex Networks with Half-Duplex Clients Probabilistic Meium Access Control for 1 Full-Duplex Networks with Half-Duplex Clients arxiv:1608.08729v1 [cs.ni] 31 Aug 2016 Shih-Ying Chen, Ting-Feng Huang, Kate Ching-Ju Lin, Member, IEEE, Y.-W. Peter

More information

Questions? Post on piazza, or Radhika (radhika at eecs.berkeley) or Sameer (sa at berkeley)!

Questions? Post on piazza, or  Radhika (radhika at eecs.berkeley) or Sameer (sa at berkeley)! EE122 Fall 2013 HW3 Instructions Recor your answers in a file calle hw3.pf. Make sure to write your name an SID at the top of your assignment. For each problem, clearly inicate your final answer, bol an

More information

Divide-and-Conquer Algorithms

Divide-and-Conquer Algorithms Supplment to A Practical Guie to Data Structures an Algorithms Using Java Divie-an-Conquer Algorithms Sally A Golman an Kenneth J Golman Hanout Divie-an-conquer algorithms use the following three phases:

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Solution Representation for Job Shop Scheduling Problems in Ant Colony Optimisation

Solution Representation for Job Shop Scheduling Problems in Ant Colony Optimisation Solution Representation for Job Shop Scheuling Problems in Ant Colony Optimisation James Montgomery, Carole Faya 2, an Sana Petrovic 2 Faculty of Information & Communication Technologies, Swinburne University

More information

CISC 879 Software Support for Multicore Architectures Spring Student Presentation 6: April 8. Presenter: Pujan Kafle, Deephan Mohan

CISC 879 Software Support for Multicore Architectures Spring Student Presentation 6: April 8. Presenter: Pujan Kafle, Deephan Mohan CISC 879 Software Support for Multicore Architectures Spring 2008 Student Presentation 6: April 8 Presenter: Pujan Kafle, Deephan Mohan Scribe: Kanik Sem The following two papers were presented: A Synchronous

More information

E2EM-X4X1 2M *2 E2EM-X4X2 2M Shielded E2EM-X8X1 2M *2 E2EM-X8X2 2M *1 M30 15 mm E2EM-X15X1 2M *2 E2EM-X15X2 2M

E2EM-X4X1 2M *2 E2EM-X4X2 2M Shielded E2EM-X8X1 2M *2 E2EM-X8X2 2M *1 M30 15 mm E2EM-X15X1 2M *2 E2EM-X15X2 2M Long-istance Proximity Sensor EEM CSM_EEM_DS_E Long-istance Proximity Sensor Long-istance etection at up to mm enables secure mounting with reuce problems ue to workpiece collisions. No polarity for easy

More information

E2EQ-X10D1-M1TGJ 0.3M

E2EQ-X10D1-M1TGJ 0.3M Spatter-resistant Proximity Sensor EEQ CSM_EEQ_DS_E_9_ Spatter-resistant Fluororesincoate Proximity Sensor Superior spatter resistance. Long Sensing-istance s ae for sensing istances up to mm. Pre-wire

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

DATA PARALLEL FPGA WORKLOADS: SOFTWARE VERSUS HARDWARE. Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose

DATA PARALLEL FPGA WORKLOADS: SOFTWARE VERSUS HARDWARE. Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose DATA PARALLEL FPGA WORKLOADS: SOFTWARE VERSUS HARDWARE Peter Yiannacouras, J. Gregory Steffan, an Jonathan Rose Ewar S. Rogers Sr. Department of Electrical an Computer Engineering University of Toronto

More information

Preamble. Singly linked lists. Collaboration policy and academic integrity. Getting help

Preamble. Singly linked lists. Collaboration policy and academic integrity. Getting help CS2110 Spring 2016 Assignment A. Linke Lists Due on the CMS by: See the CMS 1 Preamble Linke Lists This assignment begins our iscussions of structures. In this assignment, you will implement a structure

More information

A Measurement Framework for Pin-Pointing Routing Changes

A Measurement Framework for Pin-Pointing Routing Changes A Measurement Framework for Pin-Pointing Routing Changes Renata Teixeira Univ. Calif. San Diego La Jolla, CA teixeira@cs.ucs.eu Jennifer Rexfor AT&T Labs Research Florham Park, NJ jrex@research.att.com

More information

Wireless Sensing and Structural Control Strategies

Wireless Sensing and Structural Control Strategies Wireless Sensing an Structural Control Strategies Kincho H. Law 1, Anrew Swartz 2, Jerome P. Lynch 3, Yang Wang 4 1 Dept. of Civil an Env. Engineering, Stanfor University, Stanfor, CA 94305, USA 2 Dept.

More information