Multi-Channel Neural Spike Detection and Alignment on GiDEL PROCStar IV 530 FPGA Platform


UNIVERSITY OF CALIFORNIA, LOS ANGELES
Multi-Channel Neural Spike Detection and Alignment on GiDEL PROCStar IV 530 FPGA Platform
Aria Sarraf (SID: )
12/8/2014

Abstract

In this report I present a prototype design for the GiDEL PROCStar IV 530 FPGA platform that is capable of processing 192 simultaneous channels of neural data for spike detection and alignment. The emphasis of this design is to support the maximum number of neural channels while attaining high overall throughput. This report explains the factors that limit the number of channels and how a single-channel throughput of 32 Mbps is achieved in a 192-channel configuration. It also details the contributing factors in latency and the methodologies for minimizing it. Furthermore, it demonstrates how this system provides a framework for implementing a complete spike-sorting algorithm.

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Project Goals
  1.3 GiDEL PROCStar IV 530 FPGA Platform
2 Functionality of Current Prototype
3 Implementation
  3.1 Software
  3.2 Hardware
  3.3 Number of Channels
  3.4 Memory Multiplexor
  3.5 Memory Control
  3.6 Packet Size
  3.7 Detection
  3.8 Alignment
  3.9 Scaling the Number of Channels
4 Results
  4.1 Accuracy
  4.2 Performance
  4.3 Utilization
5 Future Work
6 Conclusion
References

1 Introduction

1.1 Motivation

The detection of neural spikes is a technical challenge that is essential for analyzing many types of brain function. Neural spike detection begins by extracting neural activity via an electrode, which measures the activity as a voltage. This voltage is then sampled at rates in the kHz range and quantized to a fixed number of bits per sample. However, these signals often contain a large amount of background noise, which makes it difficult to accurately identify the neural spikes. Therefore, digital signal-processing algorithms are required to overcome this difficulty.

With high sampling rates and resolutions, neural recordings produce a tremendous amount of data, which places a high demand on computational resources. While it is possible to process this data in software on a general-purpose CPU, the processing rate is slow (0.94 Mbps [1]). By instead using digital processors designed for neural data, namely field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs), computation times can be reduced by up to three orders of magnitude [1].

1.2 Project Goals

The goal of this report is to demonstrate the performance capabilities of the PROCStar IV 530 for multi-channel neural spike detection and alignment. The emphasis of the current design prototype is to support the maximum number of neural channels while achieving high performance in the following procedures:

- transferring data from the host to the FPGA platform
- memory allocation during FPGA run-time
- transferring data back from the FPGA platform to the host

This design includes a spike detection block and an alignment block as a framework for future development. Choosing a sophisticated algorithm for each of these blocks is not the intent of this project; they are primarily used to model the latencies and verify proper functionality of the system.

1.3 GiDEL PROCStar IV 530 FPGA Platform

Using a high-capacity, high-speed FPGA platform is paramount to reducing the computation time of spike detection and alignment. The platform used for this project is the GiDEL PROCStar IV 530. This platform is an 8-lane PCIe hosted system with 4 Altera Stratix IV 530 FPGAs, capable of running at system speeds of up to 300 MHz. The system is equipped with a total of 18 GB of DDR2 memory (16 GB of external memory and 2 GB of on-board memory), which is more than sufficient for the needs of this application.

Figure 1: System overview of the FPGA platform

GiDEL also offers a hardware-software integration application called ProcWizard, which simplifies project development. This software allows for automatic generation of PCI drivers and internal buses. Every item in the design, such as memories, registers, and modules, can be defined in the application, which then automatically generates a template with the corresponding hardware and software design files. These hardware and software files make up an interface unit that handles the host communication protocol for the system. As shown in Figure 1, all text and arrows in blue are elements implemented by the developer; GiDEL ProcWizard automatically configures all other elements.

ProcWizard also offers several GiDEL IP cores that use on-board memory to create large delay lines, advanced memory controllers, and controllers for transferring data between sub-designs. These IP cores include ProcMegaDelay, ProcMultiPort, and ProcMegaFIFO. ProcMultiPort was used extensively in this project and provides efficient usage of the memory banks: it effectively converts on-board memory into a true multi-port memory, allowing up to 16 ports (each with a different port width) to run simultaneously on each memory bank.

2 Functionality of Current Prototype

The input data requirements for this design are based on Sarah Gibson's report [1]. The current prototype requires each sample to be a 16-bit word (8 integer bits and 8 fractional bits), stored in a binary file. The integer bits are signed and can represent values from -128 to 127, while the fractional bits are unsigned and provide a resolution of 1/256.
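To make the sample format concrete, the following minimal C++ sketch shows one way a signed Q8.8 value could be encoded and decoded. The function names are hypothetical; the actual quantization in this project is performed by the MATLAB script described below.

#include <cstdint>
#include <cmath>

// Hypothetical helpers illustrating the 16-bit fixed-point sample format
// (signed Q8.8: 8 integer bits, 8 fractional bits). Not part of the
// prototype itself, which receives data already quantized by MATLAB.

int16_t encode_q8_8(double v) {
    // Scale by 2^8 and round; saturate to the representable range.
    double scaled = std::round(v * 256.0);
    if (scaled > 32767.0)  scaled = 32767.0;   // +127.996...
    if (scaled < -32768.0) scaled = -32768.0;  // -128.0
    return static_cast<int16_t>(scaled);
}

double decode_q8_8(int16_t s) {
    return static_cast<double>(s) / 256.0;     // resolution of 1/256
}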

Each binary file is treated as a separate channel. A MATLAB script provided by Sarah Gibson quantizes, filters, and converts the data to this format. After the data is processed in MATLAB, the resulting binary files are placed in the input directory of the FPGA spike detection/alignment platform. The number of files in the directory corresponds to the number of channels that the system will be configured for. The user then configures two parameters for the spike detection and alignment algorithm: the spike threshold and the spike width. An additional parameter selects whether the output is written to a single file or split into multiple files. After the program executes, the resulting data from the FPGA is transferred to a .dat file on the host PC. This data contains the timestamp of each aligned spike, the spike's peak value, and the channel from which the spike came.

3 Implementation

3.1 Software

The software for this prototype is developed in C++. GiDEL provides the necessary libraries for C++ applications to communicate with the FPGA via the PCIe bus. These libraries contain the application programming interface (API) to perform hardware initialization, load FPGA designs from the host's hard disk drive, and configure registers on the FPGA. The ProcWizard application generates a header file containing the offsets for the user-defined registers and memories, which can be used in conjunction with the GiDEL libraries.

The most critical feature of the C++ application is direct memory access (DMA). GiDEL provides two methods for DMA: simple mode and effective mode, each with its own tradeoffs. Simple mode executes an entire DMA operation in a single function call; it is called simple because it needs no preparation stages or additional variables. The cost of this simplicity is software overhead in the allocation and release of OS resources, which degrades the transfer rate. Effective mode is faster than simple mode but requires manually creating handles and buffers, performing the transfer, and releasing the resources. Since effective mode enables higher throughput, it is used in this prototype.
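The structure of an effective-mode transfer can be sketched as follows. This is only an illustration of the prepare/transfer/release pattern described above; every gidel_* identifier is a hypothetical placeholder (stubbed so the sketch is self-contained), not GiDEL's actual API, which is documented in the PROCStar IV data book [2].

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for the vendor DMA API.
struct gidel_dma_handle { void* buf; size_t len; };

gidel_dma_handle* gidel_dma_prepare(void* buf, size_t len) {
    // Real effective-mode setup would pin the buffer and allocate
    // OS resources once, up front.
    return new gidel_dma_handle{buf, len};
}
void gidel_dma_write(gidel_dma_handle* h, uint64_t fpga_addr) {
    std::printf("DMA %zu bytes -> FPGA @ 0x%llx\n", h->len,
                static_cast<unsigned long long>(fpga_addr));
}
void gidel_dma_release(gidel_dma_handle* h) { delete h; }

void send_packet(uint8_t* packet, size_t bytes, uint64_t bank_addr, int reps) {
    gidel_dma_handle* h = gidel_dma_prepare(packet, bytes);  // pay setup once
    for (int i = 0; i < reps; ++i)
        gidel_dma_write(h, bank_addr);   // per-transfer OS overhead stays low,
                                         // unlike simple mode's one-call-does-all
    gidel_dma_release(h);
}

The point of the pattern is that the preparation and release costs are amortized over many transfers, which is exactly the tradeoff against simple mode described above.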

3.2 Hardware

Figure 2: Block diagram of the processing elements on the FPGA

The current prototype operates as several distinct blocks, as shown in Figure 2. While the signals on the block diagram refer to a single channel, they extend to a multi-channel configuration as well. The design blocks start after the DMA is complete and GLBL_EN is toggled from 0 to 1. The memory multiplexor first combines all of the single-channel data into one block of multi-channel data in memory. After this step, the memory control sends raw data to the detector. When the detector finds a spike, the aligner is triggered to start. After processing the next spike_width samples, the aligner sends the spike's timestamp and peak value to the memory control block, which stores this data in the SRAM. Once all of the samples are processed, the host PC retrieves the data from the SRAM via DMA.

3.3 Number of Channels

One of the vital goals for this prototype is to simultaneously process as many channels of neural data as possible. Understanding the limitations on the maximum number of channels is necessary when designing a sound architecture for the program. The maximum number of channels is limited by the throughput of the memory banks. The throughput of each memory bank depends on the memory clock frequency, the word size of the memory bank, and the access rate efficiency. The access rate efficiency accounts for the hardware overhead associated with the data transfer and is reported in the GiDEL data book [2]. The throughput is calculated as:

  Throughput = maximum memory clock frequency × word width × access rate efficiency
             = 666 MHz × 8 bytes × 75% = 3.996 GB/s

Once the throughput of the memory banks is known, the operating clock frequency of the prototype and the word size of the neural data can be used to calculate the number of channels. Given that each channel occupies 16 bits of data and the configured clock frequency is 125 MHz, the maximum number of channels that a memory bank can process is 16.
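This arithmetic can be reproduced in a few lines of C++ (a sketch only; the constants are the ones stated above, with the bank count taken from the next paragraph):

#include <cstdio>

int main() {
    // Per-bank throughput from above, rounded to 4 GB/s as in the report.
    const double bank_bps = 4e9 * 8;      // 32 Gb/s per memory bank
    // Each channel consumes one 16-bit sample per 125 MHz clock cycle.
    const double chan_bps = 125e6 * 16;   // 2 Gb/s per channel
    const int    banks    = 12;           // memory banks on the platform
    const int per_bank = static_cast<int>(bank_bps / chan_bps);
    std::printf("%d channels per bank, %d channels total\n",
                per_bank, per_bank * banks);  // prints: 16, 192
    return 0;
}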

Since each memory bank can support 16 channels and there are 12 memory banks, the platform can support a total of 192 channels:

  Maximum # of channels = Memory throughput / (clock frequency × input word size) × (# of memory banks)
                        = 4 GB/s / (125 MHz × 16 bits) × 12 = 192 channels

3.4 Memory Multiplexor

One of the significant challenges in this design is how to route the binary data from the host PC to the FPGA and back, with minimal cost in latency, while supporting 192 channels of data. How the data is managed is primarily dictated by the capabilities of ProcMultiPort and the bandwidth of the SRAM. As mentioned earlier, ProcMultiPort enables each memory bank to support up to 16 ports. These ports can be constructed to access the memory either randomly or sequentially. Random access causes major degradation of the overall throughput [2], so all configurations in the prototype use sequential access mode. Note that each port on the host PC can be configured up to 128 bits wide, while each port on the FPGA can be configured up to 256 bits wide.

The following paragraphs compare three ProcMultiPort configurations. The first two have varying latencies and channel counts, and their shortcomings motivate the third configuration, which was used for the prototype design. Assuming the timestamp of each spike is 32 bits, all configurations use a 64-bit write port on the FPGA that carries the spike timestamp, the spike value at that timestamp, and the channel of the timestamp. Since more than one aligned spike can arrive on any given clock cycle, the data must be buffered to share that port; otherwise, the system would need sixteen 64-bit write ports to handle the case where every channel produces an aligned spike on the same clock cycle. This is not feasible in this system, which is why buffering with a single 64-bit write port is used. Assuming the binary data is transferred directly from the host PC without additional processing, one possible configuration, shown below, uses the 16 available ports:

Configuration #1

  System   Port size (bits)   Read/Write   # of ports   Function
  Host     128                Read         1            DMA
  Host     128                Write        1            DMA
  FPGA     64                 Write        1            Detection/Alignment
  FPGA     16                 Read         13           Detection/Alignment

  ProcMultiPort utilization (used/total): 16/16

This configuration allows for a maximum of 13 channels per memory bank, which falls short of the 16-channel target described earlier. Another way to maximize the number of channels is to group, or multiplex, the input data before sending it to the FPGA. This effectively turns multiple single-channel buffers into one multi-channel buffer in memory. By grouping the data, the FPGA can use one 256-bit read port to read all of the channels simultaneously. This leads to the configuration shown below:

Configuration #2

  System   Port size (bits)   Read/Write   # of ports   Function
  Host     128                Read         1            DMA
  Host     128                Write        1            DMA
  FPGA     64                 Write        1            Detection/Alignment
  FPGA     256                Read         1            Detection/Alignment

  ProcMultiPort utilization (used/total): 4/16

While this configuration appears reasonable and intuitive, it adds a major latency cost when the host PC multiplexes the data. This strategy requires additional memory allocation on the host PC and extra processing time to move the data from multiple single-channel buffers into one multi-channel buffer. The added latency for the C++ program to multiplex 16 channels of 8 MB each was measured at 737 ms. When this process is repeated for all 12 memory banks on the FPGA, the added latency is 11.8 seconds, which reduces the effective throughput of a single channel to 5.36 Mbps and severely bottlenecks the system.

The approach used in the current prototype combines the two configurations discussed above. Instead of the host PC multiplexing the data, the FPGA implements this process. Since the FPGA processes data much faster than the host PC, latency is dramatically reduced compared to the second approach, and the 16-channel goal per memory bank is still met. This design requires ProcMultiPort to be configured in the following way:

Configuration #3

  System   Port size (bits)   Read/Write   # of ports   Function
  Host     128                Read         1            DMA
  Host     128                Write        1            DMA
  FPGA     64                 Write        1            Detection/Alignment
  FPGA     128                Read         2            Detection/Alignment
  FPGA     128                Write        1            Multiplexor
  FPGA     16                 Read         8            Multiplexor

  ProcMultiPort utilization (used/total): 14/16
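Both the host-side grouping of Configuration #2 and the FPGA multiplexor of Configuration #3 perform the same underlying word-level interleave. The following minimal C++ sketch models that reordering on the host side (the version measured at 737 ms per bank above); the function name and use of std::vector are illustrative only:

#include <cstdint>
#include <vector>

// Interleave N single-channel sample buffers into one multi-channel buffer:
// output word i*N + c holds sample i of channel c, so one wide sequential
// read on the FPGA side returns one sample from every channel at once.
std::vector<int16_t> interleave(const std::vector<std::vector<int16_t>>& chans) {
    const size_t n_chan = chans.size();
    const size_t n_samp = chans.empty() ? 0 : chans[0].size();
    std::vector<int16_t> out(n_chan * n_samp);
    for (size_t i = 0; i < n_samp; ++i)        // for each sample index...
        for (size_t c = 0; c < n_chan; ++c)    // ...gather across channels
            out[i * n_chan + c] = chans[c][i];
    return out;
}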

The main difference in this configuration is the set of ports used for multiplexing data on the FPGA. The two 128-bit read ports are functionally equivalent to one 256-bit port but are configured this way because of the design of the multiplexor. Ideally, the port containing the multiplexed data would be 256 bits wide, but that would require sixteen 16-bit read ports, for a total of 21 ports, exceeding the 16-port maximum. Instead, the 128-bit write port and the eight 16-bit read ports are run twice over separate addresses in the FPGA memory. This process is shown in Figure 3.

Figure 3: Data flow in the memory multiplexor

The additional latency added by multiplexing the data on the FPGA for 192 channels, each with 8 MB, was measured to be 160 ms. Thus, the effective throughput of a single channel during this process is 400 Mbps. Compared to the same process in software (Configuration #2), the proposed configuration achieves a 74x improvement in latency.

3.5 Memory Control

The purpose of the memory control block is to read new raw data into the system and to write the spike timestamp, spike value, and channel number out to FPGA memory. It also synchronizes all other blocks so that the multi-channel data have consistent relative timestamps.

One of the main challenges overcome in the current prototype was how to write out all of the data during multi-channel operation. As mentioned earlier, it is not feasible to dedicate a write port to every channel, so a single 64-bit write port with buffering is used. Buffering the output requires an extra FIFO on each channel, which benefits the system in two major ways. First, it reduces the number of utilized ports in each memory bank, since multiple channels can share the same port. Second, it reduces the overall latency of the system. Because the multi-channel data is read in at the maximum bandwidth of the memory bank, the reader must pause whenever a write operation is performed, which adds latency. If no buffer were used, and a one-port-per-channel scheme were possible, the read port would constantly be halting, with a latency that varies depending on how spread apart the spikes are from channel to channel; if all the spikes arrived at the same time, the read port would only pause for one clock cycle, since the write ports need only one clock cycle to run. Buffering operates on the same principle: once the buffers are filled, the write port can push out all of the data while pausing the reader for only one clock cycle.

Each channel in the memory control block has a dedicated buffer of fifteen 48-bit registers, each storing a spike timestamp (32 bits) and a spike peak value (16 bits). On every clock cycle, the memory control block waits for any one of the aligners to complete an execution. After receiving the done signal, it stores the aligned data in the corresponding channel's buffer and increments that buffer's array counter. Once any buffer becomes full, the memory control block pauses the read port and transfers the data from all of the buffers to the memory banks. After the array counter on every buffer is reset, the read port resumes.
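The flush-on-full policy can be modeled in software as follows. This is a behavioral sketch, not the Verilog implementation; the entry count and field widths follow the text above, and the class and type names are illustrative.

#include <cstdint>
#include <array>

// Behavioral model of the per-channel output buffers in the memory control
// block: fifteen 48-bit entries (32-bit timestamp + 16-bit peak), drained
// through one shared write port whenever any channel's buffer fills.
struct SpikeEntry { uint32_t timestamp; int16_t peak; };

template <int N_CHANNELS>
class MemoryControlModel {
    std::array<std::array<SpikeEntry, 15>, N_CHANNELS> buf_{};
    std::array<int, N_CHANNELS> count_{};   // per-buffer array counter
public:
    // Called when an aligner asserts its done signal.
    void on_spike(int ch, uint32_t ts, int16_t peak) {
        buf_[ch][count_[ch]++] = SpikeEntry{ts, peak};
        if (count_[ch] == 15) flush_all();  // any full buffer triggers a flush
    }
private:
    void flush_all() {
        // Models pausing the read port, draining every buffer through the
        // single 64-bit write port, and resetting the array counters.
        for (int ch = 0; ch < N_CHANNELS; ++ch) {
            for (int i = 0; i < count_[ch]; ++i) {
                /* write {timestamp, peak, ch} to the memory bank */
            }
            count_[ch] = 0;
        }
    }
};

For example, MemoryControlModel<16> would model the controller for one fully populated memory bank.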

3.6 Packet Size

The upper limit on the packet size of each channel is determined by the capacity of the smallest memory bank, the number of channels, and the design of the memory multiplexor block. In the context of this project, a packet is defined as the amount of data per channel that the host transfers to the FPGA before the FPGA processes it; each packet contains data pertaining only to that channel. The upper limit on the packet size of this system is:

  Packet size = SRAM size / (# of channels × 2) = 512 MB / (16 × 2) = 8 MB

The packet size is limited by the size of the smallest memory bank so that all channels occupy the memory space equally on each bank. One consequence of the memory multiplexor block is that it occupies the top half of each memory bank; the data packets of each channel are therefore placed in the bottom half of the bank.

3.7 Detection

The purpose of the detection block is to determine whether the current sample is a spike. A simple algorithm is used: if the current sample is greater than the user-defined threshold, a spike is detected. While the algorithm is simple, a separate Verilog module is dedicated to it, decoupling it from the other blocks. This makes it easy to incorporate other detection schemes into the design.
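In software terms, the detector reduces to a single predicate. The C++ model below is illustrative only (the deployed detector is a Verilog module); wrapping the test behind a function object mirrors how the dedicated module lets other detection schemes be swapped in:

#include <cstdint>
#include <functional>

// Model of the detection block: a sample is a spike candidate when it
// exceeds the user-defined threshold. A different Detector can be
// substituted without touching the rest of the pipeline.
using Detector = std::function<bool(int16_t)>;

Detector make_threshold_detector(int16_t threshold) {
    return [threshold](int16_t sample) { return sample > threshold; };
}

For example, make_threshold_detector(30) reproduces the threshold setting used in the tests of Section 4.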

3.8 Alignment

Figure 4: Single-channel raw data showing which samples the detector/aligner begin evaluating

The purpose of the aligner is to report a timestamp indicating the position of a common characteristic of a spike. In the current prototype, the aligner aligns each spike to its maximum value. The aligner works on a window of samples after the detector registers a threshold crossing. Figure 4 shows an example plot of the raw data; the bold red overlay indicates the points that the aligner will evaluate. The aligner operates according to the following scheme:

1. Start if a spike is detected and the aligner is not already evaluating a spike.
2. Store the first timestamp in initial_timestamp.
3. Store the first value and timestamp in spike_peak_pos and spike_timestamp, respectively.
4. Receive a new sample and increment the sample timestamp.
5. Check whether the new sample is greater than spike_peak_pos. If it is, store the new sample value and the corresponding timestamp.
6. Repeat from step 4 until (timestamp - initial_timestamp) == spike_width.
7. Set spike_valid high so the memory control block can grab the peak value and timestamp.
8. Repeat from step 1.

One of the advantages of this alignment method is that it adds no latency to the system: every sample sent to the aligner is processed on the same clock cycle. However, this method requires that the raw data contain no overlapping spikes. If two spikes appear within spike_width samples, the aligner will choose the spike with the larger peak value. Therefore, to avoid misinterpreting any spikes, it is necessary to choose a spike width setting most representative of the data.
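The scheme above translates almost directly into a per-sample software model. The following C++ sketch is an illustrative rendering of steps 1 through 8, folding in the threshold test from Section 3.7; it is a model, not the Verilog source:

#include <cstdint>

// Per-sample model of the aligner: one sample in per call, peak tracked
// over a spike_width window after a threshold crossing.
class AlignerModel {
    bool     active_ = false;    // currently evaluating a spike?
    uint32_t initial_ts_ = 0;    // timestamp of the threshold crossing
    uint32_t ts_ = 0;            // running sample timestamp
    int16_t  spike_peak_pos_ = 0;
    uint32_t spike_timestamp_ = 0;
public:
    // Returns true (spike_valid) when a peak/timestamp pair is ready.
    bool step(int16_t sample, int16_t threshold, uint32_t spike_width,
              uint32_t* out_ts, int16_t* out_peak) {
        ++ts_;                                      // step 4: new sample arrives
        if (!active_) {
            if (sample > threshold) {               // step 1: detector fires
                active_ = true;
                initial_ts_ = ts_;                  // step 2
                spike_peak_pos_ = sample;           // step 3
                spike_timestamp_ = ts_;
            }
            return false;
        }
        if (sample > spike_peak_pos_) {             // step 5: track the maximum
            spike_peak_pos_ = sample;
            spike_timestamp_ = ts_;
        }
        if (ts_ - initial_ts_ == spike_width) {     // step 6: window complete
            *out_ts = spike_timestamp_;             // step 7: spike_valid
            *out_peak = spike_peak_pos_;
            active_ = false;                        // step 8: rearm
            return true;
        }
        return false;
    }
};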

3.9 Scaling the Number of Channels

Scaling from 1 channel to 192 channels was possible because of the architectural design of the memory multiplexor and memory control blocks. After validating proper functionality of a single channel, the number of channels was increased to 16, the maximum per memory bank. Each memory bank has one memory multiplexor, one memory controller, 16 detector modules, and 16 aligner modules. The number of channels was then increased to 48, the maximum supported on each IC. Finally, the design was duplicated on every IC, for a total of 192 channels.

4 Results

The input to the following tests consisted of 16-bit samples recorded at a kHz-range sampling rate. There were 192 input files, each 8 MB in size. The detector threshold and the aligner spike width were set to fixed values (the threshold was 30, as noted in Section 4.1).

4.1 Accuracy

To determine whether the program finds spikes properly, I manually recorded where the peaks were before running the program, by visually inspecting the spikes on a plot, and compared them against the output file generated by the prototype. In all cases where a spike exceeded the threshold value of 30, the system accurately reported the timestamp of the maximum value of the peak.

4.2 Performance

The metrics for this prototype involved measuring latency and calculating throughput. The following latencies and throughputs were measured:

- TX latency: time to transfer the sample data from the host's RAM to the FPGA's RAM
- FPGA processing latency: time for the FPGA to read from its memory banks, detect and align the spikes, and write back to its memory banks
- RX latency: time to transfer the spike data from the FPGA's RAM to the host's RAM
- Total latency: the sum of the TX, FPGA processing, and RX latencies
- Throughput of each channel: the per-channel throughput in the corresponding multi-channel configuration
- Throughput of all channels: the aggregate throughput of all channels
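As a sanity check on the per-channel figures reported in Table 1 below, the throughputs follow directly from the packet size and the total latency (a C++ sketch; the latencies are taken from the table):

#include <cstdio>

int main() {
    const double packet_bits = 8.0 * 8e6;    // 8 MB per channel, in bits
    // Total latencies from Table 1 (48- and 192-channel configurations).
    const double total_s[2]  = {0.659, 2.027};
    const int    channels[2] = {48, 192};
    for (int i = 0; i < 2; ++i) {
        double per_chan_mbps = packet_bits / total_s[i] / 1e6;  // ~97 and ~32
        std::printf("%3d ch: %.1f Mbps/channel, %.0f Mbps aggregate\n",
                    channels[i], per_chan_mbps, per_chan_mbps * channels[i]);
    }
    return 0;
}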

Table 1: Performance metrics

  Number of channels            48          192
  File size / channel           8 MB        8 MB
  Packet size / channel         8 MB        8 MB
  TX latency                    447 ms      1789 ms
  FPGA processing latency       202 ms      203 ms
  RX latency                    10 ms       35 ms
  Total latency                 659 ms      2027 ms
  Throughput of each channel    97 Mbps     32 Mbps
  Throughput of all channels    4662 Mbps   6062 Mbps

Table 1 shows the performance of the prototype against these metrics. The bottleneck of the system is clearly the DMA transfer from the host PC to the FPGA: in the 192-channel configuration, TX accounts for 88% of the total latency. While this may seem significant, the data rate during the transfer was approximately 6.86 Gb/s, which is impressive by today's standards and roughly matches the benchmark reported in GiDEL's data book [2]. The bus for the DMA transfer is PCI Express with 8 active lanes.

The per-channel throughput was benchmarked at 97 Mbps for the 48-channel configuration and 31.6 Mbps for the 192-channel configuration. Since the input data rate at which each channel's samples are recorded is 0.44 Mbps, this prototype processes data at 71x the input rate when all 192 channels are used and at 220x the input rate when 48 channels are used.

4.3 Utilization

This design occupies less than a fifth of the resources available on the FPGA: 14% logic utilization, 8% combinational adaptive look-up tables (ALUTs), under 1% memory ALUTs, and 11% dedicated logic registers. The remaining resources allow for the design of more complex detection and alignment schemes. Moreover, future work on spike-sorting algorithms can take advantage of the memory ALUTs when creating multiple clusters.

Table 2: FPGA resource utilization

  Logic utilization           14%
  Combinational ALUTs         34,304 / 424,960 (8%)
  Memory ALUTs                816 / 212,480 (<1%)
  Dedicated logic registers   48,310 / 424,960 (11%)

5 Future Work

Figure 5: Block diagram of the processing elements for a spike-sorter system

Achieving multi-channel spike sorting is essential for uniquely identifying each spike. Because each neuron's signal may be shared across multiple channels, the current prototype will detect multiple spikes from the same neuron. This prototype was designed with the intention that a spike-sorting module will be developed on it in the future. Figure 5 shows a block diagram of such a system. It is similar to Figure 2, except that a spike sorter block is added between the aligner and the memory control block. The spike sorter will retrieve multi-channel spike timestamps and peak values and determine which cluster each spike came from. This information will be sent to the memory control block and, eventually, to the host.

6 Conclusion

This report has demonstrated that the GiDEL PROCStar IV 530 is capable of processing data at 71x the input data rate when all 192 channels are used and at 220x the input data rate when 48 channels are used. It has explained why 192 neural channels was chosen as the target for the design and how that target was achieved. It has also detailed the challenges in the design process, particularly in memory allocation, and how they were overcome. While the current prototype focuses on multi-channel neural processing and its performance, the main ambition for this system is to support spike sorting, and the prototype provides the necessary framework to do so. Essentially, the system can be treated as a black box: it provides the necessary inputs for a spike sorter (spike timestamp and peak value) and a means of transferring the resulting spike-sorting outputs to the host.

References

[1] Sarah Gibson, Jack W. Judy, and Dejan Marković, "An FPGA-based platform for accelerated offline spike sorting," Journal of Neuroscience Methods, Volume 215, Issue 1, 30 April 2013, Pages 1-11.

[2] GiDEL, PROCStar IV Data Book.

[3] Michael Lewicki, "A review of methods for spike sorting: the detection and classification of neural action potentials," Network: Computation in Neural Systems, Volume 9, 1998.
