
USING A PCI SCHEDULER AND A DYNAMIC THRESHOLD TO ENHANCE A HIGH SPEED READOUT SYSTEM

A THESIS SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF HAWAIʻI IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN ELECTRICAL ENGINEERING

JUNE 2005

By Bin Wei

Thesis Committee:
Nancy Reed, Chairperson
Ying-Fei Dong
Anna Hać

We certify that we have read this thesis and that, in our opinion, it is satisfactory in scope and quality as a thesis for the degree of Master of Science in Electrical Engineering.

THESIS COMMITTEE

Chairperson

Acknowledgements

This research was funded by the High Energy Accelerator Research Organization (KEK). Belle is an experiment at KEK, an international collaboration from 13 countries (Australia, Austria, China, Germany, India, Japan, Korea, Poland, Russia, Slovenia, Switzerland, Taiwan, and the U.S.A.).

Abstract

Data acquisition systems are products and/or processes used to collect information to document or analyze some phenomenon. The Readout system is one of the most important parts of a data acquisition system. The Peripheral Component Interconnect (PCI) bus is widely used in current Readout systems because of its high efficiency and low cost. In this thesis, we focus on software design to improve PCI throughput. The new Readout system software is designed and developed based on a blackboard communication architecture and multi-agent theory. A PCI scheduler distributes the PCI resources by dynamically updating the priorities of the agents. A dynamically updated threshold is also used to continually adjust the packet size and increase throughput. Our experimental results show that the agent-based Readout software can dramatically improve the system throughput.

TABLE OF CONTENTS

Acknowledgements
Abstract
List of Figures
List of Tables
1. Introduction
2. Background
   2.1 The Belle Data Acquisition System
   2.2 The Current Readout Subsystem
   2.3 Data Acquisition Process
   2.4 Linux Scheduler Strategy
3. The Bottleneck of the Readout System
   3.1 Performance Without a Network
   3.2 Performance With a Network
4. PCI Scheduler and Dynamic Threshold Design in COPPER
   4.1 Compression
   4.2 Dynamic Interrupt
   4.3 DMA Readout Scheme
   4.4 Evaluation System
   4.5 Single Thread vs. Multi Threads
   4.6 Agent-Based Readout Design
   4.7 Simple PCI Scheduling Strategy
   4.8 PCI Scheduling Strategy
   4.9 Dynamic Threshold Updating
5. Discussion
6. Conclusion
References

LIST OF FIGURES

Figure 2.1 Belle Data Acquisition System
Figure 2.2 COPPER board layout
Figure 2.3 Data transfer process
Figure 3.1 COPPER performance without a network
Figure 4.1 DMA transfer
Figure 4.2 Evaluation system
Figure 4.3 Agent-based design
Figure 4.4 DMA setup
Figure 4.5 Simple PCI scheduler
Figure 4.6 Blackboard communications
Figure 4.7 Agent communication
Figure 4.8 Gathering Agent PCI request scheme
Figure 4.9 Sending Agent PCI request scheme
Figure 4.10 Dynamic priority design
Figure 4.11 Data transfer result
Figure 4.12 Dynamically updating priority
Figure 4.13 PCI interrupt service algorithm
Figure 4.14 Self-checking algorithm
Figure 5.1 Compression rate / throughput

LIST OF TABLES

Table 5.1 Without agent-based design
Table 5.2 Agent-based design, 73 MB/s
Table 5.3 Agent-based design, 77 MB/s
Table 5.4 Agent-based design, FIFO-MEMO 77 MB/s, MEMO-NIC 77 MB/s
Table 5.5 Comparison of single-thread and agent-based multi-thread design
Table 5.6 Throughput of single-thread and agent-based multi-thread design

Chapter 1
Introduction

Data acquisition systems (DAQ), as the name implies, are products and/or processes used to collect information to document or analyze some phenomenon. Typically, there are three common parts: a detector, a readout section, and storage. The detector generates analog signals that describe outside environmental changes; the readout converts the analog signals to digital signals and saves them in storage. Common uses for DAQs include environmental monitoring and the detection of high-energy particles, for example. The readout performance directly affects the entire system's performance.

The readout system is the bridge in the DAQ system between the detectors and storage. If this bridge is too narrow to carry the incoming data, there can be large data losses. The processing frequency of analog-to-digital conversion is in the picosecond (10^-12 second) range. This requires a fast transfer method to match the increasing speed of data generation.

In the HEP (High Energy Physics) area, most readout systems use a Peripheral Component Interconnect (PCI) system bus because of its high efficiency and low cost compared to crate-embedded processors [19]. But the PCI bus also connects peripheral components, and it is very slow compared to memory accesses: only 33 MHz with 32 bits in standard mode. To improve PCI throughput, researchers can choose the more recent 66 MHz, 64-bit bus [21]. They can also use Bus-Mastering DMA (Direct Memory Access) mode to transfer data. DMA can dramatically reduce PCI overhead [19][20], but Bus-Mastering DMA needs some cycles for setup. Therefore, the study concluded that the longer the burst, the higher the performance, due to reduced DMA overhead. Unfortunately, previous work has not addressed how to set a suitable DMA packet size, even though both huge and tiny DMA packets can reduce the system's performance (see Chapter 4). Previous work has also not addressed how to schedule PCI resources when two or more tasks need to use the PCI bus. Such issues do not need to be considered for slow-speed equipment, because a standard PCI bus is fast enough. But in a high-speed DAQ system, we have to address these issues to make PCI transfer without data loss possible.

In this thesis, we focus on how to improve a Readout system's throughput in software by dynamically adjusting packet sizes in a Belle Data Acquisition System (Belle DAQ). The DAQ system is used for high-energy physics experiments at the High Energy Accelerator Research Organization in Japan (KEK/Japan). The software design is the most important factor in enhancing the throughput of the Readout system, but hardware improvements that increase the performance of the software are also discussed. The Readout system is designed based on a multi-agent architecture that has been used in many manufacturing systems [10][11], and a blackboard communication architecture [17]. Our system has six agents: a gathering agent, a processing agent, a sending agent, a scheduler agent, a command agent, and a monitor agent. Among these, the scheduler agent's task is to fairly and efficiently allocate PCI resources to the other agents when they request use of the PCI bus at the same time. A priority value is associated with each agent. This value is updated (using bonus and penalty points) during system operation.

Another important issue that we discuss is how to control the dynamic interrupt by setting a threshold (packet size register). A small threshold causes more DMA (Direct Memory Access) overhead on the PCI bus. A large threshold causes a flood of data and an idle bus. Since we can predict neither the frequency of incoming data nor the data size, we need to dynamically adjust the threshold.

The rest of this thesis is organized as follows. Chapter 2 introduces the background of this project, from the entire data acquisition system to the related Linux scheduler strategy. Chapter 3 analyzes and demonstrates the bottleneck in the Readout system. In Chapter 4, a PCI scheduler and a dynamic threshold strategy are used in the Readout system to improve the throughput. We compare the results of a design with no agents to our agent-based design in Chapter 5. In the conclusion, Chapter 6, test results show a curve that compares the expected system throughput of our agent-based design and the current (non-agent-based) design. We found that our agent-based design can dramatically improve the throughput of the next generation Belle DAQ system.

Chapter 2
Background

2.1 The Belle Data Acquisition System

Belle DAQ consists of five major subsystems: a Detector, an Event Readout System, an Event Builder, a Trigger Control, and a Master Control. A simple system diagram is shown in Figure 2.1.

Figure 2.1 Belle Data Acquisition System overview

The original Belle DAQ Readout subsystem bus is a FASTBUS [22], which has a low transfer rate and is the bottleneck in the entire system. To meet the current requirements on the trigger rate and data transfer rate, the FASTBUS is replaced by a PCI bus. The PCI bus used in next generation DAQ systems has a faster data transfer rate of 125 Mbytes/second (32 bit, 33 MHz) without DMA overhead. By analyzing the results from an experimental PCI Readout system, we confirm that the bottleneck of the next generation DAQ system is still the PCI bus. The competition between the NIC (Network Interface Card) sending data to the Ethernet and the Readout operations decreases the system performance.

The stable trigger rate of the current Belle DAQ is 250 Hz [23], and the maximum transfer rate is 10 Mbytes/second. With the increasing need for faster and greater data acquisition, the current sample rate and transfer rate are no longer sufficient. One of the next generation requirements proposed by KEK/Japan is a new system with a trigger rate of at least 10 kHz, significantly higher than the trigger rate of the current system (250 Hz). A huge jump in the trigger rate produces a dramatic increase in the data transfer rate required in the Readout subsystem.

Figure 2.1 shows the entire Belle DAQ system. One Readout subsystem runs in parallel with one Detector at a designed rate of 40 kHz. All the subsystems use the TDC Readout, which is controlled by a standardized scheme with PMC (PCI Mezzanine Cards). The trigger information is provided by the SVD, CDC, TOF, and ECL, and processed by the GDL to make a trigger decision. The decision is distributed by the Sequence Control System via a Timing Distributor Module in each subsystem VME. The readout data is transferred through a switch network to the Event Builder, where the detector-wise parallel data is reordered into event-wise parallel data and shipped to each node of the Online Farm. The events that pass through the Farm are stored in the Mass Storage system and eventually written to tape. All the subsystems are controlled centrally by the Master Control. The signal transfer is done either through a fast reflective memory network or through a conventional TCP/IP network. Both networks provide an experiment-wide shared memory in order to store the useful run-related and environment-related information for control and monitoring purposes.

2.2 The Current Readout Subsystem

The new high-density DAQ system includes crates, baseboards, daughter cards for front-end A/D (analog to digital) and T/D (time to digital) conversion, and back-end communication baseboards for data transfer and timing control. The crate is capable of holding size 9U Euro-cards and extension connectors. The COPPER (Common Pipelined Platform for Electronics Readout) board is comprised of a local bus and a standard PCI bus. The local bus's sequencer is connected to the front-end daughter cards via event FIFOs (First In, First Out buffers), and the standard PCI bus is set up by a PMC processor unit. A data transfer module, which is connected to the event building system through Ethernet, and a trigger control unit, which communicates with the central timing controller, are installed on the back-end communication card connected to the rear end of the baseboard. Figure 2.2 shows a block diagram of the COPPER board. There will be thousands of such boards working together in the final system.

COPPER is a 9U-VME sized board. It is equipped with four slots for FINESSE (Front-end Instrumentation Entity for Sub-detector Specific Electronics) modules, readout FIFOs, and three PMC slots connected to the PCI bus, as shown in figure 2.2. A CPU module (embedded system) for data compression resides in one of the three PMC slots.

Figure 2.2 COPPER board layout

2.3 Data Acquisition Process

Figure 2.3 Data transfer process

The detected analog signals are first put into one of the four FINESSE (Front-end Instrumentation Entity for Sub-detector Specific Electronics) cards from the Detector, and the signals are digitized in the FINESSE card. Upon receipt of a level-1 trigger from a trigger module, the FINESSE module pushes a package of data into the readout Event FIFOs (first-in, first-out queues) on the COPPER board. Each Event FIFO is connected to the local bus as shown in figure 2.2. The processor module collects data from the Event FIFOs and transfers them to the main memory through a local-PCI bus bridge, the PCI9054 [8], using DMA mode. A sequencer on the local bus counts the number of words in each Event FIFO associated with every trigger signal, both to provide the event size to the processor module and to signal the level-1 trigger when the FIFOs get full.

In the processor module, there is a two-step transfer. The first step is fetching data from all Event FIFOs into main memory through the PCI bus, and the second is forwarding data to the NIC through the PCI bus, as shown in figure 2.3. The processor module uses the Linux operating system.

2.4 Linux Scheduler Strategy

The scheduling algorithm of traditional Unix operating systems must fulfill several conflicting objectives: fast process response time, good throughput for background jobs, avoidance of process starvation, reconciliation of the needs of low- and high-priority processes, and so on [6]. The scheduling policy of the Linux kernel is based on a time-sharing technique and on ranking processes according to their priority. Here we skip the complicated scheduling algorithms used to derive the current priority of a process; see Robert Love's Linux Kernel Development [7] for details. The end result is that each process is associated with a value that denotes how important it is for it to be assigned the CPU next.

In Linux, process priority is dynamic [6]. The scheduler keeps track of processes and adjusts their priorities periodically. Processes that have been denied the use of the CPU for a long time interval are boosted by dynamically increasing their priority. Correspondingly, processes running for a long time are penalized by decreasing their priority. In a Linux system, processes fall into one of three classes:

Interactive processes: These processes interact constantly with their users and spend a large amount of time waiting for user operations (key presses and mouse operations). When input is received, the process must be swapped in quickly, or the user will feel that the system is not responding. Typical programs include command shells, text editors, Internet browsers, and so on [6].

Batch processes: These processes do not need any user interaction, hence they often run in the background, and they are often penalized by the scheduler. Typical batch programs include compilers, database search engines, web servers, and scientific computations [6].

Real-time processes: These processes have very high priorities. Such processes should never be blocked by lower-priority processes; they should have a short response time and, most importantly, the response time should have a minimum variance and a guaranteed maximum. Typical real-time programs are video and sound applications, robot controllers, and programs that collect data from physical sensors, like our physics collector [6].

The CPU quantum duration approved by the Linux kernel is critical for system performance: it should be neither too long nor too short. If the quantum duration is too short, the system overhead caused by task switches becomes excessively high. For instance, suppose that a task switch requires 10 milliseconds; if the quantum duration is also set to 10 milliseconds, then at least 50% of the CPU cycles will be dedicated to task switching. If the quantum duration is too long, processes no longer appear to be executing concurrently. For instance, suppose that the quantum is set to five seconds; each runnable process makes progress for about five seconds, then it stops for a very long time (typically, five seconds times the number of running processes).
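To make the quantum tradeoff concrete, the short C program below computes the fraction of CPU cycles lost to task switching for several quantum lengths. It is an illustration added for this discussion, assuming the 10-millisecond switch cost used in the example above; it is not a measurement of our system.

    #include <stdio.h>

    /* Overhead fraction = switch_time / (switch_time + quantum). */
    int main(void) {
        double switch_ms  = 10.0;                   /* assumed cost of one task switch */
        double quanta_ms[] = {10.0, 100.0, 5000.0}; /* sample quantum lengths */
        for (int i = 0; i < 3; i++) {
            double overhead = switch_ms / (switch_ms + quanta_ms[i]);
            printf("quantum %6.0f ms -> %4.1f%% of CPU cycles spent switching\n",
                   quanta_ms[i], 100.0 * overhead);
        }
        return 0;   /* the 10 ms quantum yields 50%, matching the example above */
    }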

Chapter 3
The Bottleneck of the Readout System

3.1 Performance Without a Network

From the previous discussion, we can see that the PCI bus is the bottleneck of the entire system and that PCI transfer overhead dramatically affects the PCI transfer rate. To verify this, we set up a test bed using an experimental FINESSE card that can generate virtual digital data, and used a self-trigger module to supply pulses to the digital experiment card. We increase the trigger rate to make the experiment card increase its data-generating rate, in order to find the maximum throughput of this system and determine which part is the key factor limiting system performance.

Figure 3.1 COPPER performance without a network (accepted vs. input trigger rate at 416 bytes/ev/ADC-module; the required and typical trigger rates are marked)

The default event size of the experiment card is 416 bytes per trigger signal. We use an evaluation FINESSE card that constantly generates data when the input triggers come in. As the input trigger rate increases, the rate of data generation (accepted triggers) also increases. We keep increasing the input trigger rate until the generated data saturates the transfer to the CPU module. The test results of manipulating the trigger rate are shown in Figure 3.1. When we increase the input trigger rate to 40 kHz, the accepted trigger rate fails to increase further. The measured result on the embedded system is as follows: user time is ~2%, system time is ~20%, and idle time is ~78%. The large fraction of idle time indicates that the CPU is mostly waiting while the PCI bus is working at full capacity. The ideal PCI throughput is 32 bits x 33 MHz = 125 MB/s, while the real PCI throughput in the experiment is:

    Throughput_PCI = Size_event x Rate_event x Number of modules
                   = 416 bytes/ADC-module/ev x 40 kHz x 4 ADC modules ~= 67 MB/s.

3.2 Performance With a Network

Because the data collected from a FIFO must be transferred out of the COPPER system via an Ethernet connection, we need to analyze the performance of the network. The experiment uses a 33 MHz, 32-bit PCI bus as the system bus, a 10/100 Base-T network interface card in the COPPER system, and Linux as the operating system. The estimated NIC transfer rate is 11 MB/s, the maximum value of the 10/100 Base-T card, at the maximum accepted trigger rate of 32 kHz. CPU user time (CPU time of user processes) is ~5%, CPU system time (CPU time of kernel processes) is 31%, and CPU idle time (no process using the CPU) is 64%. We found that the NIC on the embedded system is also a bottleneck if we do not compress data in the CPU module to greatly reduce the output data size.
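The bottleneck arithmetic of this chapter can be restated in a few lines of C. The constants are the figures used above (416-byte events, 40 kHz triggers, four ADC modules, a 125 MB/s ideal PCI rate, and an 11 MB/s 10/100 Base-T NIC); the program itself is only a worked example, not part of the test software.

    #include <stdio.h>

    int main(void) {
        double event_size = 416.0;  /* bytes/ADC-module/event         */
        double trigger_hz = 40e3;   /* accepted trigger rate (40 kHz) */
        int    modules    = 4;      /* ADC modules per COPPER board   */

        double readout = event_size * trigger_hz * modules;   /* bytes/s, ~67 MB/s */
        printf("readout load: %.0f MB/s of the 125 MB/s ideal PCI budget\n",
               readout / 1e6);
        /* Every byte crosses the PCI bus twice (FIFO->memory, memory->NIC),
           and an uncompressed stream would also need ~67 MB/s of NIC
           bandwidth, far above the 11 MB/s 10/100 Base-T limit. */
        return 0;
    }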

Chapter 4
PCI Scheduler and Dynamic Threshold Design in COPPER

The goal of this project is to improve the data transfer rate from the Event FIFO to the NIC. We improve the throughput from two directions. One is a scheduler that can fairly and efficiently distribute PCI resources to the agents that request them at the same time, so that we reduce or avoid PCI competition. The other is a dynamically updated threshold that adjusts the PCI packet size according to the current system situation to improve the transfer rate. Before we present our PCI scheduler and dynamic threshold design, we describe additional techniques used in our design.

4.1 Compression

As shown in section 3.2, a 10/100 Base-T NIC is slow compared to the high-speed Readout system: the maximum throughput of the 10/100 Base-T NIC is 11 MB/s, while the maximum throughput of the system is as large as 67 MB/s. Therefore a 1G NIC is necessary. However, another problem arises after we add the 1G NIC to the PCI bus. Competing with a faster NIC, the data readout rate through the PCI bus decreases; in the worst case, the readout throughput is cut in half (33 MB/s) if all the uncompressed data goes through the NIC. So upgrading the NIC is not, by itself, a complete solution. Another method that can improve the system throughput is compressing data before it is transferred to the NIC through the PCI bus. A higher compression rate should increase the throughput, because it dramatically reduces the PCI transfer time from CPU to NIC and saves bandwidth for fetching event data. We do not consider designing new compression algorithms; rather, we use existing ones. Without the scheduler algorithm and the dynamically adjusted threshold, however, the system throughput will not be as good as we expect. More details can be found in Chapter 5.

4.2 Dynamic Interrupt

Chapter 2 describes how an interrupt is generated when the size of the event data reaches a threshold value. The DMA setup procedure needs some system cycles, and the overhead is large when only small amounts of data need to be transferred. To reduce this DMA transfer overhead, we designed an interrupt scheme that tells the CPU to fetch data from the Event FIFOs when the size of the event data has reached a specific value. This scheme has been implemented in the firmware design. It uses a register to record the threshold value and a counter to monitor the current data size. The threshold should be set so that DMA transfers use the PCI bus as efficiently as possible. This by itself will not improve system performance enough; we discuss how to dynamically adjust the threshold value during execution in section 4.9.

Section 2.3 described the DMA mode data transfer from the Event FIFOs to the CPU and from the CPU to the NIC. Next we discuss DMA transfer in detail, and then we show our design for efficiently utilizing PCI resources during DMA transfer. A software scheduler on the CPU board assigns the PCI bus for fetching and sending data.
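A minimal sketch of the register-and-counter scheme in section 4.2 follows. The register names are hypothetical stand-ins, since the firmware register map is not reproduced here; the point is only the comparison rule that raises the interrupt.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical mirror of the firmware state: a counter tracks the
       bytes in the Event FIFO, and an interrupt is raised once the
       count reaches the value in the threshold register. */
    static volatile uint32_t fifo_byte_count;  /* bytes currently in the FIFO */
    static volatile uint32_t threshold_reg;    /* interrupt threshold value   */

    static bool fifo_interrupt_due(void) {
        return fifo_byte_count >= threshold_reg;
    }

    void on_fifo_write(uint32_t nbytes) {
        fifo_byte_count += nbytes;
        if (fifo_interrupt_due()) {
            /* assert the PCI interrupt line; the CPU then starts a DMA fetch */
        }
    }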

4.3 DMA Readout Scheme

When we introduced the background of the COPPER system, we introduced a sequencer that counts how much data the FINESSE cards have written into the Event FIFOs. We use the interrupt strategy introduced in section 4.2 to pass the current count of the sequencer to the CPU once it has reached a certain number. After this, the CPU can check how much data is available and initialize a DMA transfer by assigning the source/destination addresses and the data size. The entire DMA transfer procedure is shown in figure 4.1.

Figure 4.1 DMA transfer (readout FIFO and FIFO word counter on the local bus, PLX-9054 local-PCI bridge, RadiSys EPC-6315 CPU: FIFO filled, PCI interrupt, check event size, initiate DMA, DMA data transfer, DMA-over PCI interrupt)

After the Event FIFO-to-CPU DMA transfer finishes, the user program processes the fetched data and forwards it to the NIC. Next we discuss the user program design.

4.4 Evaluation System

Real readout data is not yet available, since neither the hardware design for the Readout system nor the ADC cards are complete. To evaluate our system, we designed simulation software that mimics incoming data to test our design's performance. The entire system is shown in figure 4.2. In the middle of the diagram is a COPPER Readout system. The Test subsystem sends constructed data to a connected FINESSE card on a COPPER board and gets results from the COPPER board through the Ethernet; the data transfer correctness and the system throughput are then calculated. The Monitor subsystem is used for monitoring the system status while it is running. Periodically, the COPPER system sends out system status information, which the Monitor subsystem displays on the user's console. Another function of the Monitor subsystem is to send commands to the COPPER system through its Ethernet port, such as starting, stopping, and resetting components.

Figure 4.2 Evaluation system (the Test system sends and receives test data; the Detector signal enters the ReadOut CPU over PCI; event data goes through the NIC to the Event Builder; status information goes to the Monitor system)

4.5 Single Thread vs. Multi Threads

We can improve the software in light of the hardware improvements now available. The previous test software used a single-thread mode to control this system, when there was little compression in the CPU and no DMA transfer. The advantages of using a multi-threaded architecture include:

1. Transfer and compression threads can work in parallel. Because we use a PCI9054 (PCI bus controller) as a bridge between the local bus and the PCI bus, and a PCI2050 as a PCI-to-PCI bridge, we can use DMA mode to transfer data from the Event FIFOs to the CPU and to forward data to the NIC. This greatly reduces the CPU time needed. During a DMA transfer, the data packaging and compressing threads can work in parallel, because they do not need to use the PCI bus.

2. It simplifies the system design. We use one thread for each specific action, and let the scheduler decide which thread may use the common resources first, according to the scheduling policy and the current situation.

The goal of this scheduler is to keep the PCI resource maximally utilized. It also needs to guarantee that the transfer path is not blocked (if the FIFOs are full, the system has to drop some data). In other words, it needs to keep the potential bottleneck resource (the PCI bus) well utilized. The scheduler should be lightweight, not consuming too much CPU time. According to the above analysis, we implemented the scheduler on a multi-agent system. The scheduler is the core of this system: it decides which agent can use the common and critical resource at a given time, according to the scheduling policy, the current situation, and the environment. Meanwhile, each agent simply tries its best to finish its own work. Figure 4.3 shows this multi-agent design.
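As a rough sketch of this split, the fragment below runs a (simulated) transfer thread and a compression thread in parallel with POSIX threads. The work functions are placeholders invented for illustration, not the project's actual code.

    #include <pthread.h>
    #include <stdio.h>

    /* Toy illustration of section 4.5: while one thread "transfers" data
       over the bus, another compresses data already in memory. */
    static void *transfer_thread(void *arg) {
        (void)arg;
        for (int i = 0; i < 4; i++)
            printf("transfer: DMA block %d on the PCI bus\n", i);
        return NULL;
    }

    static void *compress_thread(void *arg) {
        (void)arg;
        for (int i = 0; i < 4; i++)
            printf("compress: packing event %d in memory\n", i);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, transfer_thread, NULL);
        pthread_create(&t2, NULL, compress_thread, NULL);  /* runs in parallel */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }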

4.6 Agent-Based Readout Design

This system has six agents: a gathering agent, a command agent, a processing agent, a monitor agent, a sending agent, and a scheduler agent. The scheduler agent is the core of the system. The gathering, sending, and command agents send requests to the scheduler agent to get PCI resources. The monitor agent is an independent part. A more detailed description of each agent follows.

Figure 4.3 Agent-based design (the gathering agent collects data into the event data queue; the process agent fills the Ping-Pong buffer; the sending agent forwards event data to the event builder; the gathering, sending, and command agents exchange PCI requests and responses with the scheduler agent; the monitor agent collects buffer status)

Gathering Agent: fetches data from the Event FIFO.

The Gathering Agent is the interface between data acquisition and data processing. After receiving the almost-full interrupt of an Event FIFO from the PCI bus, it asks the scheduler for the right to use the PCI bus. It can then transfer data from the Event FIFO to CPU memory (the event data queue). This is a key step in the entire data transfer procedure: we must make sure not to block the Event FIFO transfer from the FINESSE card. If the gathering agent does not fetch data from the Event FIFO immediately when it is full, the FINESSE card will stop digitizing data into local memory. Another important task that only the gathering agent can do is identifying where the data comes from, because it accesses the Event FIFO and knows its location.

Processing Agent: packetizes data from the Gathering Agent and compresses the packets.

Each Processing Agent gets data from the event data queue, compresses and packetizes it, and then saves it into the sending buffer (the Ping-Pong buffer) for the Sending Agent to transfer. The procedure of the processing agent is shown in the center of figure 4.3. There are 4 FINESSE cards on one COPPER board, and there will be a thousand COPPER boards working concurrently. To synchronize the data transfer, a common L1 trigger generator distributes synchronized triggers to all the COPPER boards, and each COPPER board distributes them to its FINESSE cards. When a FINESSE card digitizes an incoming signal, the firmware of the FINESSE card adds trigger information into the data packet to synchronize the data. The processing agent then formats the original data coming from the event data queue and sends it to the sending buffer.

Compression is a very important task for improving system performance, because this COPPER system is a special-purpose system. We analyze and discuss the compression rate and system performance in Chapter 5. We do not discuss the data compression module itself, because the selection of compression algorithms depends on the data format, which is beyond the scope of this paper. We did not use compression in the last generation, because its data throughput was much less than that of the COPPER system.

Sending Agent: transfers data from the CPU board to the Ethernet card.

We use two same-size memory buffers to implement a Ping-Pong strategy that improves memory access. Normally, a memory area is protected from access by other processes or threads while one process or thread is writing or reading it (an operating system strategy). The Ping-Pong strategy uses two isolated memory areas (a write buffer and a read buffer) for writing and reading, so the reading and writing processes can work at the same time. When the write buffer is full and the read buffer is empty, the pointers of the two buffers are switched. In figure 4.3, the data transfer from the process agent to the sending agent is only a memory operation, and the sending agent uses PCI DMA mode to forward the data. The advantage of the Ping-Pong strategy is therefore that the data transfer from the CPU to the Ethernet card and the internal memory operations can execute concurrently.

Figure 4.4 DMA setup (the CPU programs the PCI controller: set DMA mode, set PCI address, set local bus address, set transfer size in bytes, set direction of transfer, then DMA start)

The Sending Agent is a greedy agent: it always wants to use the PCI bus, even if there is little data to be transferred. The problem is that if the DMA package is too small, it wastes PCI resources (DMA setup needs many cycles) and decreases the performance of the entire system. The DMA setup procedure is shown in figure 4.4. From the other perspective, if we assign too much PCI resource to the sending agent at once, the other agents have no chance to use the PCI bus during the sending agent's transfer period.
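A minimal sketch of the Ping-Pong buffer described above follows, assuming a single writer and a single reader; the buffer size, the names, and the omitted locking are illustrative simplifications.

    #include <stddef.h>

    /* Two same-size buffers: the processing agent writes into one while
       the sending agent reads from the other.  Pointers are switched when
       the read buffer is empty and the write buffer is full, so reading
       and writing can proceed concurrently. */
    #define BUF_SIZE (1 << 20)

    static char ping[BUF_SIZE], pong[BUF_SIZE];
    static char  *write_buf = ping, *read_buf = pong;
    static size_t write_fill = 0, read_avail = 0;

    /* Called when the reader has drained its buffer.  A real
       implementation would guard this exchange with a mutex. */
    void try_swap(void) {
        if (read_avail == 0 && write_fill == BUF_SIZE) {
            char *tmp = write_buf;     /* switch the two buffer pointers */
            write_buf = read_buf;
            read_buf  = tmp;
            read_avail = write_fill;   /* full buffer becomes readable   */
            write_fill = 0;            /* drained buffer becomes writable */
        }
    }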

So we use a scheduler to allocate the limited PCI resources among the three agents (gathering agent, processing agent, and sending agent) according to their priorities and current status. Ideally, we want to utilize the idle time of the gathering agent to let the sending agent transfer data. We discuss the scheduler strategy in detail in section 4.7.

Scheduler Agent: allocates the PCI resource.

The Scheduler Agent is the core of this system. It receives requests for PCI resources from the gathering, sending, and command agents, and assigns the PCI resource to them according to its scheduling policy. We discuss the scheduler agent in detail in the following sections.

Command Agent: receives outside commands and executes them.

The command agent has the highest priority of all the agents. When a user sends a command to start or stop a COPPER system, or another operation, the other actions of COPPER stop and the command action takes control of the entire system.

Monitor Agent: periodically reports the system status.

The Monitor Agent periodically collects the status of the COPPER board and sends it to the monitor system (outside the COPPER system).

4.7 Simple PCI Scheduling Strategy

In the Holonic Scheduler [10], M. Fletcher and R. W. Brennan consider the tradeoff involved in trying to maximize both resource-oriented and order-oriented processing criteria in an integrated scheduler. We considered a similar tradeoff in our PCI scheduling, between PCI resource utilization and robust data detection in the Event FIFOs.

    Scheduler
        while msg = Getmessage()        // Wake up
            UpdateStatus()              // Update agents' status
            if (no agent running) {     // Check that no agent is running
                GetRequest()            // Pick a request, first in first out
                Response()              // Notify the agent
            }
            GoToSleep()                 // Go to sleep
        end

Figure 4.5 Simple PCI scheduler

The algorithm above shows the simple, prototype policy for PCI scheduling. If we simply rely on this first-come-first-served policy, we have the following problems.

(1) Blocking of the data path. If there is a large amount of data to be sent from the sending agent to the NIC, the sending agent holds the PCI bus for a long time. During this period, the gathering agent cannot fetch data from an Event FIFO. Thus, even if the Event FIFO is full, the FINESSE card cannot keep up with the incoming trigger signals. The direct result is a loss of experimental data, and we must avoid any data loss in our data acquisition system.

(2) Waste of the PCI resource. If the sending agent always does its best to send data, even when only a few bytes are available, there is a lot of DMA overhead on the PCI bus, which decreases system throughput. Setting up a DMA channel costs many cycles, to give the PCI controller the source/destination addresses, the transfer mode, and the size of the data. It therefore seems better to transfer as much data as possible at a time. However, if we set a very large threshold for starting a transfer, the idle time of the PCI bus becomes very significant.

The objective of this scheduler is to gather data from the Event FIFOs in a timely manner while providing good throughput on the PCI bus, so that we get better overall system throughput. In a computer system, the critical resource is the CPU, and CPU time-slice allocation is the most important mechanism of a multitasking operating system (Windows, UNIX, and Linux). Through an intensive study of the Linux kernel, we found that the Linux scheduler has many good properties, and the same ideas are suitable for building the PCI scheduler in our system.

4.8 PCI Scheduling Strategy

In this section, we describe our PCI scheduling strategy, including agent communication, priority distribution among the PCI-requesting agents, the definition of the transfer quantum, and how priority values are recalculated to implement dynamic priority.

4.8.1 Communication

The blackboard model is a relatively complex problem-solving model that separates the data (the blackboard), the organization of knowledge (the agents' rules), and the problem-solving behavior (the agents' activity) [17]. Since the first blackboard system, HEARSAY, this model has been widely used in the artificial intelligence area, for example in image understanding [18] and industrial assembly systems [8]. The blackboard is a public information exchange area that allows any agent to publish its own contribution to the partial interpretation and to access public information about the contributions of other agents [18]. In our project, we choose the blackboard as the backbone of communication.

There are two types of communication. One is the resource request and response between the scheduler agent and the other agents; the other is the data transfer between any two of the data operation agents (gathering and processing agent, processing and sending agent). We have implemented three communication structures for managing the PCI resource requests, the event data, and the compressed data transfer: a queue for PCI requests, an event data queue, and a Ping-Pong buffer for data transfer.

Figure 4.6 Blackboard communications (a request queue holding priority/status/time entries for the command, gathering, and sending agents; an event queue; and the Ping-Pong buffer, whose read and write pointers are switched when the read buffer is empty and the write buffer is full)

Each agent that wants to use the PCI bus has to post a PCI request to the request queue and wait for approval. After getting an approving response, the agent can use the PCI resource. A high-priority request gets the PCI resource sooner; if two requests have the same priority, the one with the earlier timestamp is approved first. Other requests wait until some agent releases the PCI resource.

We use a fixed-length array of three-element entries (priority, status, and timestamp) to represent the PCI request queue, as in the top left of figure 4.6, because only three agents share the PCI resource and none can apply for the same resource twice. If an agent applies for the PCI bus again while its first request is still waiting in the queue, we assume the agent has changed its mind, and we update the content and priority of its pending request. For example, suppose the gathering agent sends a low-priority PCI request for checking the Event FIFO (the amount of data in the Event FIFO has not reached the threshold), while the sending agent sends a higher-priority PCI request for forwarding data to the NIC at the same time. The scheduler agent assigns the PCI resource to the sending agent first, because its priority value is the highest in the waiting queue. If, before the sending agent's data transfer is done (while the gathering agent's PCI request is still waiting in the queue), an Event FIFO interrupt comes in (the amount of data in the Event FIFO has reached the threshold), the gathering agent applies for the PCI resource again with a higher priority value, and its PCI request in the waiting queue is replaced with the higher-priority request. After the sending agent is done, the gathering agent, now with the higher priority, gets to use the PCI bus.
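The request queue can be sketched as a fixed three-slot array in C, one slot per PCI-using agent, following the replace-on-duplicate rule just described; the type and function names are invented for illustration.

    #include <time.h>

    /* One slot per agent that may request the PCI bus; each slot holds
       the (priority, status, timestamp) triple from figure 4.6. */
    enum agent_id { AGENT_GATHERING, AGENT_SENDING, AGENT_COMMAND, NUM_PCI_AGENTS };
    enum req_status { REQ_EMPTY, REQ_WAITING, REQ_RUNNING };

    struct pci_request {
        int             priority;
        enum req_status status;
        time_t          stamp;
    };

    static struct pci_request queue[NUM_PCI_AGENTS];

    /* A new request from an agent simply overwrites its pending slot,
       which implements the "agent changed its mind" rule above. */
    void post_request(enum agent_id who, int priority) {
        queue[who].priority = priority;
        queue[who].status   = REQ_WAITING;
        queue[who].stamp    = time(NULL);
        /* ...then raise the signal that wakes the scheduler. */
    }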

We use a heap to store all the data fetched from the Event FIFOs, and a message queue to record the location (in the heap) and size of the data. After compression, the data is saved into the Ping-Pong buffer. The basic idea is that while one buffer is being written by the process agent, the sending agent concurrently reads data from the other buffer and writes it to the NIC through a DMA channel. We thereby implement a pipelined data transfer to improve system performance.

In this blackboard architecture, timely detection of a new request in the request queue is the first problem we need to take care of. If the scheduler constantly polled the request queue for updates, this operation would use up all the CPU resources. To avoid this problem, the scheduler normally sleeps when no request is coming in and no status is being updated, and we use a signal, set by the requesting agents, to wake up the scheduler and inform it that a request has been sent or the current status has been updated. After accessing the blackboard, the scheduler agent resets the signal and goes back to sleep. Figure 4.7 illustrates the request-accepting strategy.

Figure 4.7 Agent communication (the gathering, sending, and command agents exchange PCI requests and responses with the scheduler through the blackboard message queue; when a request or the situation is updated, the scheduler wakes up, accesses the blackboard, resets the signal, and goes back to sleep to wait for the next signal)

4.8.2 Priority Scheduling

Our system involves six agents, three of which use the PCI resource: the Gathering Agent, the Sending Agent, and the Command Agent. The Command Agent is used by the system administrator to reset, stop, or start the system or a FINESSE card. We introduced the Linux kernel scheduling policy in Chapter 2: the Linux scheduler allocates CPU time slices according to the different priorities of the processes. In our system, we use a PCI scheduler to allocate PCI time slices. In light of the similarity of the two tasks, we borrow ideas from Linux scheduling.

First of all, the Linux scheduler classifies processes into three types: real-time processes, batch processes, and interactive processes. Similarly, we classify the COPPER agents that use the PCI bus into three types: the gathering agent, the sending agent, and the command agent.

The gathering agent sends two kinds of PCI requests to the scheduler agent. The first kind is sent when an interrupt (indicating that the threshold value has been reached) comes in. This interrupt PCI request is given a high priority, because we need to guarantee that there is enough space in the Event FIFO for the FINESSE card to transfer data; otherwise, the FINESSE card has to stop working until there is enough space. Therefore, when an (Event FIFO almost full) interrupt reaches the gathering agent, it should get the PCI resource as soon as possible, so that it can immediately transfer data from the Event FIFO to memory. It is like the real-time processes in Linux.

Figure 4.8 Gathering Agent PCI request scheme (an almost-full interrupt produces a high-priority PCI request, an idle check produces a low-priority PCI request; both go to the scheduler, which responds)

The second kind of request has a lower priority than those of the other agents. When the gathering agent has finished a data transfer from the FIFO to the CPU and no new interrupt has come in, it tries to check the data size in the Event FIFO length buffer and decide whether the interrupt threshold needs to be updated to avoid PCI idling. We discuss this in section 4.9.

The Sending Agent forwards compressed and packetized data to the PCI Ethernet card when the PCI resources are available. This type of processing does not need to be real-time, since there are 512 Mbytes of memory we can use to buffer the data temporarily; the sending agent can transfer the data whenever the PCI bus is available. If the packet size is too big, it occupies the PCI bus for a long time and nobody else can use the bus. Moreover, if the transfer speed of the NIC is slower than the speed of the PCI bus and the buffer of the NIC is too small to cache all of the transferred data, the gathering agent has to wait for the NIC transfer to complete before it can use the PCI bus, and the PCI resource is wasted waiting on the NIC. So there is a tradeoff between large and small packet sizes on the PCI bus; we discuss it in the next section. The Sending Agent is like the batch processes. Its PCI requesting procedure is shown in figure 4.9.

Figure 4.9 Sending Agent PCI request scheme (PCI request to the scheduler; response back)

The Command Agent accepts an administrator's command and executes it. This agent is assigned the highest priority. Like the Linux interactive processes, it should get control of the system whenever an administrator wants to operate the system through it.

To give all agents a chance to use the PCI bus, the scheduler dynamically updates the priority value of each agent within a reasonable range. If we used a fixed-priority strategy, the gathering agent, which is assigned a higher priority when an interrupt comes in, would get the PCI resource whenever FIFO data is available, even if the Ping-Pong buffer were almost full (so the sending agent would never have a chance to forward data to the NIC). The gathering agent could then only transfer a small amount of data (a small packet size), because little space would be free in the Ping-Pong buffer, and transferring many small packets in DMA mode dramatically reduces PCI performance. Therefore, we need to recalculate the priority values, letting the sending agent use the PCI bus to forward data to the NIC and leaving enough space for large data transfers. In figure 4.10, we add an UpdatePriority() function to the simple scheduler agent; we explain how the priority is updated in the Recalculating Priority section.

    PriorityScheduler
        while msg = Getmessage()
            UpdateStatus()
            UpdatePriority()
            if (no agent running) {
                GetRequest()            // Highest priority first
                Response()
            }
            GoToSleep()
        end

Figure 4.10 Dynamic priority design

4.8.3 Approved Transfer Quantum

The transfer quantum of the PCI bus is also critical for system performance: it should be neither too long nor too short. If the transfer quantum is too short, the DMA overhead caused by setting up the channel becomes excessively high. If the transfer quantum is too long, the PCI bus no longer appears to be used concurrently. For example, if we give the sending agent 5 seconds to transfer data, then during this period the command agent cannot control the PCI bus and the gathering agent cannot fetch data from the Event FIFOs. Increasing the transfer quantum therefore raises the transfer rate but reduces the system's response sensitivity.

Data flooding is another problem that we have to consider. When a huge chunk of data is suddenly fetched into the system, the sending agent needs to forward it to the Event Builder through the Ethernet as soon as possible. In this period a collision can happen on the Ethernet, because thousands of these systems will be working together, and if data floods happen continually, they dramatically reduce Ethernet performance. Therefore, we need to assign suitable transfer quanta to the gathering agent and the sending agent.

In the Linux system, the scheduler algorithm allocates varying amounts of CPU time slices to the processes. In our system, the scheduler agent allocates PCI time slices by limiting the amount of data an agent can transfer through the PCI bus, and this value is dynamically updated according to the data transfer rate. An experimental result is shown in figure 4.11. We performed this test on a real system set up at the High Energy Accelerator Research Organization (KEK), using the DMA mode of the PCI bus for the data transfer. When the packet size is greater than 1 Kbyte, the transfer rate of the PCI bus increases only slowly; below this point, the transfer rate grows dramatically with increasing packet size. In our system, a typical event size is 416 bytes, so we need to transfer at least two events per packet to get a good data transfer rate. When there is a lot of data to be transferred, we increase the transfer quantum (increasing the transfer rate, decreasing response sensitivity); when there is little data available, we decrease the transfer quantum (increasing threshold sensitivity, decreasing the transfer rate). 1 Kbyte is the key point: below 1K, we should decrease the quantum slowly.
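The quantum adjustment can be sketched as follows; the growth and shrink factors are illustrative choices, with only the 1 Kbyte knee and the two-events-per-packet minimum taken from the measurements above.

    #include <stdint.h>

    #define EVENT_SIZE 416u      /* typical event size, bytes          */
    #define KNEE       1024u     /* 1 KB knee observed in figure 4.11  */

    /* Grow the per-round transfer quantum aggressively when plenty of
       data is waiting; shrink it, slowly below the knee, when little
       data is available. */
    uint32_t next_quantum(uint32_t quantum, uint32_t bytes_waiting) {
        if (bytes_waiting > 2 * quantum) {
            quantum *= 2;               /* plenty of data: favor throughput  */
        } else if (bytes_waiting < quantum / 2) {
            if (quantum > KNEE)
                quantum /= 2;           /* above the knee: shrink freely     */
            else
                quantum -= quantum / 8; /* below 1 KB: decrease slowly       */
        }
        if (quantum < 2 * EVENT_SIZE)   /* keep at least two events/packet   */
            quantum = 2 * EVENT_SIZE;
        return quantum;
    }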

Figure 4.11 Data transfer result (PCI transfer rate vs. packet size; the typical event size is marked)

4.8.4 Recalculating Priority

In section 4.8.2, we described how initial priority values are assigned to the different agents according to the features of their actions. To avoid having high-priority agents permanently occupy the PCI resource, we dynamically adjust the priority values. In other words, bonus and penalty points are given to the agents according to how much time they have used the PCI resource. However, it is very hard to get the exact number of PCI time slices an agent has used, so we use the amount of transferred event data to represent it. The Processing Agent transfers data from the event data queue to the Ping-Pong buffer and compresses it, and before the Sending Agent forwards this compressed data to the NIC, the compressed data is temporarily stored in the Ping-Pong buffer. Therefore, the critical resource on the CPU board is the Ping-Pong buffer. Here we ignore the compression time and the in-memory transfer time, and give one possible form of the priority updating formula. The current priority value is

    P = P_I +/- (R * S_t)^g        (Formula 4.1)

where P_I is the initial priority value, g is the bonus and penalty gradient, R is the compression rate, and S_t is the amount of data held in the Ping-Pong buffer at time t. As discussed before, we give a greater initial priority to the gathering agent. The priority formula of the gathering agent is

    P_gather = P_I(gather) - (R_gather * S_t)^g.

Normally, the data coming from the Event FIFO is plain (uncompressed) data, so R_gather of the gathering agent is 1. The priority formula of the sending agent is

    P_send = P_I(send) + (R_send * S_t)^g.

In these formulas, when the gathering agent is fetching data, S_t grows, since the fetched data ends up in the Ping-Pong buffer. Therefore a penalty of (R_send * S_{t+1})^g - (R_send * S_t)^g is applied to the gathering agent, and at the same time the priority of the sending agent is given the same amount of bonus. After the sending agent forwards data to the NIC, more of the Ping-Pong buffer is free and S_t decreases; therefore a bonus of (R_send * S_t)^g - (R_send * S_{t+1})^g is given to the gathering agent, and the same amount of penalty is given to the sending agent.

The parameter g is the bonus and penalty gradient, and we can control the range of the priority updates with it: when g is larger, the priority is more sensitive to the amount of data in the Ping-Pong buffer. P_I also affects the priority; it defines the initial distance between the different agents. P_I does not need to be recalculated continually like P. It is a practical value that can normally be derived from the Ping-Pong buffer's capacity C and the desired priority distance. R is the compression rate of the transferred data. This Readout system is experimental equipment whose transferred data shows great similarity, so the compression rate is easy to improve by choosing a suitable compression algorithm, and we should take this factor into account.

Figure 4.12 Dynamically updating priority (two plots of gathering and sending priority versus the Ping-Pong buffer occupancy)

On the left side of figure 4.12, P_I is set to 100 for the gathering agent and 0 for the sending agent, and the gradient is 2. On the right side, P_I is set to 10 for the gathering agent and 0 for the sending agent, and the gradient is 1. Both compression rates are 0.5 (compressed data to plain data = 1:2). The curves on the left increase and decrease more quickly than those on the right; the compression rate affects the slope of the priority curve.

The command agent rarely needs the PCI bus but needs a quick response, so we give it the highest initial priority, without priority updating.

The Scheduler Agent also decides the transfer quanta of the other agents. The transfer quantum is a much simpler calculation, because the dynamic priority is already based on the Ping-Pong buffer occupancy and the data compression rate; the transfer quantum is therefore set according to the dynamic priority value. For example, when the priority value of the Sending Agent is large, there is a large amount of data in the Ping-Pong buffer, and we need to give the Sending Agent more transfer quantum. The higher a task's priority, the more transfer quantum it is assigned per round of PCI transfer.
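Formula 4.1 transcribes directly into code. The sketch below uses the constants of the left-hand example of figure 4.12 (initial priorities 100 and 0, gradient 2, R_gather = 1 and R_send = 0.5); the units of S_t are left abstract, as in the text.

    #include <math.h>

    /* P = P_I +/- (R * S_t)^g, with S_t the amount of data currently
       held in the Ping-Pong buffer. */
    double gather_priority(double s_t) {
        const double p_init = 100.0, g = 2.0, r_gather = 1.0;
        return p_init - pow(r_gather * s_t, g);   /* falls as the buffer fills */
    }

    double send_priority(double s_t) {
        const double p_init = 0.0, g = 2.0, r_send = 0.5;
        return p_init + pow(r_send * s_t, g);     /* rises as the buffer fills */
    }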

4.9 Dynamic Threshold Updating

To optimize the PCI bus utilization, we implemented an interrupt mechanism in the PCI bus control firmware, as mentioned in section 4.2. Here we discuss how to dynamically choose an appropriate threshold value to improve PCI throughput. As described in the PCI scheduler section, when the Gathering Agent gets a PCI interrupt, it sends an interrupt-type PCI request to the Scheduler Agent. After getting access permission, it transfers data from the FIFOs to memory, and then enters a waiting period until the next interrupt comes.

As mentioned above, a threshold register controls the interrupt: when the amount of event data in the Event FIFO is equal to or greater than the threshold value, the PCI controller sends an interrupt to the CPU. If this threshold value is too small, the Gathering Agent constantly tries to occupy the PCI bus, and the extra DMA overhead exhausts the PCI bus resource; figure 4.11 showed the relation between the data transfer rate and the data packet size, so we should make the threshold value large. On the other hand, if the threshold is too large, it generates a data flood on the PCI bus that blocks the other agents from using the bus while the PCI resource is occupied. Also, before the data size reaches the threshold value, the PCI bus is idle (wasting PCI) if the command and sending agents do not apply for the PCI resource. To avoid these problems, we need a threshold-updating strategy.

In figure 4.11, we saw that 416 bytes is the typical event size, and we use this value as the minimum threshold. The Event FIFO's capacity is 4 Mbytes. We do not want to block data transfer into the Event FIFOs, so the threshold must be smaller than this value. Moreover, it takes almost 50 ms to transfer 4 Mbytes of data over the PCI bus; in this high-speed system, such a gap is too big to accept, so we choose 2 Mbytes as the maximum threshold.

Having defined the threshold limits, we need an updating formula that matches two requirements. First, because a threshold modification spends some PCI cycles, we should reduce the number of modifications when there is a lot of data to be transferred. Second, we need a sensitive strategy that adjusts the threshold in time when there is little data, since we do not want the PCI bus to be idle for a long time. The threshold updating formula is therefore

    Th_n = Th_{n-1} +/- Th_{n-1}/2,   with 416 bytes <= Th_n <= 2 Mbytes        (Formula 4.2)

The basic idea of this formula is that the gathering agent checks the amount of data waiting for transfer in the Event FIFOs; if it is greater or less than half of the previous threshold value, we update the threshold to one and a half times or half of the previous value, respectively. When there is little data to be transferred, Th_n is small, and the data size reaches the updating point easily, so Th_n is updated very quickly; in other words, more PCI cycles are used on threshold updating. We thus get a sensitive threshold-updating strategy when the amount of transferred data is small. Conversely, when Th_n is in a relatively stable situation (a large threshold value), a small change in the data size cannot trigger a threshold modification. For example, if Th_n is 1 Mbyte, the threshold is only updated when the amount of waiting data crosses 500 Kbytes, which is not easy to reach if there are only a few pulses. We thus get a more stable threshold-updating strategy, and save more PCI resource for data transfer.
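Formula 4.2, with its clamping limits, can be written as a small helper function; the 'grow' flag encodes the comparison against half of the previous threshold described above.

    #include <stdint.h>

    #define TH_MIN 416u                 /* one typical event           */
    #define TH_MAX (2u * 1024 * 1024)   /* 2 Mbyte cap from the text   */

    /* Move the threshold by half of its previous value, clamped to
       [416 B, 2 MB].  'grow' is nonzero when the Event FIFO holds more
       than half the current threshold after a transfer. */
    uint32_t update_threshold(uint32_t th, int grow) {
        uint32_t step = th / 2;
        th = grow ? th + step : th - step;
        if (th < TH_MIN) th = TH_MIN;
        if (th > TH_MAX) th = TH_MAX;
        return th;
    }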

The threshold-updating algorithm in the PCI interrupt service is shown in figure 4.13.

    PCI_Interrupt_Service {
        PCI_Request()
        If Wait(PCI_Response) {
            Event data transfer
            Check(event FIFO length)
            If FIFO_length > Threshold/2
                UpdateThreshold(Threshold * 1.5)
        }
    }

Figure 4.13 PCI interrupt service algorithm

When a PCI interrupt is received from the PCI controller, the gathering agent starts this PCI interrupt service program. A PCI request is the first step, to obtain the PCI resource from the scheduler agent, and the gathering agent then waits for approval. Once approved, the gathering agent knows how much data it can transfer this time; normally, at most the threshold amount of data can be sent at once. The gathering agent then begins to transfer data from the Event FIFOs to memory. Because the Event FIFOs are bidirectional memory chips, data can be written into them while other processes are reading data from them. After the data transfer, the gathering agent updates the threshold value if the value of the Event FIFO length register is greater than half of the threshold value; the threshold is raised to one and a half times its previous value. Next, we illustrate how the threshold value is reduced by the self-checking algorithm in figure 4.14.

    Self_Checking {
        PCI_Request()
        If Wait(PCI_Response) {
            Check(event FIFO length)
            If FIFO_length < Threshold/2
                UpdateThreshold(FIFO_length/2)
        }
    }

Figure 4.14 Self-checking algorithm

Normally, when no PCI interrupt is coming in, the gathering agent stays idle waiting for an interrupt. But to keep the threshold matched to the current situation (and avoid PCI idling), the gathering agent sends a low-priority PCI request to check the values of the Event FIFO length registers while it is idle. If no other agent applies for the PCI bus at the same time, the gathering agent gets the PCI resource and fetches the Event FIFO length registers. If the Event FIFO length is less than half of the current threshold value, the threshold is reduced to half of the Event FIFO length, as shown in figure 4.14.

Chapter 5
Discussion

This chapter discusses the performance improvements achieved, using simulations. In figure 4.11, we showed the relationship between packet size and transfer rate: when the packet size is increased, the transfer rate of the PCI bus improves dramatically. In section 2.3, we noted that this Readout system takes two steps to transfer data from the Event FIFO to the NIC (Event FIFO to memory, and memory to NIC). The system is used only for analyzing the features of new particles, and the data it detects is even and well regulated; therefore the compression rate of the data should be high, which is very helpful for reducing traffic and improving system throughput. Next, we compare the Readout system throughput without agents and with the agent-based design.

First of all, to analyze the relationship between data compression and throughput, we fix the first step (Event FIFO to memory) of the transfer process: the event data fetch rate is fixed at 67 MB/s with a packet size of 416 bytes (a typical event size). In formula 5.1, the system throughput is the amount of data transferred through the system divided by the transfer time:

    Throughput = Data_transfer (MB) / Time_transfer (s)        (Formula 5.1)

If we fetch data for 1 second from the Event FIFO to memory, the transferred data size is Data_transfer = 67 MB, and the transfer time is

Compression rate | FIFO-MEMO (MB/s) | MEMO-NIC (MB/s) | Throughput (MB/s)

Table 5.1 without agent-based design

Table 5.1 gives the analysis results obtained from the simulation results in figure 4.11 together with formulas 5.1 and 5.2, without the agent-based design. Here the data transfer is a single-threaded process that fetches one packet from the event FIFO, compresses the packet, and forwards it to the NIC. The first column of table 5.1 is the compression rate in the processing agent; the second is the transfer rate from the event FIFO to memory; the third is the transfer rate from memory to the network card; and the last column is the throughput of the entire system. We find that the data compression process improves system performance only a little. The reason is that the transfer rate from memory to the NIC decreases dramatically when packets are compressed to a small size (figure 4.11); in other words, the DMA channel setup wastes most of the PCI resources.

The data in tables 5.2 and 5.3 are the results expected when we use the agent-based design together with a ping-pong buffer (figure 4.6) that holds the data to be transferred from memory to the NIC (one packet containing many compressed events). The packet sizes are 1K and 2K, so the transfer rates from memory to the NIC are 73MB/s (table 5.2) and 77MB/s (table 5.3). Compared to table 5.1, we notice that the agent-based design can fully exploit data compression and dramatically improve system throughput.

Compression rate | FIFO-MEMO (MB/s) | MEMO-NIC (MB/s) | Throughput (MB/s)

Table 5.2 agent-based design 73MB/s

Compression rate | FIFO-MEMO (MB/s) | MEMO-NIC (MB/s) | Throughput (MB/s)

Table 5.3 agent-based design 77MB/s

The above analysis shows the benefit of the agent-based design for the data transfer from memory to the NIC. Next, we illustrate the throughput improvement from the agent-based design across the entire system (event FIFO to memory, and memory to NIC).

Compression rate | FIFO-MEMO (MB/s) | MEMO-NIC (MB/s) | Throughput (MB/s)

Table 5.4 agent-based design FIFO-MEMO 77MB/s, MEMO-NIC 77MB/s

Based on formulas 5.1 and 5.2, we obtained the simulation results in table 5.4. We increased the packet size in the first transfer step from 419 bytes (one event per packet) to 4KB (5 events per packet), so the transfer rate from the event FIFO to memory rises to 77MB/s, and the throughput of the entire system is also greatly enhanced; a small sketch of this calculation follows.
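To make the table 5.4 calculation concrete, the following small program (an assumed reconstruction, not the thesis's own simulation code) tabulates formulas 5.1 and 5.2 for a few illustrative compression rates at the 77MB/s rates quoted above:

    /* Assumed reconstruction of the table 5.4 calculation; the compression
       rates are illustrative, the 77MB/s figures come from the text. */
    #include <stdio.h>

    int main(void)
    {
        const double data_mb  = 77.0;   /* first step: FIFO -> memory in 1s */
        const double nic_rate = 77.0;   /* second step: memory -> NIC, MB/s */

        for (int comp = 1; comp <= 5; comp++) {
            double time_s = 1.0 + (data_mb / comp) / nic_rate;  /* Formula 5.2 */
            double thr    = data_mb / time_s;                   /* Formula 5.1 */
            printf("compression rate %d: throughput %.1f MB/s\n", comp, thr);
        }
        return 0;
    }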

The comparison of the system performance without and with the agent-based design is shown in figure 5.1. Once we use the agent-based design in the second transfer step, we get a big jump in system throughput (compare the square-node and round-node curves in figure 5.1). But if we use agents only in the second step and keep one event per packet in the first step, the throughput increases only a little as the second-step transfer rate rises from 73MB/s to 77MB/s (compare the round-node and cross-node curves); the first transfer step is the largest consumer of PCI resources. Finally, when the agent-based design is used in the entire process (increasing the transfer rate in both the first and second steps), the throughput is greatly improved (compare the cross-node and star-node curves). From this analysis, we expect the agent-based design to dramatically improve the throughput of this Readout system.

Figure 5.1 Compression rate / Throughput

To verify the above calculations, we ran a software simulation on the embedded system without DMA transfer, because DMA transfer is not available on our current hardware.
