FPGA DESIGN OF A MULTICORE NEUROMORPHIC PROCESSING SYSTEM. Thesis. Submitted to. The School of Engineering of the UNIVERSITY OF DAYTON


FPGA DESIGN OF A MULTICORE NEUROMORPHIC PROCESSING SYSTEM

Thesis Submitted to The School of Engineering of the UNIVERSITY OF DAYTON In Partial Fulfillment of the Requirements for The Degree of Master of Science in Electrical Engineering

By Bin Zhang

Dayton, Ohio
May, 2016

FPGA DESIGN OF A MULTICORE NEUROMORPHIC PROCESSING SYSTEM

Name: Zhang, Bin

APPROVED BY:

Tarek M Taha, Ph.D., Advisory Committee Chairman, Associate Professor, Electrical and Computer Engineering

Keigo Hirakawa, Ph.D., Committee Member, Assistant Professor, Electrical and Computer Engineering

Eric Balster, Ph.D., Committee Member, Assistant Professor, Electrical and Computer Engineering

John G. Weber, Ph.D., Associate Dean, School of Engineering

Eddy M. Rojas, Ph.D., M.A., P.E., Dean, School of Engineering

ABSTRACT

FPGA DESIGN OF A MULTICORE NEUROMORPHIC PROCESSING SYSTEM

Name: Zhang, Bin
University of Dayton

Advisor: Dr. Tarek M. Taha

Neuromorphic computing architectures have developed rapidly in recent years. The FPGA implementation of the neuromorphic network processor is 3x and 127x faster than an Intel E8400 processor for the edge detection and ECG applications, respectively. Considering resource utilization and system stability, a hardware-controlled communication routing network is a better choice than a time-delay based routing network. The separation of data lines prevents the hardware-controlled communication routing network from turning into a large network.

ACKNOWLEDGMENTS

In this work, the basic static router was designed by our team. The new static routing design and the comparison were carried out by myself. I would like to express my deep appreciation and thanks to my adviser, Professor Tarek M. Taha, who has the attitude and skills of a true scientist. His approach to research was highly inspiring. I would also like to thank my teammates, Yangjie Qi and Hua Chen, for their efforts in this work, and Md Raqibul Hasan, who gave us a great deal of useful advice and help.

To my mentor and dear friends at UD. Thank you for all of your support along the way.

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGMENTS
DEDICATION
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
    1.1 NEURAL PROCESSOR WORK PROCESSING
    1.2 MULTICORE NEUROMORPHIC PROCESSOR
CHAPTER 2 ARCHITECTURE DESIGN OF CORE AND BASIC NETWORK
    2.1 CORE DESIGN
    2.2 ROUTER DESIGN
CHAPTER 3 STATIC ROUTING DESIGN
    3.1 BASIC DESIGN OF STATIC ROUTING
    3.2 STATIC ROUTING DESIGN PROBLEM
    3.3 TIME DELAY BASED ROUTING PROTOCOL ROUTER
    3.4 HARDWARE CONTROLLED COMMUNICATION PROTOCOL ROUTER
CHAPTER 4 EXPERIMENTAL SETUP
CHAPTER 5 RESULTS
    5.1 IMPLEMENTATION OF NEURAL CORE AND NETWORKS
    5.2 COMPARISON OF TWO KINDS OF STATIC ROUTING PROTOCOLS
    5.3 TWO APPLICATIONS RESULTS VERIFICATION
    5.4 PERFORMANCE COMPARISON WITH A RISC PROCESSOR
CHAPTER 6 CONCLUSION
BIBLIOGRAPHY

LIST OF FIGURES

Figure 1.1: Example of the neural network structure
Figure 1.2: Digital design of neuromorphic core
Figure 1.3: Multicore neuromorphic network
Figure 2.1: Data bus
Figure 2.2: State machine
Figure 2.3: Structure of dynamic router
Figure 3.1: Static routing
Figure 3.2: Multi-sources conflict
Figure 3.3: A simple handshaking protocol
Figure 3.4: Hardware controlled communication protocol router signals
Figure 5.1: Example of SignalTab II results

LIST OF TABLES

Table 1.1: Comparison of neuromorphic systems and traditional computing platforms
Table 5.1: Single core FPGA resource utilization
Table 5.2: Multicore FPGA resource utilization
Table 5.3: Dynamic and static network FPGA resource utilization comparison
Table 5.4: Two kinds of routing network FPGA resource utilization
Table 5.5: ECG process time of one pattern data
Table 5.6: ECG process result of one pattern data
Table 5.7: Edge detection result of some random pixels
Table 5.8: Applications throughput

CHAPTER 1
INTRODUCTION

Neuromorphic computing architectures have developed rapidly in recent years. These architectures are well suited to massively parallel applications such as image processing, synthesis (RMS), and pattern recognition. John et al. [1] showed that four applications (an autonomous virtual robot driver, a pong player, virtual digit recognition, and auto-association) work well on a single core containing 1024 axons, and synapses.

1.1 NEURAL PROCESSOR WORK PROCESSING

There are two basic forms of neural networks, feed-forward and feed-back. The feed-back form is used to train the system and modify the weights. The feed-forward form is the most common form, as shown in Figure 1.1.

Figure 1.1: Example of the neural network structure.

The processing of one layer represents the work of one core. The axons are represented by $x_i$, the neurons by $y_j$, and the weight of the $i$-th axon for the $j$-th neuron by $w_{i,j}$. Each axon value is multiplied by its weight for the corresponding neuron, and the activation of the sum of these products is the final value of each neuron. This process can be written as:

$$v_j = \sum_i w_{i,j} x_i + b, \quad (1.1)$$

$$y_j = f(v_j). \quad (1.2)$$

In Equation (1.2), $f$ is the activation function applied to the outputs, normally a sigmoid function,

$$f(x) = \frac{1}{1 + e^{-x}}, \quad (1.3)$$

and $b$ is the bias. The network modifies its weights using this bias. The sigmoid function is used to keep the output value within the range [0, 1].

A neural network layer with $i$ axons and $j$ neurons holds $i \times j$ weights. Because data transfer costs energy, moving a large amount of data is expensive. Instead of moving large amounts of weight data, neural networks store the weight data inside each core, which reduces power consumption compared to other systems.
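As an illustration of Equations (1.1) to (1.3), the following minimal Python sketch evaluates one layer in software. It is not the hardware design: the function names, the example values, and the use of one bias per neuron (rather than the single b written in Equation (1.1)) are assumptions made for the example.

```python
import math


def sigmoid(v):
    """Equation (1.3): map v to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-v))


def layer_forward(x, w, b):
    """Evaluate one layer, i.e. the work of one core.

    x: list of axon inputs x_i
    w: w[i][j] is the weight of axon i for neuron j
    b: list of biases, one per neuron (an assumption; Eq. 1.1 writes a single b)
    """
    outputs = []
    for j in range(len(b)):
        # Equation (1.1): weighted sum of all axon inputs plus the bias
        v_j = sum(w[i][j] * x[i] for i in range(len(x))) + b[j]
        # Equation (1.2): apply the activation function
        outputs.append(sigmoid(v_j))
    return outputs


# Hypothetical layer with 3 axons and 2 neurons (3 x 2 = 6 weights)
x = [1.0, 0.5, -1.0]
w = [[0.2, -0.4],
     [0.6, 0.1],
     [0.3, 0.8]]
b = [0.0, 0.0]
y = layer_forward(x, w, b)
```

In this example the weight array holds 3 x 2 entries, matching the statement that a layer with i axons and j neurons stores i x j weights.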

1.2 MULTICORE NEUROMORPHIC PROCESSOR

The digital design of the core is shown in Figure 1.2. The transmission from pre-synaptic neuron inputs to post-synaptic neuron outputs in a neuromorphic core is similar to signal transmission between nerve cells in the human nervous system. The basic work structure is shown in Figure 1.1.

Figure 1.2: Digital design of neuromorphic core. (Blocks: decoder, digital synaptic memory array, input unit, multiply-add-accumulate units, control unit, activation function, and routing unit. Pre-synaptic neuron inputs arrive over the routing network from other cores; post-synaptic neuron outputs leave to the routing network.)

Neuromorphic processors can be built in an on-chip routing network [2, 3, 4], as shown in Figure 1.3.

Figure 1.3: Multicore neuromorphic network. (A 4x4 grid in which each neural core, NC, is paired with a router, R.)

In Figure 1.3, each neural core is connected to a router, which has connections in four directions to other routers.

To assess the performance of the designed neural cores, we compared their estimated performance and power consumption with currently used high performance processors. Two processor platforms were examined: the six core Intel Xeon X5650 processor and the NVIDIA Tesla M2070 GPGPU. We measured the peak neurons per second throughput of these systems [5].

Table 1.1: Comparison of neuromorphic systems and traditional computing platforms. Columns: configuration, number of chips, chip area (mm^2), % time active, power (W), power density (mW/mm^2), and system power efficiency over Xeon. Rows: memristor core, static random-access memory (SRAM) core, NVIDIA GPGPU, and Intel Xeon processor.

Table 1.1 compares the performance of the four systems considered when evaluating a neural network with 25,600 neurons, with each neuron having 1024 inputs and the full neural network being evaluated at 100,000 iterations/s. The results show that the specialized neuromorphic systems are significantly more power and area efficient than the traditional high performance computing platforms when running neural networks.

In this work we present a field-programmable gate array (FPGA) implementation of the multicore digital neuromorphic processor. The system is evaluated on several applications and is compared with Intel processors. The design of the digital neuromorphic system has a great deal of similarity with the analog memristor neuromorphic system in terms of the control logic in the cores, the inputs and outputs of the cores, and the routing systems. The key difference is in each core's neuron compute circuits (memristor based or SRAM/adder/multiplier based). Hence, implementing the digital design on an FPGA will help determine the peripheral and routing logic for the analog cores as well.

CHAPTER 2
ARCHITECTURE DESIGN OF CORE AND BASIC NETWORK

2.1 CORE DESIGN

As Figure 1.2 shows, the neuromorphic core carries out the whole feed-forward process of the neural network. Each core processes a collection of N neurons, with each neuron having up to M input axons. The input synaptic weights ($w_{i,j}$) are stored in a memory array. These synaptic values are multiplied by the pre-synaptic input values ($x_i$) and are summed in an accumulator. Once the final neural values are generated, they pass through an activation function unit that implements Equation (1.2). The output of the activation unit goes to a routing unit on the core, which looks up the destination of the neuron and sends a packet containing the neuron output and the neuron destination to the on-chip router.

Components function

In order to complete the process, the neuromorphic core is separated into six components: input dispatch, weight memory, calculation, control unit, activation function, and output package. The basic function of each component is given below:

(1) Data bus
The data bus between cores consists of three parts: the valid bit, the address bits, and the data bits, as shown in Figure 2.1.

Figure 2.1: Data bus.

The most significant bit is the valid bit, which indicates whether the data is valid. The address bits contain the coordinate of the next level router and the weight address. For an axon, the data bits are the pre-synaptic input value; for a neuron, the data bits are the output of the activation function.

(2) Input dispatch
This component decides whether the incoming data is useful by testing the valid bit. If the data is valid, it sends the address bits to the weight memory and the data bits to the calculation component and, at the same time, activates the control unit.

(3) Weight memory
This component stores the weight values in a memory array. It receives the address bits from the input dispatch, finds the corresponding weight value, and sends that value to the calculation component.

(4) Calculation
For each neuron, this component carries out the calculation in Equation (1.1). When the calculation finishes, the process pauses until the destination core in the next layer is available to receive data. The calculation results are then sent from each register to the activation function.

(5) Activation function
This component evaluates Equation (1.2). To avoid unnecessary exponent calculations, a look-up table is used. With this method, the number of logic elements required for the calculation is reduced, but the number of registers used to store data is increased.

(6) Output package
This component packages each message into the data bus format: the valid bit, the address bits, which come from the router table, and the data bits, which come from the calculation. The messages are then sent out.

(7) Control unit
This component controls all the components mentioned above and makes them work in order. The control unit includes a state machine and control components which provide the total neuron number and the delay time. The state machine is a Moore machine, as shown in Figure 2.2.
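The look-up table idea used by the activation function component can be sketched in Python as follows. This is only an illustration under stated assumptions: the table depth (256 entries), the covered input range (-8 to 8), and the use of the 16 bit fixed point format with eight fractional bits are choices made for the example, not parameters taken from the design.

```python
import math

FRAC_BITS = 8                  # assumed: 8 fractional bits (16 bit fixed point value)
LUT_SIZE = 256                 # assumed table depth
V_MIN = -8                     # assumed lower edge of the covered input range
STEP_SHIFT = 4                 # each entry covers 16 fixed point steps (1/16 in real terms)


def build_sigmoid_lut():
    """Precompute sigmoid values so no exponent is evaluated at run time."""
    lut = []
    for k in range(LUT_SIZE):
        v = V_MIN + k / 16.0                       # real value at the start of entry k
        s = 1.0 / (1.0 + math.exp(-v))
        lut.append(round(s * (1 << FRAC_BITS)))   # store as fixed point
    return lut


def lut_sigmoid(v_fixed, lut):
    """Approximate the sigmoid of a signed fixed point input by table look-up."""
    index = (v_fixed - V_MIN * (1 << FRAC_BITS)) >> STEP_SHIFT
    index = max(0, min(LUT_SIZE - 1, index))       # saturate outside the table range
    return lut[index]


lut = build_sigmoid_lut()
half = lut_sigmoid(0, lut)     # sigmoid(0) = 0.5, i.e. 128 with 8 fractional bits
```

The trade-off matches the one described for this component: the table replaces the exponent logic with stored values, so the arithmetic cost falls while the storage cost rises.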


Figure 2.2: State machine.

Basically, this state machine contains seven states: IDLE, Input Calculate, Bias Calculate, Calculate Finish, Holding, ALU Read, Output Read and Finish.

2.2 ROUTER DESIGN

Two possible approaches to implementing the routing network are static and dynamic routing. The design of static routing is explored in Chapter 3. In dynamic routing, each core sends out a packet with a destination header. This header is examined by each router the packet passes through in order to direct the packet towards its destination. Dynamic routing is generally resource and power intensive, requiring buffers, a crossbar switch, and a switch allocator per router. The structure of a dynamic router is shown in Figure 2.3.


Figure 2.3: Structure of dynamic router.

In Figure 2.3, there are five input ports and five output ports. Every clock cycle, the data arriving at the five input ports is dispatched to the buffers of its destination directions, and each output port sends out one message.
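The per-cycle behavior described for the dynamic router can be modeled with a short Python sketch. The port names, the packet layout (a destination direction plus a payload), and the first-in first-out buffers are assumptions for illustration; the switch allocator and header decoding of a real router are not modeled.

```python
from collections import deque

# Assumed port names: four neighbor directions plus the local core
PORTS = ["north", "east", "south", "west", "local"]


class DynamicRouter:
    def __init__(self):
        # One buffer per output direction
        self.out_buffers = {port: deque() for port in PORTS}

    def cycle(self, inputs):
        """Simulate one clock cycle.

        inputs: dict mapping an input port to (destination_port, payload);
                ports with no incoming packet are simply absent.
        Returns a dict mapping each output port to the payload it sends,
        or None if its buffer is empty.
        """
        # Dispatch every incoming packet to its destination direction buffer
        for port in PORTS:
            packet = inputs.get(port)
            if packet is not None:
                destination, payload = packet
                self.out_buffers[destination].append(payload)
        # Each output port sends at most one message per cycle
        sent = {}
        for port in PORTS:
            buffer = self.out_buffers[port]
            sent[port] = buffer.popleft() if buffer else None
        return sent


router = DynamicRouter()
first = router.cycle({"north": ("east", "a"),
                      "south": ("east", "b"),
                      "west": ("local", "c")})
second = router.cycle({})
```

Two packets that target the same output in one cycle are both buffered, and the second one leaves a cycle later. This buffering is the source of the extra resources that dynamic routing requires.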

CHAPTER 3
STATIC ROUTING DESIGN

3.1 BASIC DESIGN OF STATIC ROUTING

In static routing, a dedicated connection is set up between a source core and its destination cores. When a particular neural network is mapped onto the multicore system, the communication pattern between the cores becomes deterministic. The connectivity needed between the cores is therefore known in advance, and static routing between the cores can be used (similar to routing between configurable logic blocks on an FPGA). This approach requires routing switches, which in this design are usually n-type metal-oxide-semiconductor field effect transistors (MOSFETs). Each connection within the routing switch requires a memory cell so that the path can be reconfigured for a particular network (Figure 3.1). The reconfiguration is carried out when the whole system starts.

Figure 3.1: Static routing. (A 5x5 crossbar with an initialization buffer.)

The key benefit of static routing is that it does not require dynamic routing logic, which can significantly reduce power consumption. If the channel utilizations are low, the area of static routing could be larger than that of dynamic routing. A previous study showed that the power and area consumption of static routers is significantly lower than that of dynamic routers [5]. In order to improve the quality of the FPGA implementation, the multicore neuromorphic cores are connected on an FPGA board with static routing.

3.2 STATIC ROUTING DESIGN PROBLEM

Even though the benefit of static routing is obvious, our original static routing design suffers from a major disadvantage. When multiple cores send data to one core at the same time, as Figure 3.2 shows, the input port of the destination core cannot receive the different data simultaneously, because the structure is simple and there is no arrangement of signal transmission.


Figure 3.2: Multi-sources conflict.

In Figure 3.2, the routing switches of three directions are turned on. These switches cannot be controlled while data is being transferred. As a result, data conflict is inevitable in this design, for two reasons:

(1) In a neural network, sending data from multiple cores to one core and sending data from one core to multiple cores are both frequent.

(2) In order to ensure that the system works efficiently within a limited chip area, a balance between network size and router usage frequency is necessary.

Based on the two points above, there will be significant conflict between cores sharing a static router, which is a critical problem. Two approaches are used to solve this problem: a time delay based routing protocol and a hardware controlled communication protocol.
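The multi-sources conflict can be illustrated with a small Python check over a static switch configuration. The representation (a set of input and output port pairs whose memory cell is set) and the port names are hypothetical; the check only flags outputs that more than one input can drive, which is the situation drawn in Figure 3.2.

```python
def find_conflicts(switch_on):
    """Return the outputs that are driven by more than one enabled switch.

    switch_on: set of (input_port, output_port) pairs whose memory cell is 1.
    Returns a dict mapping each contested output port to its list of inputs.
    """
    drivers = {}
    for source, destination in sorted(switch_on):
        drivers.setdefault(destination, []).append(source)
    return {dst: srcs for dst, srcs in drivers.items() if len(srcs) > 1}


# Hypothetical configuration: three directions are switched onto the north output
config = {
    ("east", "north"),
    ("south", "north"),
    ("west", "north"),
    ("core", "east"),
}
conflicts = find_conflicts(config)
```

Because the switches stay fixed during data transfer, any output listed in the result will see colliding data whenever two of its sources transmit at the same time.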

3.3 TIME DELAY BASED ROUTING PROTOCOL ROUTER

The most significant feature of static routing is that it is pre-determined. Not only the routing switches but also the circuits between all routers are scheduled before the system starts processing. For the multi-sources conflict, if data from different sources is sent at different times, so that only one data transfer is in progress at any time, no conflict will occur. In Figure 3.2, the up direction output port can then transfer data packages A, B and C in separate periods. Let $T_A$, $T_B$ and $T_C$ be the time each source needs to transfer a package from head to tail. A possible time delay schedule for the sources is then: $D_A = 0$; $D_B = T_A$; $D_C = T_A + T_B$. This is a simple conflict which can be scheduled easily.

Brandner et al. [6] designed a static routing system for a network-on-chip (NoC). Their design solves the problem caused by traditional communication methods among relevant cores, and their solution is very similar to this time delay protocol. They summarized two principles which also apply to this protocol:

(1) No two routes start or end at the same time instant in the communication schedule;

(2) Any two routes scheduled concurrently utilize disjoint sets of communication links [6].

Given these two principles, a fully considered time schedule is absolutely necessary. With this method, the transfer time of each package needs to be estimated as accurately as possible in order to reduce the system processing time. On the other hand, even a single mistake in the time schedule will cause a conflict. To reduce the likelihood of such conflicts, it is effective to add extra time to each package transfer time as an error tolerance.
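The delay schedule above, including the extra error tolerance, can be sketched as a short Python function. The function name, the cycle counts, and the `margin` parameter are illustrative assumptions; the schedule simply starts each source after all earlier sources have finished.

```python
def delay_schedule(transfer_times, margin=0):
    """Compute a start delay for each source sharing one output port.

    transfer_times: ordered list of (source, T) pairs, where T is the time
                    from sending the package head to sending the package tail.
    margin: extra time added after every transfer as an error tolerance.
    Returns a dict mapping each source to its start delay.
    """
    delays = {}
    start = 0
    for source, duration in transfer_times:
        delays[source] = start
        start += duration + margin   # next source waits for this one to finish
    return delays


# Hypothetical transfer times in clock cycles
times = [("A", 5), ("B", 3), ("C", 4)]
exact = delay_schedule(times)              # D_A = 0, D_B = T_A, D_C = T_A + T_B
padded = delay_schedule(times, margin=1)   # same schedule with error tolerance
```

With a margin of one cycle, every later source is pushed back by one further cycle per preceding transfer, which trades processing time for robustness against estimation errors.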


Generally, the time schedule protocol needs a careful and fully considered schedule. However, because its architecture is the same as the basic design, it fully inherits the advantages of static routing.

3.4 HARDWARE CONTROLLED COMMUNICATION PROTOCOL ROUTER

For a multi-sources NoC conflict problem, a common solution is to use a hardware communication protocol, which establishes efficient communication between data senders and data receivers. This communication protocol is based on handshaking.

3.4.1 Handshaking

Handshaking is a useful negotiation method between two components. A simple handshaking protocol has only two signals, a require signal and an acknowledge signal, as shown in Figure 3.3.

Figure 3.3: A simple handshaking protocol.

In Figure 3.3, the sender is about to send a message to the receiver. To do so, the sender sends a require signal to ask the receiver for permission. If the receiver is available to receive the message, it sends back an acknowledge signal which allows the transmission. The sender then sends the message, which will not conflict with other messages.


3.4.2 Hardware controlled communication protocol router design

With the handshaking protocol, the require signals, acknowledge signals, and data lines need to be on separate channels. The signal transmission for one router is shown in Figure 3.4.

Figure 3.4: Hardware controlled communication protocol router signals.

In Figure 3.4, because the lines are unidirectional, each channel has two directions. Within one direction, there are a require signal, an acknowledge signal, and a data line. The total number of signals of one router is

$$W_s = 2 N_c (2 + W_d). \quad (3.1)$$

Here, $W_s$ is the sum of the widths of all signals, $N_c$ is the number of channels, and $W_d$ is the data bus width. Each direction of a channel carries one require wire, one acknowledge wire, and the data bus, which gives the $2 + W_d$ term in Equation (3.1); the factor of 2 accounts for the two directions. As the number of cores in a neural network increases, the signal width within one router becomes large, which increases the required number of wires on the FPGA.
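As a worked example of Equation (3.1), the Python sketch below counts the wires of one router. The channel count of 4 and the data bus width of 16 are hypothetical values chosen for the example, not figures reported for the implemented design.

```python
def router_signal_width(num_channels, data_width):
    """Equation (3.1): W_s = 2 * N_c * (2 + W_d).

    Each channel has two directions; each direction carries one require wire,
    one acknowledge wire, and a data bus of data_width wires.
    """
    return 2 * num_channels * (2 + data_width)


# Hypothetical router: 4 channels, 16 bit data bus
wires = router_signal_width(4, 16)        # 2 * 4 * (2 + 16) = 144
wider = router_signal_width(4, 32)        # doubling W_d nearly doubles the wires
```

The data bus term dominates the total, which is why keeping the data lines separate per channel makes the signal width grow quickly as the network grows.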

Under the handshaking protocol, each core sends a require signal to the next layer core after computing its neuron outputs. The next layer core responds with an acknowledge signal. As shown in Figure 3.3, the sender holds its neural output until it receives permission. With this protocol, effective communication between the cores completely solves the multi-sources conflict.
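The require and acknowledge exchange can be modeled in Python as shown below. This is a behavioral sketch only: the class and function names are hypothetical, one transfer is completed per loop iteration, and the fixed priority rule (the first requesting sender in sorted order is acknowledged) is an assumption, since the arbitration policy is not specified here.

```python
class Receiver:
    """Destination core that acknowledges one sender at a time."""

    def __init__(self):
        self.granted = None      # sender currently holding the acknowledge
        self.received = []       # delivered (sender, data) pairs, in order

    def arbitrate(self, requires):
        """Pick which require signal gets the acknowledge signal."""
        if self.granted is None and requires:
            self.granted = requires[0]   # assumed fixed priority
        return self.granted

    def accept(self, sender, data):
        """Receive data from the acknowledged sender, then become free again."""
        if sender != self.granted:
            raise ValueError("data sent without an acknowledge signal")
        self.received.append((sender, data))
        self.granted = None


def run_handshake(pending):
    """pending: dict mapping each sender to the output it is holding."""
    receiver = Receiver()
    waiting = dict(pending)
    while waiting:
        requires = sorted(waiting)                 # every waiting sender raises require
        sender = receiver.arbitrate(requires)      # at most one acknowledge is issued
        receiver.accept(sender, waiting.pop(sender))
    return receiver.received


order = run_handshake({"C": 3, "A": 1, "B": 2})
```

Every sender keeps its output until it is acknowledged, so the three transfers are serialized and never overlap on the destination input port.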

CHAPTER 4
EXPERIMENTAL SETUP

The multicore neuromorphic processor was implemented on an Altera DE2 board which contains an Altera Cyclone IV FPGA (part EP4CE115). The neuromorphic network was written in Verilog and simulated using ModelSim. It was then compiled using Quartus II and, finally, tested on the FPGA board. The number of neurons per core, bits per neuron, and synapses per neuron were compile time variables so that different design options could be examined. The synaptic memory was implemented using the on-chip memory within the FPGA.

To evaluate the performance of the FPGA, we applied it to edge detection and to electrocardiography (ECG) signal analysis. Descriptions of the applications are given below.

Edge detection: This application aims to identify points in a digital image at which the image brightness changes sharply or, more formally, has discontinuities. Changes or discontinuities in luminance values within images are fundamentally important primitive characteristics because they often indicate the physical extent of objects within the image. To evaluate the pixel values of the edge image for each pixel, we used neural networks with the configurations 9->11->1 (9 inputs, 11 neurons in the hidden layer and one output neuron), 9->11->1, and 2->5->1.

ECG: Several applications require an ECG to be recorded continuously on a person. These include several variants of implanted heart devices, remote health monitoring systems for elderly people, and devices for patients who use body sensors. These devices are battery powered, so it is extremely important for them to have very low power consumption. A study examined the Arrhythmia Data Set, which consists of 220 patterns of 16 classes [7]. Each pattern consists of 279 attributes/features. For this application we used a 279->50->16 neural network configuration.

CHAPTER 5
RESULTS

5.1 IMPLEMENTATION OF NEURAL CORE AND NETWORKS

Two versions of the core were implemented, one with 16 neurons per core and the other with 64 neurons per core. Both cores had 512 synapses per neuron and ran at 50 MHz. Table 5.1 shows the FPGA resource utilization for these two cores.

Table 5.1: Single core FPGA resource utilization.

                    16 neuron core    64 neuron core
Logic elements      1,844             4,924
Registers           1,105             41,77
Memory bits         131,              ,288
Multipliers

The neuromorphic network is connected by time delay static routing on the FPGA. The FPGA was able to fit a 3x3 grid of the 16 neuron cores and a 2x2 grid of the 64 neuron cores (all at 50 MHz). Table 5.2 shows the total FPGA resource utilization for these two multicore systems, taking both the core and the static routing logic into consideration. The results indicate that the on-chip multipliers are the limiting factor in scaling the number of cores.

Table 5.2: Multicore FPGA resource utilization.

                    Network for edge detection    Network for ECG
Routing method      Static routing                Static routing
Core utilization    9                             4
Neurons per core    16                            64
Logic elements      17,531 (15%)                  19,828 (17%)
Registers           9,896 (8%)                    16,688 (14%)
Memory bits         1,179,648 (29%)               2,097,152 (52%)
Multipliers         288 (54%)                     512 (96%)

In order to compare dynamic routing and static routing, a 2x2 network which uses the 64 neuron core was also built on the FPGA board with dynamic routing. The FPGA resource utilization of this network is shown in Table 5.3.

Table 5.3: Dynamic and static network FPGA resource utilization comparison.

                        Dynamic network      Static network
Core utilization        4                    4
Neurons per core        64                   64
Total logic elements    45,889 (40%)         19,828 (17%)
Total registers         35,787 (30%)         16,688 (14%)
Total memory bits       2,101,052 (52%)      2,097,152 (52%)
Multipliers             512 (96%)            512 (96%)

Table 5.3 shows the main advantage of static routing over dynamic routing. The number of logic elements used with dynamic routing is more than twice that used with static routing, because each dynamic router contains more buffers and logic elements. Dynamic routing designs therefore require a greater chip area than static routing designs.

5.2 COMPARISON OF TWO KINDS OF STATIC ROUTING PROTOCOLS

A static routing network with the hardware controlled communication protocol was implemented on the FPGA. This network contains a 2x2 grid of the 64 neuron cores, the same as the earlier network with the time delay routing protocol.

Table 5.4: Two kinds of routing network FPGA resource utilization.

2x2 64 neuron cores    Hardware controlled routing    Time delay routing
Logic                  (20%)                          (17%)
Registers              (16%)                          (15%)
Memory bits            (53%)                          (53%)
Multipliers            512 (96%)                      512 (96%)

Table 5.4 shows the total FPGA resource utilization for these two kinds of routing network. Even though the architecture of the hardware controlled router is more complex than that of the time delay router, their FPGA resource utilizations are similar. Table 5.5 shows the process time of one data pattern in the ECG application.

Table 5.5: ECG process time of one pattern data.

2x2 64 neuron cores            Process time (cycles, 1 cycle = 200 ps)
Hardware controlled routing    359
Time delay routing             357

In Table 5.5, the hardware controlled routing network takes slightly more time than the time delay routing network. The ECG network used on both routing networks is a simple one, so its time schedule can be built as easily and accurately as for a straight core-to-core network without any routing. Because the process times of the two networks are close, their ECG application throughputs are also close.

5.3 TWO APPLICATIONS RESULTS VERIFICATION

Since the ECG application requires two layers of neurons, we used the 4 core system with 64 neurons per core on the FPGA. Each layer of neurons was simulated on one core. Since the edge detection application requires 4 layers of neurons, the 9 core system with 16 neurons per core was used. The first and the second layers used two cores each, and the third and fourth layers used one core each. The routing network used in this system is static routing with the time delay based routing protocol.

On the FPGA, the results of the applications were verified using the SignalTab II tool in Quartus II, as shown in Figure 5.1.

Figure 5.1: Example of SignalTab II results.

In Figure 5.1, the port OUT[15..0] represents the calculation result, which is a 16 bit fixed point value. The top eight bits represent the integer portion and the last eight bits represent the fractional portion. In order to verify the FPGA results, the Matlab calculation results are shown in Table 5.6 and Table 5.7.
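Reading a 16 bit value such as OUT[15..0] back into a real number can be sketched in Python as follows. The helper names are hypothetical, and treating the value as two's complement signed is an assumption; only the split into eight integer bits and eight fractional bits comes from the description above.

```python
def fixed_to_float(raw, signed=True):
    """Convert a 16 bit fixed point word (8 integer bits, 8 fractional bits)."""
    raw &= 0xFFFF
    if signed and raw & 0x8000:
        raw -= 0x10000           # assumed two's complement interpretation
    return raw / 256.0           # 2**8 steps per unit


def float_to_fixed(value):
    """Quantize a real value to the same 16 bit format (rounded to nearest)."""
    return round(value * 256) & 0xFFFF


one_and_half = fixed_to_float(0x0180)    # integer part 1, fraction 128/256
encoded = float_to_fixed(1.5)
```

A conversion of this kind is what allows the fixed point values captured on the FPGA to be compared against floating point software results.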

Table 5.6: ECG process result of one pattern data. Columns: class number, software calculation, FPGA calculation.

Table 5.7: Edge detection result of some random pixels. Columns: software calculation, FPGA calculation.

For ECG, the result of one pattern is shown in Table 5.6. Some randomly picked edge detection results are shown in Table 5.7. The calculation results are converted from fixed point values to floating point values. The FPGA calculation matches the software calculation. The diagnosis for this ECG pattern of attributes is class 1. According to the Arrhythmia Data Set, class 1 is a normal ECG.

5.4 PERFORMANCE COMPARISON WITH A RISC PROCESSOR

We compared our FPGA performance with an implementation of the applications on an Intel E8400. The edge detection application was implemented in a non-neural network form, as that would be most efficient on a RISC processor. The ECG application was implemented as a neural network.

Table 5.8: Applications throughput. Columns: application, Intel E8400, FPGA. Rows: edge detection (million pixels/second), ECG (inputs/second).

Table 5.8 shows the throughput achieved on the Intel processor compared to the FPGA. The results show that the FPGA implementation provided about 3x and 127x higher throughput than the Intel processor for the edge detection and ECG applications, respectively.

CHAPTER 6
CONCLUSION

The FPGA implementation of this neuromorphic network processor is 3x and 127x faster than an Intel E8400 processor for the edge detection and ECG applications, respectively. Considering resource utilization and system stability, a hardware controlled communication routing network is not a good choice. On the other hand, for a big and complex network, it will be difficult to schedule an efficient time delay network. The separation of data lines prevents the hardware controlled communication routing network from becoming a large network. If the communication method between cores becomes more comprehensive than the current simple handshaking protocol, it may be possible to share the data line among all cores. However, the high FPGA resource usage is a big problem. Considering these disadvantages of static routing, a dynamic routing network, even though it uses more resources, is much more stable than a static routing network, especially for applications with very large amounts of data.

BIBLIOGRAPHY

[1] J, Merolla. P, Akopyan. F, et al., "Building Block of a Programmable Neuromorphic Substrate: A Digital Neurosynaptic Core," International Joint Conference on Neural Networks (IJCNN), June

[2] T. M. Taha, R. Hasan, C. Yakopcic, and M. R. McLean, "Exploring the Design Space of Specialized Multicore Neural Processors," IEEE International Joint Conference on Neural Networks (IJCNN),

[3] R. Hasan and T. M. Taha, "Enabling Back Propagation Training of Memristor Crossbar Neuromorphic Processors," IEEE International Joint Conference on Neural Networks (IJCNN),

[4] C. Yakopcic, R. Hasan, T. M. Taha, "Efficacy of Memristive Crossbars for Neuromorphic Processors," IEEE International Joint Conference on Neural Networks (IJCNN),

[5] R. Hasan and T. M. Taha, "On-Chip Static vs. Dynamic Routing for Feed Forward Neural Networks on Multicore Neuromorphic Architectures," International Conference on Advances in Electrical Engineering (ICAEE), December

[6] F. Brandner, M. Schoeberl, "Static Routing in Symmetric Real-Time Network-on-Chips,"

[7]
