RECONFIGURABLE COMPUTING: A DESIGN AND IMPLEMENTATION STUDY OF ELLIPTIC CURVE METHOD OF FACTORING USING SRC CARTE-C AND CELOXICA HANDEL-C


RECONFIGURABLE COMPUTING: A DESIGN AND IMPLEMENTATION STUDY OF ELLIPTIC CURVE METHOD OF FACTORING USING SRC CARTE-C AND CELOXICA HANDEL-C

by Hoang Le

A Thesis Submitted to the Graduate Faculty of George Mason University In Partial Fulfillment of The Requirements for the Degree of Master of Science, Computer Engineering

Committee:
Dr. Kris Gaj, Thesis Director
Dr. Ronald Barnes, Committee Member
Dr. David Hwang, Committee Member
Dr. Andre Manitius, Department Chairperson
Dr. Lloyd J. Griffiths, Dean, The Volgenau School of Information Technology and Engineering

Date: Spring Semester, 2007
George Mason University, Fairfax, Virginia

Reconfigurable Computing: A Design and Implementation Study of Elliptic Curve Method of Factoring Using SRC Carte-C and Celoxica Handel-C

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science at George Mason University

By Hoang Le
Bachelor of Science, Computer Engineering, George Mason University, 2005

Thesis Director: Dr. Kris Gaj, Associate Professor, Department of Electrical and Computer Engineering

Spring Semester, 2007
George Mason University, Fairfax, VA

Copyright 2007 by Hoang Le
All Rights Reserved

ACKNOWLEDGEMENTS

As always, I owe thanks to a handful of people who devoted their time and expertise to me and to the creation of my thesis. The following people were instrumental in this process, and their input was invaluable. Their patience and willingness to help were surpassed only by their astounding knowledge of all the subject matter in question. I thank each and every one of them for their significant contributions. With deep gratitude, I would like to specifically acknowledge those who were an integral part of my writing:

Dr. Kris Gaj, my advisor, whose knowledge and diversity of expertise never fail to impress me, and who is way too humble to know what an incredible educator he is. Almost as impressive are his patience in sharing that knowledge, and his prompt and thorough responses to my countless detailed questions.

Dr. Soonhak Kwon, Dr. Patrick Baier, Paul Kohlbrenner, Ramakrishna Bachimanchi and Mohammed Khaleeluddin, for their contributions in prototyping and designing the architecture.

Mr. and Mrs. McNees, my friends, who are quite simply the finest and the best. Thank you for your unfaltering support and input, and for coming through in the clutch with editorial and conceptual direction in this thesis. I value our partnership more than words can express.

Lastly, Tracy Vu, my soul-mate, who perpetually offered me creative input, enthusiastic support, and a reason to believe. You're my foundation, my strength, and my inspiration. I never take for granted how incredibly lucky I am to have you.

I would also like to thank the staff at SRC Computers, who were a valuable resource in providing a state-of-the-art platform for this work to be performed on, and who gave a great amount of time and assistance in debugging and technical support. Finally, my thanks go to the reconfigurable computing research groups at GMU and GWU.

TABLE OF CONTENTS

ABSTRACT
CHAPTER 1 - INTRODUCTION
  HIGH PERFORMANCE COMPUTING
  RECONFIGURABLE SUPERCOMPUTING AND BENEFIT IN CRYPTOGRAPHY AND CRYPTANALYSIS
  MOTIVATION AND GOAL OF THIS THESIS
CHAPTER 2 - RECONFIGURABLE COMPUTING PLATFORMS
  FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)
    FPGAs: A High-Performance Complement to Microprocessors
    FPGA Vendors and Families
    FPGA Boards
  RECONFIGURABLE COMPUTERS
    SRC Architecture
    Communications
    Cray XD1
CHAPTER 3 - RECONFIGURABLE COMPUTING LANGUAGES AND PROGRAMMING ENVIRONMENTS
  TRADITIONAL DESIGN FLOW
    HDLs for Programming FPGAs
    HLLs for Programming Microprocessors
    Application Programming Interfaces (APIs) between FPGAs and Microprocessors
  SRC CARTE C
    Introduction
    Language
    Code Development
    Code Optimization
    User's Callable Macros
  CELOXICA HANDEL-C
    Introduction
    Language
    Code Development
    Efficient FPGA Resource Usage
    Code Optimization
    Some Restrictions When Using Handel-C and FPGAs
    Environment
CHAPTER 4 - ELLIPTIC CURVE METHOD OF FACTORING
  RSA PUBLIC KEY CRYPTOSYSTEM
  RSA SECURITY
  FACTORIZATION
    Special Algorithm: Elliptic Curve Method (ECM)
      ECM: Overview
      ECM: Algorithm
      Operations on an Elliptic Curve
      Top Level: Scalar Multiplication
      Medium Level: Point Addition and Doubling
      Low Level: Montgomery Multiplication
    General Algorithm: Number Field Sieve (NFS)
      NFS: Overview
      ECM in NFS
CHAPTER 5 - PREVIOUS WORK ON IMPLEMENTING APPLICATIONS IN HIGH LEVEL LANGUAGES FOR RECONFIGURABLE HARDWARE
CHAPTER 6 - IMPLEMENTATION OF ECM USING SRC CARTE-C
  PARTITIONING THE CODE BETWEEN HIGH LEVEL LANGUAGE (HLL) AND HARDWARE DESCRIPTION LANGUAGE (HDL)
  CODING PARTITION 1: DATA COMMUNICATION MODULE IN CARTE-C
    Overview
    Architecture
    Results
  CODING PARTITION 2: DATA COMMUNICATION MODULE AND CONTROL UNIT IN CARTE-C
    Overview
    Architecture
    Results
CHAPTER 7 - IMPLEMENTATION OF ECM USING CELOXICA HANDEL-C
  CODING PARTITION 1: CONTROL UNIT IN HANDEL-C
    Overview
    Architecture
    Result
  CODING PARTITION 2: ENTIRE ECM IN HANDEL-C
    Overview
    Architecture
    Result
CHAPTER 8 - COMPARISON OF RESULTS
  COMPARISON OF RESULTS OF TWO SCHEMES IN SRC
  COMPARISON OF RESULTS OF TWO SCHEMES IN CELOXICA
  COMPARISON OF RESULT OF ONE COMMON SCHEME IN SRC AND CELOXICA
CHAPTER 9 - COMPARISON BETWEEN LANGUAGES AND PROGRAMMING ENVIRONMENTS
  COMPARISON OF EXISTING HDL CORES AND METHODS OF THEIR INSTANTIATIONS
  COMPARISON BETWEEN SRC CARTE-C AND CELOXICA HANDEL-C LANGUAGES
  COMPARISON BETWEEN SRC CARTE AND CELOXICA DK4 DEVELOPMENT ENVIRONMENTS
  COMPARISON OF FEATURES USED IN THE DEVELOPMENT USING TWO LANGUAGES
CHAPTER 10 - DISCUSSION OF RESULTS
CHAPTER 11 - CONCLUSIONS
BIBLIOGRAPHY
CURRICULUM VITAE

LIST OF TABLES

TABLE 1: CRAY XD1 TECHNICAL SPECIFICATIONS
TABLE 2: SRC MACRO'S CHARACTERISTICS CHART
TABLE 3: CONVERSION BETWEEN THE ORDINARY DOMAIN AND THE MONTGOMERY DOMAIN
TABLE 4: EXECUTION TIME OF PHASE 1 AND PHASE 2 USING SRC-6 RECONFIGURABLE COMPUTER HOLDING 9 ECM UNITS FOR 198-BIT NUMBERS N; B1 = 960 (WHICH IMPLIES NUMBER OF BITS OF K, KN = 1375), B2 = 57000, AND D = 210 IN SRC SCHEME
TABLE 5: PLACE & ROUTE SUMMARY IN SRC SCHEME
TABLE 6: NUMBER OF LINES OF CODE IN SRC SCHEME
TABLE 7: PLACE AND ROUTE SUMMARY FOR ONE ECM UNIT IN SRC SCHEME
TABLE 8: NUMBER OF LINES OF CODE IN SRC SCHEME
TABLE 9: PLACE & ROUTE SUMMARY FOR ONE UNIT IN HANDEL-C SCHEME
TABLE 10: PLACE & ROUTE SUMMARY FOR TEN UNITS IN HANDEL-C SCHEME
TABLE 11: FREQUENCY AND EXECUTION IN HANDEL-C SCHEME
TABLE 12: NUMBER OF LINES OF CODE IN HANDEL-C SCHEME
TABLE 13: PLACE & ROUTE SUMMARY FOR ONE UNIT IN HANDEL-C SCHEME
TABLE 14: PLACE & ROUTE SUMMARY FOR ELEVEN UNITS IN HANDEL-C SCHEME
TABLE 15: FREQUENCY AND EXECUTION IN HANDEL-C SCHEME
TABLE 16: NUMBER OF LINES OF CODE IN HANDEL-C SCHEME
TABLE 17: COMPARISON OF RESULTS OF TWO SCHEMES IN SRC
TABLE 18: COMPARISON OF RESULTS OF TWO SCHEMES IN CELOXICA HANDEL-C
TABLE 19: RESULTS OF HANDEL-C SCHEME 1 AND SRC SCHEME
TABLE 20: COMPARISON OF EXISTING HDL CORES AND METHODS OF THEIR INSTANTIATIONS
TABLE 21: COMPARISON BETWEEN SRC CARTE-C AND CELOXICA HANDEL-C LANGUAGES
TABLE 22: COMPARISON BETWEEN SRC CARTE AND CELOXICA DK4 DEVELOPMENT ENVIRONMENTS
TABLE 23: COMPARISON OF FEATURES USED IN TWO LANGUAGES

LIST OF FIGURES

FIGURE 1: GENERAL ARCHITECTURE OF A RECONFIGURABLE COMPUTER
FIGURE 2: GENERAL ARCHITECTURE OF FPGA
FIGURE 3: XILINX CLB'S STRUCTURE
FIGURE 4: XILINX CLB SLICE'S STRUCTURE
FIGURE 5: SRC HARDWARE ARCHITECTURE
FIGURE 6: MAP'S INTERFACE ARCHITECTURE BLOCK DIAGRAM
FIGURE 7: SNAP CARD
FIGURE 8: ARCHITECTURE OF CRAY XD1 MACHINE
FIGURE 9: DESIGN FLOW FOR FPGAS IN HDLS
FIGURE 10: SOFTWARE DEVELOPMENT PROCESS
FIGURE 11: COMMUNICATION BETWEEN MAIN PROGRAM IN HOST COMPUTER AND FPGA BOARDS USING APIS
FIGURE 12: SRC PROGRAMMING MODEL
FIGURE 13: LIBRARY DEVELOPMENT IN SRC
FIGURE 14: SRC PARALLEL REGION
FIGURE 15: SRC DEVELOPMENT PROCESS
FIGURE 16: MAP COMPILATION PROCESS
FIGURE 17: COMPARISON BETWEEN HANDEL-C AND ANSI-C
FIGURE 18: BRANCHING AND RE-JOINING OF THE EXECUTION FLOW
FIGURE 19: DESIGN FLOW IN HANDEL-C
FIGURE 20: OPERATION OF SECRET-KEY CRYPTOSYSTEM
FIGURE 21: OPERATION OF PUBLIC-KEY CRYPTOSYSTEM
FIGURE 22: TRAP-DOOR ONE-WAY FUNCTION OF RSA
FIGURE 23: HIERARCHY OF ELLIPTIC CURVE OPERATIONS
FIGURE 24: GRAPHICAL ILLUSTRATION OF MONTGOMERY LADDER ALGORITHM, FOR THE CASE OF K =
FIGURE 25: REQUIRED STEPS IN NFS
FIGURE 26: SRC PROGRAM PARTITIONING
FIGURE 27: GENERAL ARCHITECTURE OF ECM. NOTATION: MEM - MEMORY; M1, M2 - MULTIPLIERS 1 AND 2; A/S - ADDER/SUBTRACTOR
FIGURE 28: CODING PARTITION SCHEME 1. NOTATION: MEM - MEMORY; M1, M2 - MULTIPLIERS 1 AND 2; A/S - ADDER/SUBTRACTOR
FIGURE 29: CODING PARTITION SCHEME 2. NOTATION: MEM - MEMORY; M1, M2 - MULTIPLIERS 1 AND 2; A/S - ADDER/SUBTRACTOR
FIGURE 30: CODING PARTITION SCHEME 3. NOTATION: MEM - MEMORY; M1, M2 - MULTIPLIERS 1 AND 2; A/S - ADDER/SUBTRACTOR
FIGURE 31: FOUR SCHEMES TO BE IMPLEMENTED IN THIS THESIS
FIGURE 32: SRC SCHEME 1 - DATA COMMUNICATION MODULE IN MAP-C. NOTATION: DCM - DATA COMMUNICATION MODULE
FIGURE 33: TOP LEVEL OF ECM UNIT IN SRC SCHEME
FIGURE 34: BLOCK DIAGRAM OF THE TOP-LEVEL UNIT IN SRC SCHEME 1. NOTATION: MEM - MEMORY; M1, M2 - MULTIPLIERS 1 AND 2; A/S - ADDER/SUBTRACTOR
FIGURE 35: OVERALL RESULTS OF EXECUTION TIME IN SRC SCHEME
FIGURE 36: SRC SCHEME 2 - CONTROL UNIT IN MAP-C. NOTATION: DCM - DATA COMMUNICATION MODULE
FIGURE 37: TOP LEVEL OF ECM MODULE IN SRC SCHEME
FIGURE 38: BLOCK DIAGRAM OF THE TOP-LEVEL UNIT IN SRC SCHEME 2. NOTATION: M1, M2 - MULTIPLIERS 1 AND 2; A/S - ADDER/SUBTRACTOR
FIGURE 39: HANDEL-C SCHEME 2 - CONTROL UNIT IN HANDEL-C. NOTATION: DCM - DATA COMMUNICATION MODULE
FIGURE 40: HANDEL-C SCHEME 3 - ENTIRE ECM IN HANDEL-C. NOTATION: DCM - DATA COMMUNICATION MODULE
FIGURE 41: DIFFERENT COMPARISONS OF ECM IMPLEMENTATION IN SRC CARTE AND CELOXICA HANDEL-C
FIGURE 42: SRC CARTE COMPILER AREA PENALTY. NOTATION: CU - CONTROL UNIT; U - ECM UNIT
FIGURE 43: GRAPHICAL COMPARISON OF TWO SRC SCHEMES
FIGURE 44: CELOXICA HANDEL-C COMPILER AREA PENALTY. NOTATION: CU - CONTROL UNIT; U - ECM UNIT
FIGURE 45: GRAPHICAL COMPARISON OF TWO HANDEL-C SCHEMES
FIGURE 46: COMPARISON OF AREA PENALTY IN SRC AND CELOXICA. NOTATION: CU - CONTROL UNIT; U - ECM UNIT
FIGURE 47: GRAPHICAL COMPARISON OF HANDEL-C SCHEME 2 AND SRC SCHEME
FIGURE 48: COMPARISONS OF ALL SCHEMES IN CARTE C AND HANDEL-C

LIST OF ALGORITHMS

ALGORITHM 1: ECM ALGORITHM
ALGORITHM 2: MONTGOMERY LADDER ALGORITHM
ALGORITHM 3: ADDITION AND DOUBLING USING THE MONTGOMERY FORM OF ELLIPTIC CURVE
ALGORITHM 4: RADIX-2 MONTGOMERY MULTIPLICATION

ABSTRACT

RECONFIGURABLE COMPUTING: A DESIGN AND IMPLEMENTATION STUDY OF ELLIPTIC CURVE METHOD OF FACTORING USING SRC CARTE-C AND CELOXICA HANDEL-C

Hoang Le, MS
George Mason University, 2007
Thesis Director: Dr. Kris Gaj

High Level Languages (HLLs), especially C-based ones, for hardware/software system design have entered industrial flows. HLLs raise the level of abstraction, allowing designers to describe the desired functions in the quickest possible way rather than focusing on the underlying details. HLLs also facilitate hardware/software co-design and shorten development time, which encourages complex experiments. Code partitioning is the task of identifying the portion of the code to be implemented in HLLs, leaving the rest in Hardware Description Languages (HDLs). Application developers are willing to give up some performance and area utilization in exchange for productivity. Therefore, an in-depth comparison is needed, in which code partitioning is analyzed to determine the combination of HDL and HLL code that best resolves the development time vs. performance trade-off.

This thesis explores different coding partitions of Elliptic Curve Method (ECM) of factoring using SRC Carte-C and Handel-C. Two code partitioning approaches have been implemented and examined using the appropriate development environments, SRC Carte and Celoxica DK4 Development Kit. Similarities, differences and tradeoffs among the investigated languages and partitioning styles are also presented.

Chapter 1 - Introduction

High Performance Computing

Due to high computational demands, many research projects are plainly impossible to pursue without a High Performance Computing (HPC) platform, or would take an unreasonable amount of time to complete. The most commonly used HPC platforms today include parallel supercomputers and computer clusters, that is, computing systems comprised of multiple general-purpose microprocessors linked together in a single system with high-speed interconnects. This computing environment allows for the development of concurrent high-performance solutions using traditional and concurrent programming languages and common application-specific APIs. However, the architecture of a general-purpose microprocessor, as its name suggests, is designed not for one specific application, but for general computing. In addition, the instructions of microprocessors operate on fixed and limited word lengths that may not match the operand sizes required by an application. Therefore, applications with large operand sizes often require multiple instruction executions, which results in a longer execution time.
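As a concrete illustration (not from the thesis), consider adding two 128-bit operands on a 64-bit processor: because the machine word is narrower than the operand, one logical addition costs a chain of dependent instructions. A minimal C sketch of that multi-word addition, with an invented `u128` type:

```c
#include <stdint.h>

/* A 128-bit value split across two 64-bit machine words. */
typedef struct { uint64_t lo, hi; } u128;

/* Multi-word addition: one hardware add is not enough, so the carry
 * out of the low word must be propagated into the high word. */
static u128 add128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;              /* first 64-bit add             */
    uint64_t carry = (r.lo < a.lo);  /* wrap-around means a carry    */
    r.hi = a.hi + b.hi + carry;      /* second add consumes the carry */
    return r;
}
```

A 192-bit ECM operand needs three such dependent word additions on the same processor, while an FPGA datapath can be built exactly 192 bits wide.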

The lack of speed optimization, along with the limitations of microprocessor architecture, has led to the need for an optimized, easy-to-manage and uniform platform. An HPC machine is also a convenient tool for serving the computational needs of many small projects, where installing software and other similar tasks are much easier to do than they would be on a set of separate computers.

Reconfigurable Supercomputing and Benefit in Cryptography and Cryptanalysis

The term reconfigurable computing refers to the concept of a computer consisting of one or more standard processors and an array of reconfigurable hardware. The general architecture of a reconfigurable computer is conceptualized in Figure 1. The main processor controls the behavior of the reconfigurable hardware. The reconfigurable hardware is then tailored to perform a specific task, such as image processing or cryptography, as quickly as a dedicated piece of hardware. Once the task is done, the hardware can be reconfigured to do some other task. This architecture results in a hybrid computer structure combining the flexibility of software with the speed of hardware. Currently there are a number of vendors with commercially available reconfigurable computers aimed at the high performance computing market, including Cray, SGI and SRC Computers, Inc. A modern reconfigurable computer consists of traditional microprocessors coupled with user-programmable FPGAs. The systems can be used as

traditional cluster computers without using the FPGAs. As mentioned before, the real benefit of a reconfigurable computer over a traditional computer comes from the parallelism supported by FPGAs. By allowing the computationally intensive parts of the problem to run on FPGAs, we can achieve a very high throughput. In addition, a reconfigurable computer can be programmed using high-level programming languages, such as C, by mathematicians and scientists themselves. This also shortens development time and encourages experimentation and complex optimizations.

Figure 1: General Architecture of a Reconfigurable Computer

In cryptography, applications are commonly used to implement security services for two parties communicating with each other. Modern security protocols use modular arithmetic in the encryption and decryption of messages. Such operations are known to be

resource and computationally intensive on general-purpose microprocessors, which are not optimized for their fast execution, especially in the case of public-key algorithms. FPGA implementations have the potential of running substantially faster than software implementations due to possible deep pipelining and parallelism.

Motivation and Goal of this Thesis

A number of works have compared the performance and efficiency of designs on different devices and platforms. The majority of them, however, share a common characteristic: the cores have been implemented using Hardware Description Languages (HDLs). At the same time, reconfigurable computers are becoming more and more popular, along with hardware-targeted High Level Languages (HLLs). That raises the need for a deeper comparison, in which code partitioning is analyzed to determine the optimal combination of HDLs and HLLs. This thesis explores different coding partitions of the Elliptic Curve Method (ECM) of factoring using SRC Carte-C and Handel-C. Two code partitions will be implemented and examined using each development environment, SRC Carte and the Celoxica DK4 Development Kit. Design choices and tradeoffs between them are also compared. This thesis is organized as follows: Chapter 2 introduces Reconfigurable Computing Platforms. Chapter 3 describes Reconfigurable Computing Languages and Programming Environments. Chapter 4 discusses the Elliptic Curve Method of Factoring. Chapter 5 describes some previous work on implementing applications in High Level

Languages for reconfigurable hardware. Chapters 6 and 7 describe the implementation of different coding partitions of ECM using SRC Carte-C and Celoxica Handel-C. The comparisons of results, languages, and environments are given in Chapters 8 and 9. Chapter 10 discusses the results, and conclusions are drawn in Chapter 11.
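The public-key modular arithmetic mentioned above can be made concrete with a small, generic C sketch (illustrative only, not thesis code) of right-to-left square-and-multiply modular exponentiation, the core operation behind RSA-style algorithms; in a hardware implementation, each modular multiplication in this loop is what gets pipelined and parallelized:

```c
#include <stdint.h>

/* Modular exponentiation by repeated squaring: computes b^e mod m.
 * Intermediates are 64-bit products, so m must fit in 32 bits here;
 * real public-key moduli (1024+ bits) require multi-word arithmetic,
 * which is exactly why large operands are costly on a microprocessor. */
static uint64_t modexp(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t result = 1;
    b %= m;
    while (e > 0) {
        if (e & 1)                    /* current exponent bit set?  */
            result = (result * b) % m;
        b = (b * b) % m;              /* square the base each round */
        e >>= 1;
    }
    return result;
}
```

For example, `modexp(4, 13, 497)` returns 445, i.e. 4^13 mod 497.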

Chapter 2 - Reconfigurable Computing Platforms

Field Programmable Gate Arrays (FPGAs)

A field programmable gate array (FPGA) is a semiconductor device containing programmable logic components and programmable interconnects. The programmable logic components can be programmed by an end user to implement the functionality of basic logic gates (such as AND, OR, and XOR) or of more complex functions such as decoders or simple math functions. In its simplest form, an FPGA consists of an array of uncommitted elements that can be programmed or interconnected according to a user's specification. The ability to reprogram these devices over and over again, and the flexibility of their interconnection resources, make FPGAs ideal devices for implementing and testing ASIC prototypes. The general architecture of an FPGA is described in Figure 2. Figure 3 shows the structure of a Xilinx CLB, and Figure 4 shows the structure of a Xilinx CLB slice. The most important components in an FPGA are configurable logic blocks (CLBs), input/output blocks (IOBs) and programmable interconnects.

The CLBs contain a variety of different logic functions, such as look-up tables (LUTs), registers, multiplexers (MUXes), and the like. The CLBs incorporated into FPGAs range in complexity and size. A common aspect among CLBs is the use of look-up tables (LUTs) for implementing logic functions. Additionally, CLBs often consist of multiple LUTs, along with programmability that allows the LUTs to be connected together within the CLB. Logic blocks can be as simple as 2-input NAND gates, or can have a complex structure built around multiplexers or look-up tables. Most logic blocks also contain some type of flip-flop, to aid in the implementation of sequential circuits.

Figure 2: General Architecture of FPGA
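To make the LUT idea concrete, here is a small C sketch (illustrative, not vendor code) modeling a 4-input LUT as a 16-entry truth table, which is how a CLB can implement any 4-input Boolean function: the inputs form an index, and the configuration bits are the function's entire truth table.

```c
#include <stdint.h>

/* A 4-input LUT is just a 16-bit truth table: the four inputs are
 * packed into an index that selects one configuration bit. */
static int lut4(uint16_t config, int a, int b, int c, int d) {
    int index = (d << 3) | (c << 2) | (b << 1) | a;  /* 0..15 */
    return (config >> index) & 1;
}
```

For example, config 0x8000 (only bit 15 set) makes the LUT a 4-input AND, while config 0x6996 makes it a 4-input XOR (parity); reprogramming the FPGA amounts to rewriting these configuration bits.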

The IOBs contain circuitry that facilitates the transfer of signals to and from the input/output (I/O) pads of the FPGA. The IOBs are connected to the interconnect network, which includes vertical and horizontal interconnect channels comprising adjacent interconnect lines. An IOB allows signals to be driven off-chip or, optionally, brought onto the FPGA's interconnect segments. The IOB can typically perform other functions, such as tri-stating outputs and registering incoming or outgoing signals. The I/O blocks provide the interface between the external pins of the IC package and the internal signal lines, including the programmable interconnects. The programmable interconnect generally connects a single output of a CLB to an input of another CLB, CLBs to wire segments, or one wire segment to another. An interconnect is comprised of metal wires and transistors that act as pass gates, and signal buffers that preserve signal integrity. The wire segments, together with the programmable switches, form the routing architecture. Similar to the logic blocks, these switches can be designed in many ways, varying in the number of connections between blocks. These designs directly affect the complexity of the routing process. The programmable switches can be constructed from pass transistors controlled by static RAM cells, antifuses, EPROM transistors or EEPROM transistors. The interconnection can be a symmetrical array, row-based, hierarchical, or sea-of-gates.

FPGAs: A High-Performance Complement to Microprocessors

Programmable logic devices offer a cost-effective alternative to custom processors due to their generic nature, with the added benefits of short time-to-market, no NRE costs, off-the-shelf availability, the ability to control inventory in peak and trough times, and the ability to reduce the total cost of ownership over the lifetime of an end product.

Figure 3: Xilinx CLB's Structure

Microprocessor obsolescence is a major concern for many companies. Programmable logic can provide a viable solution to this problem. By using soft-core microprocessors

embedded within a programmable logic device, not only can one own the processor core for use in any future devices and platforms, but the design can also be both flexible and scalable to suit different platforms [1].

Figure 4: Xilinx CLB Slice's Structure

The advantage of an FPGA over a processor is due to its fine-grain parallelism; the advantage over an application-specific integrated circuit (ASIC) is the FPGA's

reconfigurability. This latter characteristic allows the FPGA device to be configured to carry out a specific function, and to be reconfigured to carry out a different function at a later time. FPGAs can be reconfigured on-the-fly during in-field use, as each algorithm is implemented as a separate configuration file. Algorithms can run at very high speeds because each of them is optimized separately, without constraints imposed by any other algorithm. The clock speed can also be controlled according to the performance of each algorithm, independent of any other algorithm. Similarly, the entire on-chip logic resources of the FPGA can be devoted to maximizing the packing density of a particular algorithm's processing cells, independent of other algorithms. FPGAs do not force a rigid predefined limit on the number of cells per chip. Switching the FPGA wiring between algorithms is brief, and is transparent to the user. Each algorithm's processing element is as small as possible, maximizing the opportunity for parallelism.

FPGA Vendors and Families

There are a number of FPGA vendors in the market, such as Xilinx, Altera and Actel. Their products are classified into two categories:

SRAM-based FPGAs
o Xilinx, Inc.
o Altera Corp.
o Atmel
o Lattice Semiconductor

Flash & antifuse FPGAs
o Actel Corp.
o QuickLogic Corp.

Each company produces different lines of devices, from low-cost to high-performance. Xilinx and Altera are the most popular FPGA vendors; together they capture about 90% of the market share.

Xilinx FPGAs include:

Old families
o XC3000, XC4000, XC5200 (old 0.5µm, 0.35µm and 0.25µm technology; not recommended for modern designs)

Low-cost families
o Spartan/XL, derived from XC4000
o Spartan-II, derived from Virtex
o Spartan-IIE, derived from Virtex-E
o Spartan-3

High-performance families
o Virtex (0.22µm)
o Virtex-E, Virtex-EM (0.18µm)
o Virtex-II, Virtex-II Pro (0.13µm)
o Virtex-4 (0.09µm)

Altera FPGAs include:

Low-cost FPGAs
o Cyclone III (0.065µm)
o Cyclone II (0.09µm)
o Cyclone (0.13µm)

High-end FPGAs
o Stratix III (0.065µm)
o Stratix II GX (0.09µm)
o Stratix II (0.09µm)

FPGA Boards

There are a number of FPGA boards produced by different vendors in the market, such as Xilinx (Spartan-3E Starter Kit, HW-AFX-FG , HW-V2P-ML323, etc.), Xess (XSA-50, XSA-3S1000, XSB-300E, etc.), Celoxica (RC10, RC300, RC2000, etc.), and so on. The main uses of FPGA boards are prototyping and education. They ease the development of products by allowing developers to concentrate on the system level, knowing that the intricacies of the system parts have been fully engineered to a competent level. FPGA boards can come with a good number of add-ons, which can help speed up the development process. These add-ons include, but are not limited to, a microcontroller, memory, an oscillator, a parallel port, a PS/2 port for a mouse or keyboard, an A/D converter, stereo input/output, and a VGA monitor port.

Reconfigurable Computers

SRC Architecture

The SRC-6E system architecture front-end consists of two dual-Intel-processor motherboards. Each motherboard contains two Intel P4 2.8 GHz Xeon processors and one gigabyte of double-data-rate (DDR200) DRAM memory. Each system hosts a multiprocessor version of the Linux operating system, and provides two distinct Linux-based microprocessor-FPGA reconfigurable computers [2].

Figure 5: SRC Hardware Architecture

Each MAP coprocessor consists of two Xilinx Virtex II 6000 FPGA chips available for user logic, and a control processor, which is also a Xilinx Virtex II 6000 FPGA, all with a speed grade of four and running at a clock rate of 100 MHz. The control processor implements fixed control logic, and handles data transfers between the Intel system memory and the onboard memory of the MAP processor.

Figure 6: MAP's Interface Architecture Block Diagram

Communications

The MAP control processor communicates with the Intel processors through a SNAP interconnect. The SNAP interconnect is a high-speed, low-latency interface which functions as a memory interface and plugs into a DRAM slot on the motherboard. SNAP's effective data transfer rate into the MAP control FPGA is 1.6 GB/s. This high-speed data transfer is due to a DMA read from common memory before the write to the SNAP interface. Since there is one control bit for every byte of data transferred, the maximum payload bandwidth is 1,422 MB/s. Payload bandwidths are further reduced to 1,415 MB/s for SNAP writes and 1,280 MB/s for SNAP reads due to SRC microprocessor cache-flushing requirements. These SNAP transfer rates provide a significant throughput advantage for SRC's SNAP versus component interfacing using the PCI-X bus.

Figure 7: SNAP Card
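The 1,422 MB/s figure follows from the framing overhead: with one control bit per data byte, only 8 of every 9 bits on the link carry payload. A quick C check of that arithmetic (a sketch of the calculation, not SRC code):

```c
/* Payload bandwidth of a link that adds one control bit per data byte:
 * each byte costs 9 bits on the wire, so payload = raw * 8/9. */
static double payload_mbs(double raw_mbs) {
    return raw_mbs * 8.0 / 9.0;
}
```

Applied to the 1.6 GB/s raw rate, `payload_mbs(1600.0)` gives about 1422 MB/s, matching the quoted maximum payload bandwidth.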

The block diagram of a SNAP card is shown in Figure 7.

Cray XD1

The Cray XD1 system is based on the Direct Connected Processor architecture (Figure 8). It combines many processors into a single system to deliver a high level of application performance. This implementation links processors to each other using a high-performance interconnect fabric, eliminating shared-memory contention and PCI bus bottlenecks. The Cray XD1 supercomputer is designed to deliver exceptional HPC application performance. It is very reliable and scalable to hundreds of compute nodes due to its high bandwidth and low latency.

Figure 8: Architecture of Cray XD1 Machine

Cray XD1 technical specifications are shown in Table 1. For more details, please contact Cray, The Supercomputer Company.

Table 1: Cray XD1 technical specifications

CPU: 64-bit AMD Opteron 200 series single or dual core processors per chassis
Cache: 64K L1 instruction cache, 64K L1 data cache, 1 MB L2 cache per core
FLOPS: 106 GFLOPS theoretical peak performance (@ 2.2 GHz dual core)
SMP: 2-way (single core) or 4-way (dual core) nodes
Main Memory: GB PC3200 (DDR400) Registered ECC SDRAM per chassis, or 96 GB PC2700 (DDR333) Registered ECC SDRAM per chassis (1-8 GB per socket)
Memory Bandwidth: 12.8 GB/s per node
Interconnect: 2 or 4 Cray RapidArray links per node (4 or 8 GB/s per node); fully non-blocking Cray RapidArray switch fabric (48 or 96 GB/s); 12 or 24 external Cray RapidArray interchassis links (24 or 48 GB/s aggregate); 1.7 µs latency between nodes; Direct Connect or Fat Tree topology
Application Acceleration (FPGA): 6 Xilinx Virtex-4 FPGAs, XC4VLX160-10, 16 MB QDR RAM, 3.2 GB/s interconnect
External I/O Options: 4 PCI-X bus slots; single, dual, and quad port Gigabit Ethernet PCI-X cards; single port 10 Gigabit Ethernet PCI-X card; single and dual port Fibre Channel HBAs; single port InfiniBand HCA
Disk: up to six 3.5-inch Serial ATA drives (74 GB 10K RPM or 250 GB 7200 RPM)
System Administration: graphical and command-line system administration toolsets for partition management, fault management, configuration, security, software updating, telemetry, and provisioning across all chassis in a system; partitioning of the system into multiple logical computers; administration of an entire partition as a single entity (single system command and control); transaction processing used to ensure configuration consistency; workload management (Grid Engine, PBS Pro); automatic interconnect topology verification and autoconfiguration of L2 and IP networking; automatic response to component failures (isolation of hard failures, re-initialization on soft failures, switching around redundant components); automatic restart of jobs following system failure
Reliability: management processor, network, and over 200 measurement points on each Cray XD1 chassis; independent 100 Mb/s management fabric within and between Cray XD1 chassis; thermal stability maintained through temperature monitoring and regulation of fan speed; proactive detection of impending fan failure and automatic isolation of affected components; hot-swappable fans and disk blades
Operating System: Cray HPC-enhanced Linux
File Systems: NFS v2/3, ReiserFS, Lustre Global Parallel File System
External Storage: Fibre Channel disk controllers and drive enclosures
Parallel Processing: MPI 1.2, MPI-IO, Sockets Direct Protocol (SDP) for TCP/IP acceleration
Shared Memory Access: Shmem, OpenMP, Global Arrays
Compilers: Fortran 77, 90, 95, HPF; C/C++; Java
Power: 2200 W typical; 3-phase circuit requirement: 20 A, 208 V per chassis; single-phase circuit requirement: 30 A, 200 V to 240 V per chassis
Dimensions: 3 VU (5.25") x 23" W x 36" D per chassis (13.3 cm high x 58.4 cm wide x 91.4 cm deep), 12 chassis per cabinet

Chapter 3 - Reconfigurable Computing Languages and Programming Environments

Traditional Design Flow

HDLs for Programming FPGAs

FPGAs have traditionally been programmed by hardware engineers using a Hardware Description Language (HDL). The two principal languages in use are Verilog and VHDL. They were originally developed as modeling languages, with tools for direct implementation from the Register Transfer Level (RTL). HDLs were a big improvement over schematic circuit design. The design process for FPGAs in HDLs is conceptualized in Figure 9. The input to the process is the specification of the design. Developers then describe that specification in an HDL. A functional simulation is performed at this point to ensure that the design functions properly. Upon the completion of functional simulation, the HDL source code is synthesized using a synthesis tool such as Synplify Pro. A post-synthesis simulation is conducted to make sure that no errors were introduced during the translation process. A netlist file, which is generated by the synthesis tool, is then used by the

implementation tools (such as the Xilinx tools) to perform mapping, placing and routing. A bitstream, which is the output of the implementation process, is then used to configure the targeted FPGA device.

Figure 9: Design Flow for FPGAs in HDLs (specification, HDL description and source code, functional simulation, synthesis with post-synthesis simulation, implementation with mapping, placing and routing, timing simulation, and configuration with on-chip testing)

HLLs for Programming Microprocessors

The software development process consists of the following steps:

Software elements analysis: extract the requirements. This is the most important task in the process.
Specification: precisely describe the software to be written. The specification is the most important internal interface and should be consistent.
Architecture: an abstract representation of the system, used to make sure it will meet all the requirements.
Implementation: translate the design into HLL code.
Testing: ensure that the final product is bug-free and meets all the customer's requirements.

The entire software development process is described in Figure 10.

Application Programming Interfaces (APIs) between FPGAs and Microprocessors

An API (application programming interface) is a layer that allows developers to work at a higher level of design abstraction. API functions act as the interface between the programming software and the actual hardware. An API hides the underlying detail, and draws parallels with the development support and hardware independence provided by modern microprocessors' operating system environments. An API allows consistent

interfacing between hardware applications and software applications. It simplifies the development process and improves application portability.

Figure 10: Software Development Process (customer requirements feed software elements analysis, followed by specification, architecture, implementation (coding), and testing of all parts of the software)

Figure 11: Communication between the main program in the host computer and the FPGA boards using APIs

Figure 11 shows the communication between a main program in the host computer and the FPGA board. As described in the figure, the APIs take care of the communication task, providing input data to the FPGA board from the host computer, and transferring output data from the FPGA board back to the host computer. There is no common standard for such APIs, so FPGA vendors each develop their own API. This leads to hardware-dependent code and poor portability between vendors.

SRC Carte-C

Introduction

Carte-C is a part of the SRC Carte programming environment. It is a subset of ANSI C, and thus very easy to learn and use. Carte-C was developed for software developers, mathematicians and scientists who want to employ the power of hardware with the flexibility of software design. It also facilitates hardware/software co-design, and shortens development time, which encourages complex optimizations. The relationship between microprocessor sources, MAP processor sources, MAP/user macros, and their execution structure within a MAP FPGA is described in Figure 12. On the MAP, FPGA designs can be implemented in VHDL, in C or FORTRAN, or in a combination of these. A simple API is available in the high-level language (HLL) to handle data transfer between the Intel and MAP processors and to control data input and output to the included user HDL designs.

SRC also supports the use of library macros to speed up the development process. A macro is an HDL source that is targeted for the MAP processor. Either VHDL or Verilog may be used to write a macro. A macro's primary characteristic is defined as functional, stateful, or external. Additional characteristics define the latency of the design and whether the design is pipelined. A library, however, can be developed in either an HDL or an HLL. Developers can make use of pre-defined libraries to reduce development time and improve the efficiency of the code, since the library macros have been tested and optimized.

Figure 12: SRC Programming Model (the microprocessor side, written in ANSI-C, calls functions written in Carte-C, a subset of ANSI-C; these functions instantiate macros from libraries of VHDL macros, which are placed in the FPGAs together with the I/O)

If C or FORTRAN source is used, alone or in combination with HDL source, to implement the design within the MAP processor, the compiler attempts to extract the maximum parallelism within each active loop construct, and generates pipelined hardware logic for instantiation in the MAP FPGAs.

Figure 13: Library Development in SRC Language (a library developer working in an LLL (assembly), HDL (VHDL, Verilog) or HLL (C, Fortran) provides components used by the application developer, whose HLL code targets both the microprocessor system and the FPGA system)

Code Development

In order to take full advantage of the MAP hardware, it is essential for the user to make some modifications to the application. As the compiler technology matures and more optimizations that target the unique characteristics of the MAP hardware can be performed automatically, it will become less important for users to understand the behavior

of the hardware and to modify their code accordingly. These modifications include partitioning the code for MAP execution into new files, restructuring the partitioned code, inserting MAP resource management library calls, and inserting calls to the functions compiled for MAP execution into the code that executes on the Intel processor [3].

Partitioning the code: a good code partitioning approach improves overall performance. Another potential improvement could be in the process of manipulating single bits from within a long bit-stream of data.

Inlining functionality: inlining allows the user to structure their MAP code into a hierarchy of function calls. Another benefit of inlining is that after the inlining phase has completed, the entire code body is available to the compiler for optimization, allowing optimization across function calls and eliminating function call overhead.

Modifications to the MAP functions: all data must be passed as parameters to the function, including an argument that indicates which MAP the function is to be executed on. This parameter is the last parameter in the formal parameter list.

Modifications to the calling functions: these modifications include deciding where the data will reside in memory, aligning data arrays passed as actual

parameters to the MAP function, allocating and releasing MAP resources, providing a prototype of the function, and calling the MAP function.

Using two user logic chips: if a MAP routine cannot fit into one user chip, the second chip can be used to effectively double the amount of available logic and BRAM space. In this mode the two chips are programmed with two separate routines that execute concurrently and that can communicate and synchronize with each other. Note that only the primary chip can communicate with the main program on the microprocessor side; the secondary chip needs to communicate via the primary one.

Parallel code blocks: the compiler can identify and exploit much parallelism within each code block, especially within a pipelined loop. However, performance may benefit further when different sections of the original code execute in parallel independently. A parallel region contains multiple parallel sections, as illustrated in Figure 14. When normal execution reaches a parallel region, a FORK code block is initiated and execution simultaneously starts at the beginning of all of the parallel sections. Thus there is an active code block in each of the parallel sections until that parallel section completes (reaches the special JOIN code block). Only when all parallel sections are complete does execution continue from the JOIN code block.

Figure 14: SRC Parallel Region (a FORK block fans out into Parallel Section #1 through Parallel Section #N, which re-converge at a JOIN block)

Code Optimization

Since the MAP's user logic is clocked at 100 MHz, achieving good performance hinges on parallel execution of functional units. In the current compilation environment, parallelism comes from three sources. First, code blocks can be executed in parallel. Second, operations within code blocks that have no data dependencies can execute concurrently. Third, pipeline parallelism is extracted from inner loops.

Loop performance: there are four potential sources of loop slowdown: loop-carried scalar dependencies, loop-carried memory dependencies, multiple

accesses to a memory unit (OBM or BRAM), and periodic input macro calls. SRC macros or temporary variables can be used to avoid these problems.

Replacement of If-Then-Else Statements: if-then-else statements may result in inefficient code generation. In some cases these statements can be eliminated by using SRC selector macros. In pipelined inner loops the compiler automatically transforms the code to a form that uses selector macros. Thus, the programmer only needs to manually insert selector macros in code outside of pipelined inner loops, such as: straight-line code, outer loops, and loops that cannot be pipelined.

User's Callable Macros

The MAP compiler translates the source code's various basic operations into macro instantiations. (A macro is a piece of hardware logic designed to implement a certain function. It may originate from Verilog, VHDL, schematic capture, etc., but is typically represented in the form of an EDIF *.edn file.) Since users often wish to extend the built-in set of operators, the compiler allows users to integrate their own macros into the compilation process. A macro is invoked from within the C code by means of a void function call. This call's arguments must be specified so that all incoming values precede all outgoing values. Macros can be categorized by various criteria, and the compiler treats them in different ways based on their characteristics. Five types of user macros can be used by the MAP

compiler: Pure Functional, Pure Functional Periodic, Stateful, Stateful Periodic, and External. Table 2 below shows their characteristics.

Table 2: SRC Macro's Characteristics Chart

Macro type                 Stateful   External   Latency    Fully pipelined   Periodic
Pure Functional            No         No         Fixed      Yes               No
Pure Functional Periodic   No         No         Fixed      No                Yes
Stateful                   Yes        No         Fixed      Yes               No
Stateful Periodic          Yes        No         Fixed      No                Yes
External                   Yes or No  Yes        Variable   N/A               N/A

Figure 15: SRC Development Process (application sources in .c/.f files pass through the microprocessor compiler and .mc/.mf MAP sources through the MAP compiler, producing .o object files; HDL sources (.v) and macro sources (.vhd/.v) pass through logic synthesis to .ngo netlists and place & route to .bin configuration bitstreams; the linker combines object files and bitstreams into the application executable)

The compilation process consists of three parts: compilation of the user logic HDL files to be executed on the MAP processor, compilation of the C or FORTRAN code to be executed on the MAP processor, and compilation of the C or FORTRAN code to be executed on the Intel processor. The binary files produced by each compilation are combined into a single executable file. The MAP-C compiler translates functions that have been modified for MAP execution into relocatable object files. The translation process has several steps, each performed by a distinct component of the MAP-C compiler [3]. The flow of the MAP compilation process is conceptualized in Figure 16.

Figure 16: MAP Compilation Process (the RC compilation system takes FORTRAN and C HLL source through optimization, DFG generation, logic partitioning, Verilog generation, synthesis, and place & route to produce the application executable, linking in MAP macros, customer macros and runtime macros)

A single driver command controls the different components of the MAP compilation process for C code. Its input is a file consisting of C functions that have been modified to execute on a MAP. The resulting output files depend on the options specified on the command line. These options control the execution of the compiler components, which produce the object files required to support the three execution modes: hardware, debug, and simulation.

Celoxica Handel-C

Introduction

Handel-C is a programming language designed to facilitate the compilation of programs into synchronous hardware. It is not a hardware description language, though; rather, it is a programming language aimed at compiling high-level algorithms directly into gate-level hardware [4]. Handel-C's level of design abstraction is above RTL but below the behavioral level, which allows the designer to focus on the specification of an algorithm rather than adopting a structural approach to coding. For these reasons Handel-C was chosen for the implementation of ECM.

Language

Handel-C is a subset of ANSI-C with the constructs needed for hardware design added. A comparison of ANSI-C and Handel-C is given in Figure 17. The Handel-C syntax is based on that of conventional C, so programmers familiar with conventional C will recognize

almost all the constructs in the Handel-C language. Since Handel-C is based on the syntax of conventional C, programs written in Handel-C are implicitly sequential. Writing one command after another indicates that those instructions should be executed in that exact order. Handel-C provides constructs to control the flow of a program. Algorithms can be expressed in Handel-C without worrying about how the underlying computation engine works. This philosophy makes Handel-C a programming language rather than a hardware description language. In some senses, Handel-C is to hardware what a conventional high-level language is to microprocessor assembly language.

Figure 17: Comparison between Handel-C and ANSI-C (the two languages share the core ANSI-C constructs such as arithmetic and bitwise operators, arrays, functions and structures; features such as side effects, the full standard library, recursion, floating point and unrestricted pointers belong to ANSI-C only, while Handel-C adds parallelism, arbitrary-width variables, RAM/ROM, channels, signals, interfaces and enhanced bit manipulation)

Code partitioning can be carried out after successfully verifying the functionality of the system at the high level (C/C++). At this point, individual functions (modules) can be chosen for implementation in Handel-C. Simulation can be done within the completely specified system model, and it can help developers consider block-based partitioning decisions. Testbenches can be built in C/C++, incorporating input from and output to other tools. The same testbenches can be used throughout the entire development process. A Handel-C program can be built as a stand-alone standard executable, which runs outside the DK environment, for inclusion in larger systems or for distribution.

Like SRC Carte, Handel-C supports hardware macros written in VHDL or Verilog. Handel-C uses an interface construct to communicate with the HDL or EDIF, but the connections are slightly different in these two cases, since buses are expanded into single signals in EDIF. Handel-C identifies the HDL/EDIF component it must connect to by using the component's HDL/EDIF name. (For VHDL, if the ports generated by Handel-C are of a different type from those used in the VHDL, a wrapper file is needed to connect the two types of ports together.) The hardware design that DK produces is generated directly from the Handel-C source program. There is no intermediate interpreting layer, as exists in assembly language when targeting general-purpose microprocessors. The logic gates that make up the final Handel-C circuit are the assembly instructions of the Handel-C system.

Code Development

As with other high-level languages, the programmer works in familiar C-style source, but the target of the Handel-C compiler is low-level hardware. Thus, the major advantages are made possible by the use of parallelism. Although programs written in Handel-C are sequential, it is possible to direct the compiler to build hardware that executes statements in parallel. Handel-C parallelism is true parallelism; it is not the time-sliced parallelism familiar from general-purpose computers. In other words, when instructed to execute two instructions in parallel, those two instructions will be executed at exactly the same instant in time by two separate pieces of hardware. When a parallel block is encountered, execution flow splits at the start of the parallel block and each branch of the block executes simultaneously. Execution flow then re-joins at the end of the block when all branches have completed. Any branches that complete early are forced to wait for the slowest branch before continuing, as shown in Figure 18. Within parallel blocks of code, sequential branches can be added by using a code block denoted with braces instead of a single statement. For example (the variable names are illustrative):

    par {
        x = 1;
        {
            y = 2;
            z = 3;
        }
    }

Figure 18: Branching and Re-joining of the Execution Flow (a parallel block forks into branches that re-join once the slowest branch completes)

In this example, the first branch of the parallel statement executes the assignment to x, while the second branch sequentially executes the assignments to y and z. The assignments to x and y occur in the same clock cycle; the assignment to z occurs in the next clock cycle. Using this parallel statement, large blocks of functionality can be generated that execute in parallel. It should be noted that a variable cannot be written multiple times in the same clock cycle.

Efficient FPGA Resource Usage

For efficient use of hardware, Handel-C provides the flexibility of user-defined data types of variable sizes. In Handel-C, integers are not limited to 64 bits; in fact, Handel-C types are not limited to specific widths. When a variable is declared, the minimum sufficient width should be specified to minimize hardware usage; for example, int 5 x; declares a 5-bit integer.

Code Optimization

Handel-C is a C-based hardware description language. In Handel-C, registers are implemented using flip-flops and all other circuitry is made up of logic gates. Each of the logic gates in the circuit has a delay associated with it as the inputs propagate through to the outputs. Optimization is the main concern when modeling hardware: reducing the propagation delay and exploiting parallelism and pipelining. A Handel-C program executes with one clock source for each main function. It is important to be aware of exactly which parts of the code execute on which clock cycles. This is not only important for writing code that executes in fewer clock cycles, but may mean the difference between correct and incorrect code when using Handel-C's parallelism. The following techniques may be utilized when optimizing the code:

Reducing logic depth: deep logic results in long path delays in the final circuit, therefore reducing logic depth should increase clock speed. Division and modulo

operators produce the deepest logic. Most common divisions and multiplications can be done with the shift operators. Most common modulo operations can be done with the bitwise AND operator. Wide adders should be broken up, using more clock cycles with shorter adders. Greater-than and less-than comparisons should be avoided, since they produce deep logic. Complex expressions should be broken into a number of stages.

Pipelining: a pipelined circuit takes more than one clock cycle to calculate any single result, but can produce one result every clock cycle. The trade-off is an increase in latency for a higher throughput, so pipelining is only effective if there is a large amount of data to be processed.

Parallelism: this is the main reason why an application in hardware can run much faster than in software, even though the hardware runs at a much slower clock speed. One should analyze the potential parallelism of a program in order to break it into non-conflicting parallel blocks, which operate in the same clock cycles to achieve a speedup.

Some Restrictions When Using Handel-C and FPGAs

There are a number of resemblances between Handel-C and ANSI-C. Handel-C has a few extensions to ANSI-C, which allow additional functionality for hardware design. However, Handel-C is a language for digital logic design, which means that the way in which DK interprets it may differ from the way in which compilers interpret ANSI-C for software

design. It also lacks some ANSI-C constructs that are not appropriate for hardware implementation.

Environment

In contrast to SRC Carte, the design flow in Handel-C is quite simple. Figure 19 describes the design flow in Handel-C. The current programming environment for Handel-C is the Development Kit DK-4. DK is the Integrated Development Environment (IDE) for Handel-C and supports a mixed abstraction-level modeling and simulation environment. DK supports Handel-C, C and C++ source code, in which the C/C++ functions are used for simulation only. There is a clear boundary between the parts of the design used for simulation and the parts used for hardware implementation. Since Handel-C is based on the syntax of conventional C, programs written in Handel-C are implicitly sequential. Writing one command after another indicates that those instructions should be executed in that exact order. To execute instructions in parallel, the par keyword must be used. The hardware design that DK produces is generated directly from the Handel-C source program. There is no intermediate interpreting layer, as exists in assembly language when targeting general-purpose microprocessors. The logic gates that make up the final Handel-C circuit are the assembly instructions of the Handel-C system.

The overall program structure comprises one or more main functions, each associated with a clock. This is unlike conventional C, where only one main function is allowed. More than one main function should be used only if different speeds (and therefore different clocks) are needed.

Figure 19: Design Flow in Handel-C (the executable specification in Handel-C is compiled either to VHDL, which is synthesized to EDIF, or directly to EDIF, followed by place & route)

Chapter 4 - Elliptic Curve Method of Factoring

RSA Public Key Cryptosystem

There are two kinds of cryptographic systems: secret-key and public-key. A secret-key system involves the use of one key, called the secret key. A public-key system involves the use of two keys, named the public key and the private key.

Figure 20: Operation of Secret-Key Cryptosystem (plaintext is encrypted into ciphertext under a shared key, and the same shared key turns the ciphertext back into plaintext)

In a secret-key cryptographic system, only a single key is used. Given a message (plaintext) and the key, encryption produces a ciphertext (unintelligible data), which is

about the same length as the plaintext. Decryption is the reverse of encryption, whereby the ciphertext and the same key are used to reproduce the plaintext. A secret-key cryptosystem is sometimes referred to as a conventional or symmetric cryptosystem (Figure 20). The main challenge in this system is to get the sender and receiver to agree on a secret key without anyone else learning the key. Anyone who finds out the secret key can read, modify and forge all messages encrypted using that key. The generation, transmission and storage of keys is called key management. Secret-key cryptosystems often have difficulty providing secure key management.

A public-key cryptosystem is sometimes referred to as an asymmetric cryptosystem; public-key cryptography was invented in 1976 by Whitfield Diffie and Martin Hellman [5]. Unlike in a secret-key cryptosystem, keys are not shared. Instead, each individual has two keys: a private key that is kept secret, and a public key that is known to the world. The private and public keys are mathematically linked to each other. This system functions as a one-way trapdoor and is described in Figure 21.

RSA is named after its inventors, Rivest, Shamir, and Adleman. It is the most widely used public-key cryptosystem, performing encryption as well as decryption. The key length is variable, depending on the application. RSA's key can be chosen long for enhanced security, or short for efficiency. The most commonly used key length for RSA is 1024 bits.

In an RSA system, there are a public key and a corresponding private key.

Figure 21: Operation of Public-Key Cryptosystem (a message X is encrypted with the recipient's public key, and the result Y is decrypted with the corresponding private key)

Keys are generated as follows:

Choose two large primes p and q, and multiply them together to get n = pq. Choose a number e that is relatively prime to (p-1)(q-1). The public key will be (e, n). Find the number d that is the multiplicative inverse of e modulo (p-1)(q-1); that is, ed = 1 (mod (p-1)(q-1)). The private key will be (d, n).

Given a message m, the ciphertext is calculated as c = m^e mod n. Given a ciphertext c, the message is recovered by m = c^d mod n. This forms the trap-door one-way function shown in Figure 22.

Figure 22: Trap-door One-way Function of RSA (m is mapped to c under the public key (e, n), and recovered from c under the private key (d, n))

RSA Security

The real premise behind RSA's security is the assumption that factoring a big number is difficult or infeasible. If n can be factored to recover p and q, then the security of RSA is compromised. Therefore, the cost and time required to factor a b-bit RSA modulus provide an upper bound on the security of b-bit RSA. Currently, the typical size of n is 1024 bits, or 309 decimal digits.

Factorization

Algorithms for integer factorization can be split into two groups: the special and the general algorithms. The special algorithms are for a special class of numbers; that is, the factoring time depends on the size or special features of the factors. The special algorithms include, among others, the Elliptic Curve Method (ECM). The general algorithms are not constructed for any special class of numbers to be factored; that is, the factoring time depends only on the size of the

number. The Quadratic Sieve and the Number Field Sieve are two examples of general algorithms.

Special Algorithm: Elliptic Curve Method (ECM)

ECM: Overview

Let K be a field with characteristic different from 2 and 3; for example, K = GF(p) with p a prime, the set of integers {0, 1, ..., p-1} with addition and multiplication modulo p. An elliptic curve over K is defined as the set of points (x, y) satisfying

y^2 = x^3 + ax + b    (Eq. 1)

together with a special point called the point at infinity and denoted O. Two points P and Q can be added together to give a third point P + Q, where x_{P+Q} = f(P, Q) and y_{P+Q} = g(P, Q) for some K-rational functions f and g. The point at infinity, O, is an identity element of this operation, i.e., P + O = O + P = P. The points of the curve (including the point at infinity), together with the aforementioned addition, form a group, which is denoted E(K).

The representation of elliptic curve points using two coordinates (x, y) is called the affine representation. In order to increase the computational efficiency of point addition, one may prefer the representation of points in homogeneous (projective) coordinates (X, Y, Z), satisfying

Y^2 Z = X^3 + a X Z^2 + b Z^3    (Eq. 2)

With this change, (X, Y, Z) with Z != 0 represents (x, y) = (X/Z, Y/Z) in affine coordinates. If Z = 0, then (X, Y, Z) is the point at infinity, which is represented by (0, 1, 0) in projective coordinates.

Montgomery [6] studied elliptic curves of the form b y^2 = x^3 + a x^2 + x to increase the speed of elliptic curve operations in software and hardware. This form is obtained by a change of variables from Eq. 1. The corresponding expression in projective coordinates is

b Y^2 Z = X^3 + a X^2 Z + X Z^2    (Eq. 3)

with x = X/Z and y = Y/Z. Using the above form of elliptic curves, Montgomery derived an addition formula for P and Q which does not need any y-coordinate information, assuming that the difference P - Q is already known. The choice of the parameters a and b for the curve given above can be simplified using Suyama's parameterization, which expresses a, b and the coordinates of an initial point on the curve as functions of a single parameter, as described in detail in [7].

Let N be a composite integer to be factored. The ECM method [6] [7] [8] considers elliptic curves in Montgomery form, given in Eq. 3, and involves elliptic curve operations in which the elements are reduced modulo N. Since N is not a

prime, E over Z_N is not really an elliptic curve, but we can still do point additions and doublings as if Z_N were a field.

ECM: Algorithm

The Elliptic Curve Method (ECM) was originally proposed by Lenstra [9] and subsequently extended by Brent [8] and Montgomery [6]. The original part of the algorithm proposed by Lenstra is typically referred to as Phase 1 (or Stage 1), and the extension by Brent and Montgomery is called Phase 2 (or Stage 2). The pseudo-code of both phases is given below as Algorithm 1.

Algorithm 1: ECM Algorithm
Require: N: composite number to be factored, E: elliptic curve, P0: initial point, B1, B2: bounds for Phase 1 and Phase 2 respectively, B1 <= B2.
Ensure: q: factor of N, 1 < q <= N, or FAIL
Phase 1:
1. k <- product of all prime powers p^e <= B1
2. Q0 <- k P0
3. q <- gcd(Z_Q0, N)
4. If q > 1 then
5.   Return q
6. Else
7.   Go to Phase 2

8. End if
Phase 2:
9.  d <- 1
10. For each prime p, B1 < p <= B2, do
11.   (X_pQ0, Z_pQ0) <- p Q0
12.   d <- d * Z_pQ0 mod N
13. End for
14. q <- gcd(d, N)
15. If q > 1 then
16.   Return q
17. Else
18.   Return FAIL
19. End if

Let q be an unknown prime factor of N. Then the order of the curve modulo q, i.e., the number of points on the curve with operations performed modulo q, might be a smooth number that divides k. In that case, we have k = m #E(Z_q) for some integer m. For any point P belonging to the curve, #E(Z_q) P = O, and therefore k P0 = O in E(Z_q). Thus Z_kP0 = 0 (mod q), and the unknown factor q of N can be recovered by taking gcd(Z_kP0, N).

Montgomery [6] [10] and Brent [8] independently suggested a continuation of Phase 1, to be used if one has gcd(Z_kP0, N) = 1. Their ideas utilize the fact that even if one has gcd(Z_kP0, N) = 1, the value of k might miss just one large prime divisor p of the order of the curve. In that case, one only

needs to compute the scalar multiplication p Q0 = p (k P0) to obtain the identity modulo q. A second bound, B2, restricts the size of the possible values of p.

Let M(N) be the cost of one multiplication modulo N. Then Phase 1 of ECM finds a factor q of N with the conjectured time complexity O(e^((sqrt(2) + o(1)) sqrt(ln q ln ln q)) M(N)) [9]. Phase 2 speeds up Lenstra's original method by a factor that is absorbed in the o(1) term of the complexity, but is significant for small and medium-size factors.

Operations on an Elliptic Curve

The hierarchy of major operations used in the ECM algorithm is shown in Figure 23. Scalar multiplication, kP, is a basic elliptic curve operation used in ECM. It is also a fundamental operation of the majority of elliptic curve cryptosystems [11], and it has therefore been studied extensively in the past from the point of view of efficient implementation in software and hardware.

Top Level: Scalar Multiplication

Scalar multiplication is defined as a repeated addition, kP = P + P + ... + P (k times), where k is an integer and P is a point on an elliptic curve. An efficient algorithm for computing scalar multiplication was proposed by Montgomery [6] in 1987, and is known as the Montgomery Ladder Algorithm. This algorithm is applicable to any curve, and is independent of the point representation, i.e., it can be

executed in both affine and projective coordinates. However, it is especially useful when an elliptic curve is expressed in Montgomery form (see Eq. 3), in projective coordinates. In this case, all intermediate computations can be executed using only the X and Z coordinates, and the Y-coordinate of the result can be retrieved (if needed) from the X and Z coordinates of the final point. In the ECM method, the Y-coordinate of the result is not needed, so this final computation is unnecessary. As a result, we denote the starting point by P = (X_P, Z_P), the intermediate points by Q0 = (X_Q0, Z_Q0) and Q1 = (X_Q1, Z_Q1), and the final point by kP = (X_kP, Z_kP). The pseudocode of the Montgomery Ladder Algorithm is shown below as Algorithm 2.

Algorithm 2: Montgomery Ladder Algorithm
Require: P on E, k an n-bit positive integer with k_{n-1} = 1
Ensure: kP
1. Q0 <- P, Q1 <- 2P
2. For i = n-2 downto 0 do
3.   If k_i = 1 then
4.     Q0 <- Q0 + Q1, Q1 <- 2 Q1
5.   Else
6.     Q1 <- Q0 + Q1, Q0 <- 2 Q0
7.   End if
8. End for
9. Return Q0

Figure 23: Hierarchy of Elliptic Curve Operations (top level: scalar multiplication; medium level: point addition and point doubling; low level: modular multiplication, modular addition and modular subtraction)

Medium Level: Point Addition and Doubling

The Montgomery ladder algorithm's two basic steps, point addition and point doubling, are defined in detail in Algorithm 3. The algorithm is constructed in such a way that the difference between the intermediate points Q1 and Q0, Q1 - Q0, is always constant, and equal to the value of the initial point P, as shown graphically in Figure 24. Therefore, X_{Q1-Q0} and Z_{Q1-Q0} in the formulas of Algorithm 3 can be replaced by X_P and Z_P, respectively. A careful analysis of the formulas in Algorithm 3 indicates that point addition requires 6

multiplications, and point doubling 5 multiplications. Therefore, a total of 11 multiplications are required in each step of the Montgomery ladder algorithm. In Phase 1 of ECM, the initial point, P, can be chosen arbitrarily. Choosing Z_P = 1 implies Z_{Q1-Q0} = 1 throughout the entire algorithm, and reduces the total number of multiplications from 11 to 10 per step of the algorithm, independently of the i-th bit of k. This optimization is not possible in Phase 2, where the initial point is the result of the computations in Phase 1, and thus cannot be chosen arbitrarily.

Algorithm 3: Addition and Doubling using the Montgomery's Form of Elliptic Curve
Require: Q0 = (X_Q0, Z_Q0), Q1 = (X_Q1, Z_Q1), the difference Q1 - Q0 = (X_{Q1-Q0}, Z_{Q1-Q0}), and a24 = (a + 2)/4, where a is a parameter of the curve in Eq. 3
Ensure: Q0 + Q1 = (X_{Q0+Q1}, Z_{Q0+Q1}), 2 Q1 = (X_{2Q1}, Z_{2Q1})
Addition:
  X_{Q0+Q1} = Z_{Q1-Q0} [ (X_Q0 - Z_Q0)(X_Q1 + Z_Q1) + (X_Q0 + Z_Q0)(X_Q1 - Z_Q1) ]^2
  Z_{Q0+Q1} = X_{Q1-Q0} [ (X_Q0 - Z_Q0)(X_Q1 + Z_Q1) - (X_Q0 + Z_Q0)(X_Q1 - Z_Q1) ]^2
Doubling:
  4 X_Q1 Z_Q1 = (X_Q1 + Z_Q1)^2 - (X_Q1 - Z_Q1)^2
  X_{2Q1} = (X_Q1 + Z_Q1)^2 (X_Q1 - Z_Q1)^2
  Z_{2Q1} = (4 X_Q1 Z_Q1) [ (X_Q1 - Z_Q1)^2 + a24 (4 X_Q1 Z_Q1) ]

Low Level: Montgomery Multiplication

Let M be an odd integer. In many cryptosystems, such as RSA, computing a b mod M is a crucial operation. The reduction modulo M is a more time-consuming step than the multiplication without reduction. Montgomery [12]

introduced a method for calculating products without the costly reduction, known as Montgomery multiplication. Montgomery multiplication of A and B, MP(A, B), is defined as A·B·R^(-1) mod M, for some fixed integer R.

[Figure 24: Graphical illustration of the Montgomery Ladder Algorithm, for the case of k = 11 — the intermediate pairs (Q0, Q1) are always of the form (mP, (m+1)P), i.e., (P, 2P), (2P, 3P), (3P, 4P), ..., (15P, 16P), so their difference remains equal to P]

Since Montgomery multiplication is not an ordinary multiplication, there is a process of conversion between the ordinary domain (with ordinary multiplication) and the Montgomery domain. The conversion between the ordinary domain and the

Montgomery domain is given by the relation A <-> A' = A·R mod M, and the corresponding diagram is shown below in Table 3.

Table 3: Conversion between the ordinary domain and the Montgomery domain

  Ordinary Domain    Montgomery Domain
  A                  A' = A·R mod M
  B                  B' = B·R mod M
  A·B                (A·B)' = A·B·R mod M

The table shows that the conversion is compatible with multiplication in each domain, since

  MP(A', B') = A'·B'·R^(-1) = (A·R)·(B·R)·R^(-1) = A·B·R = (A·B)' (mod M)      (Eq. 4)

The conversion between the domains can be done using the same Montgomery operation, in particular A' = MP(A, R^2 mod M) and A = MP(A', 1), where R^2 mod M can be pre-computed. Despite the initial conversion cost, if we do many Montgomery multiplications followed by an inverse conversion, as in RSA, we obtain an

advantage over ordinary multiplication. In fact, in the ECM method, the inverse conversion is not necessary, because gcd(A·R mod M, M) = gcd(A, M) for an arbitrary A and odd M.

Algorithm 4: Radix-2 Montgomery Multiplication
Require: M odd, M < 2^n; A = sum_{i=0}^{n+1} a_i·2^i, with A, B < 2M; R = 2^(n+2)
Ensure: S = MP(A, B) = A·B·R^(-1) mod M, with 0 <= S < 2M
1. S = 0
2. For i = 0 to n+1 do
3.   q_i = (S + a_i·B) mod 2
4.   S = (S + a_i·B + q_i·M) / 2
5. End for
6. Return S

Algorithm 4 shows the pseudo-code for radix-2 Montgomery multiplication, where we choose R = 2^(n+2). It should be mentioned that this is slightly different from R = 2^n, which Montgomery [12] originally used. This modified algorithm keeps all the inputs and the output in the same range, i.e., [0, 2M). Therefore, it is possible to apply Algorithm 4 repeatedly without any reduction, unlike the original algorithm [12], where one has to perform a reduction at the end of the algorithm to bring the output value into the same range as the input values.
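A bit-serial Python model of Algorithm 4 (illustrative only — in the thesis this is a hardware unit, and the parameters below are arbitrary small values) can be used to check both properties claimed above: the result equals A·B·R^(-1) mod M for R = 2^(n+2), and the inputs and output all stay in [0, 2M), so calls can be chained without any extra reduction:

```python
def mont_mul_radix2(a, b, m, n):
    """Radix-2 Montgomery multiplication, Algorithm 4.

    m odd with m < 2^n; inputs a, b < 2m; R = 2^(n+2).
    Returns s = a*b*R^-1 mod m with 0 <= s < 2m.
    """
    s = 0
    for i in range(n + 2):          # scan n+2 bits of a (a < 2m < 2^(n+1))
        ai = (a >> i) & 1
        s = s + ai * b
        if s & 1:                   # q_i = (S + a_i*b) mod 2
            s += m                  # adding m makes s even without changing s mod m
        s >>= 1                     # exact division by 2
    return s

m, n = 999961, 20                   # odd m < 2^20, illustrative parameters
R = 1 << (n + 2)
a, b = 2 * m - 5, 1234567           # both inputs in [0, 2m)
s = mont_mul_radix2(a, b, m, n)
assert s < 2 * m                    # output stays in the same range as the inputs
assert s % m == a * b * pow(R, -1, m) % m
```

The bound follows from S_final = (A·B + q·M)/R < 4M^2/R + M < 2M, since R = 2^(n+2) > 4M.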

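The domain-conversion identities of Table 3 and Eq. 4 can likewise be checked numerically. In this small sketch (illustrative parameters, with MP computed directly from its definition rather than bit-serially), conversion into the Montgomery domain uses the pre-computed constant R^2 mod M, and conversion back multiplies by 1:

```python
M = 1000003                     # odd modulus, illustrative
R = 1 << 20                     # R = 2^20 > M, gcd(R, M) = 1 since M is odd
R_INV = pow(R, -1, M)           # R^-1 mod M
R2 = R * R % M                  # pre-computed constant R^2 mod M

def mp(x, y):
    """Montgomery product MP(x, y) = x * y * R^-1 mod M (by definition)."""
    return x * y * R_INV % M

a, b = 123456, 654321
a_m = mp(a, R2)                 # into the Montgomery domain: A' = A*R mod M
b_m = mp(b, R2)
assert a_m == a * R % M
# Multiplication is compatible within the domain (Eq. 4):
assert mp(a_m, b_m) == a * b * R % M
# Back-conversion: MP(A', 1) = A
assert mp(a_m, 1) == a
```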
General Algorithm: Number Field Sieve (NFS)

NFS: Overview

The fastest known method for factoring large integers is the Number Field Sieve (NFS), invented by Pollard in 1991 [13]. It has since been improved substantially and developed from its initial special form (which was only used to factor numbers close to perfect powers, such as Fermat numbers) into a general-purpose factoring algorithm. Using the Number Field Sieve, an RSA modulus of 663 bits was successfully factored by Bahr, Boehm, Franke and Kleinjung in May 2005 [14]. The cost of implementing the Number Field Sieve, and the time it takes for such an implementation to factor a b-bit RSA modulus, provide an upper bound on the security of b-bit RSA. In order to factor a big integer N, such as an RSA modulus, NFS requires the factorization of a large number of moderately sized integers created during run time, perhaps of size 200 bits [15] [16]. Such numbers can be routinely factored in very little time. However, because an estimated 10^10 such factorizations are necessary for NFS to succeed in factoring a 1024-bit RSA modulus, it is of crucial importance to perform these auxiliary factorizations as fast and efficiently as possible. Even tiny improvements, once multiplied by 10^10 factorizations, would make a significant difference in how big an RSA modulus can be factored.

ECM in NFS

Let us look at the steps required in NFS, shown in Figure 25. Those steps include Polynomial Selection, Relation Collection, Linear Algebra, and Square Root. As mentioned before, NFS requires the factorization of a large number of moderately sized integers created during run time, perhaps of size 200 bits. In reviewing existing algorithms that can be used to factor medium-size numbers, the only practically useful algorithms are probabilistic (Monte Carlo) methods. There is no guarantee that a probabilistic algorithm will terminate successfully, but the probability of a successful outcome is large enough that the expected time needed to factor a given number is considerably lower than that of any deterministic algorithm. In particular, all known deterministic factoring methods have exponential asymptotic run time. In practice, they are, at best, used to remove the smallest prime factors from the number to be factored. Trial division by, at most, a few hundred small primes, with asymptotic run time O(p·M(N)) (with M(N) covering the cost of integer arithmetic as N increases), may be considered as a first step in factoring random numbers. The fastest known deterministic factoring method is due to Pollard and Strassen and has an asymptotic run time of roughly O(p^(1/2)·M(N)), up to logarithmic factors, but it is not recommended in practical applications because it is easily surpassed by simple probabilistic methods.
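As a sketch of the "first step" mentioned above, a few lines of Python suffice for trial division by small candidates (illustrative only; the bound and the interface are arbitrary choices, not the thesis's code):

```python
def trial_division(n, bound=1000):
    """Strip the prime factors of n below `bound`.

    Returns (factors, cofactor); the cofactor is what remains for
    heavier methods (rho, p-1, ECM) to work on.
    """
    factors = []
    d = 2
    while d * d <= n and d < bound:
        while n % d == 0:           # composite d never divides here, since its
            factors.append(d)       # prime factors were already stripped
            n //= d
        d += 1
    return factors, n

# 48432 = 2^4 * 3 * 1009; the small primes are stripped, 1009 is left over
assert trial_division(48432) == ([2, 2, 2, 2, 3], 1009)
```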

[Figure 25: Required Steps in NFS — Polynomial Selection; Relation Collection (sieving of 200- and 350-bit numbers, followed by norm factoring using ECM, the p-1 method, and Pollard's rho method); Linear Algebra; Square Root]

Three other probabilistic factoring methods are also of exponential run time, but with a much smaller overhead than their sub-exponential colleagues, so that within a certain range they are efficient factoring tools. These are Pollard's p-1 method, the similar p+1 method due to Williams, and Pollard's rho method. Finally, the Elliptic Curve Method (ECM), which is the main subject of this thesis, is a sub-exponential factoring algorithm, with an expected run time of O(e^((sqrt(2)+o(1))·sqrt(ln p · ln ln p)) · M(N)), where p is the factor we aim to find, and M(N) denotes the cost of multiplication modulo N. ECM is the

best method to perform the kind of factorizations needed by NFS, for integers in the 200-bit range, with prime factors of up to about 40 bits [17] [18].
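For comparison, the simplest of the probabilistic methods mentioned above, Pollard's rho method, can be sketched in a few lines of Python (Floyd cycle-finding with f(x) = x^2 + c; an illustrative sketch, not part of the thesis implementation):

```python
from math import gcd

def pollard_rho(n, c=1, x0=2):
    """Pollard's rho method: return a non-trivial factor of a composite odd n,
    or None if this (c, x0) choice fails (in which case one retries with a
    different c) -- the probabilistic behavior discussed in the text."""
    x, y, d = x0, x0, 1
    f = lambda v: (v * v + c) % n
    while d == 1:
        x = f(x)                     # tortoise: one step
        y = f(f(y))                  # hare: two steps
        d = gcd(abs(x - y), n)
    return d if d != n else None

d = pollard_rho(8051)                # 8051 = 83 * 97
assert d is not None and 8051 % d == 0 and 1 < d < 8051
```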

Chapter 5 - Previous Work on Implementing Applications in High Level Languages for Reconfigurable Hardware

There are several previous developments related to implementing applications in HLLs for reconfigurable hardware. The first one is the implementation of SRC libraries of hardware macros. Currently available libraries include:

Vendor libraries of hardware macros developed and distributed by SRC Inc., including
o basic integer and floating-point arithmetic
o digital signal processing

User libraries of hardware macros developed by GWU/GMU/USC, including
o Secret-key cipher encryption & breaking
o Binary Galois Field arithmetic (polynomial basis & normal basis representation)
o Elliptic Curve Arithmetic
o Long integer modular arithmetic (RSA)
o Sorting

o Image processing
o Bioinformatics

These macros were implemented as libraries, meaning that developers can use them to speed up the development process. They are almost entirely implemented in HDLs (VHDL and Verilog) as hardware cores. Developers have to write their HLL code to call these macros. Some other work includes:

N. Nguyen, K. Gaj, D. Caliga, T. El-Ghazawi, "Implementation of Elliptic Curve Cryptosystems on a Reconfigurable Computer," Proc. IEEE International Conference on Field-Programmable Technology, FPT 2003, Tokyo, Japan, Dec. 2003, pp. [26], and S. Bajracharya, C. Shu, K. Gaj, T. El-Ghazawi, "Implementation of Elliptic Curve Cryptosystems over GF(2^n) in Optimal Normal Basis on a Reconfigurable Computer," 14th International Conference on Field Programmable Logic and Applications, FPL 2004, Antwerp, Belgium, Aug 30 - Sept 1, 2004, pp. [25], both of which contain an analysis of various Carte-C / VHDL partitioning schemes for elliptic curve cryptosystems over binary fields.

Vlad Kindratenko (NCSA), "Accelerating Scientific Applications with Reconfigurable Computing," in which he implemented three applications (Image Distance Transform, Nanoscale Molecular Dynamics, and Two-Point Angular

Correlation Function) in SRC Carte-C, and moved the traditional HDL code gradually into Carte-C to find the optimal profile. This work is probably the closest match to the work done in this Thesis.

Viktor K. Prasanna, Gerald R. Morris, "Sparse Matrix Computations on Reconfigurable Computers" (Reconfigurable Computer, Mar 2007), in which two well-known double-precision floating-point sparse matrix iterative linear-equation solvers were implemented in VHDL, and ported to SRC Carte-C as hardware cores.

Esam El-Araby, Mohamed Taher, Mohamed Abouellail, Tarek El-Ghazawi, and Gregory B. Newby, "Comparative Analysis of High Level Programming for Reconfigurable Computers: Methodology and Empirical Study," in which the GWU group implemented DES in four different languages: Verilog, Impulse-C, DSPLogic and Mitrion-C.

K. Gaj, S. Kwon, P. Baier, P. Kohlbrenner, H. Le, M. Khaleeluddin and R. Bachimanchi, "Implementing the Elliptic Curve Method of Factoring in Reconfigurable Hardware," in which the entire ECM was implemented using VHDL and ported to SRC Carte for verification.

All of the applications, except for the previous work from GMU and the work of Vlad Kindratenko from NCSA, were implemented either entirely in an HLL or entirely in an HDL, in contrast to the work done in this Thesis, which describes different code partitioning schemes

involving two HLLs, SRC Carte-C and Celoxica Handel-C, and one hardware description language, VHDL.

Chapter 6 - Implementation of ECM using SRC Carte-C

Partitioning the Code between High Level Language (HLL) and Hardware Description Language (HDL)

The first step in porting an application to a system utilizing the MAP hardware is to identify some portion of the application such that, when that portion is compiled for the MAP, overall performance will improve. Loops are often good candidates for execution on the MAP. Loop nests and their associated loop bodies, which can be pipelined, have shown execution speed-up. Once a body of code has been identified for MAP execution, it must be placed in its own separate function, or split into functions for inlining, and be the first function in the file that contains these functions. SRC program partitioning is visualized in Figure 26. Programming in high-level languages is always easier than in hardware description languages; however, the efficiency of the code is also reduced. Data partitioning is important as well. The MAP function contains the computational portion of the code to be executed on the MAP. This function must also be modified to manage the data movement to and from the MAP's On-Board Memory (OBM). Data movement calls must be inserted to manage data movement between

either System Common Memory (SCM) or Global Common Memory (GCM) and OBM. A general architecture of ECM in SRC is conceptualized below in Figure 27.

[Figure 26: SRC Program Partitioning — the µP system runs the C function for the µP; the FPGA system runs the C function for the MAP (HLL) and the VHDL macro (HDL)]

[Figure 27: General architecture of ECM. Notation: MEM - memory; M1, M2 - multipliers 1 and 2; A/S - adder/subtractor]

There are three possible coding partitions. The first one is to implement the Control Unit, Memory, and M1-M2-A/S in VHDL and build them as a hardware macro, while the Data Communication Module is implemented in HLL (Figure 28). The second scheme is to implement M1-M2-A/S in VHDL and build them as a hardware macro, while the Data Communication Module, Control Unit, and Memory are moved into HLL (Figure 29).

[Figure 28: Coding Partition Scheme 1. Notation: MEM - memory; M1, M2 - multipliers 1 and 2; A/S - adder/subtractor]

[Figure 29: Coding Partition Scheme 2. Notation: MEM - memory; M1, M2 - multipliers 1 and 2; A/S - adder/subtractor]

The last scheme is to move everything into HLL except for the main program in C/C++. This scheme would be ideal for software engineers who do not want to program in hardware description languages (Figure 30).

[Figure 30: Coding Partition Scheme 3. Notation: MEM - memory; M1, M2 - multipliers 1 and 2; A/S - adder/subtractor]

Figure 31 shows the four schemes to be implemented in this Thesis.

Coding Partition 1: Data Communication Module in Carte-C

Overview

In this scheme, the entire ECM unit was implemented in VHDL and built as an SRC hardware macro. The communication module was written in MAP-C. This module serves as a bridge to transfer data back and forth between the FPGA side and the microprocessor side. All pre-computations are performed by the microprocessor, and the resulting data is transferred to

the FPGA through this communication module. Once the main execution is done in the FPGA, the result is transferred back to the microprocessor side for post-computation.

[Figure 31: Four schemes to be implemented in this Thesis — SRC Carte-C: Scheme 1 (entire ECM module as an HDL core), Scheme 2 (control unit in Carte-C); Celoxica Handel-C: Scheme 2 (control unit in Handel-C), Scheme 3 (entire ECM module in Handel-C)]

[Figure 32: SRC Scheme 1 - Data Communication Module in MAP-C. Notation: DCM - Data Communication Module]

Architecture

The ECM macro was implemented entirely in VHDL. It is an external macro with variable latency. The required signals include reset, clk, start, done, Data_in, and Data_out. The reset signal clears all variables; clk provides the clock signal for the entire circuit; start controls when the macro is executed; done is asserted to inform the DCM upon the completion of execution. Data_in and Data_out are used to transfer data in and out, respectively (Figure 33).

As mentioned before, the data communication module (DCM) was implemented in MAP-C and serves as a bridge between the main program on the microprocessor side and the SRC macro in the MAP. The DCM takes care of moving data in and out of SCM using DMA calls. The DCM then calls the ECM macro to input data and start the main execution. Once the main execution is done, the DCM reads the result from the ECM macro and transfers it back to the main program on the microprocessor side for post-computation.

Since the current version of MAP-C does not support data streaming between MAP-C and hardware macros, data was transferred into and out of the ECM macro one word at a time. This work-around results in a noticeable overhead. Data transfer between the microprocessor side and the hardware side also takes a long time. Due to this limitation, one should transfer all necessary data in at one time, perform as much of the main computation as possible, and finally transfer the result out, to reduce the overhead.

[Figure 33: Top level of the ECM unit in SRC Scheme 1, showing the reset, clk, start, Data_in, Data_out, done, and Control signals]

A main program was written in C to perform all pre-computations (curve parameters) and post-computations (greatest common divisor, gcd). This program was optimized using multi-threading techniques. Since the pre-computation and post-computation times are small in comparison to the main execution time of ECM in hardware, the idea is to interleave the pre- and post-computations with the main execution time. By doing that, the pre-computation and post-computation times are effectively hidden, and thus the overhead is reduced. Figure 35 shows the overall timing results before and after optimization.
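The interleaving described above is a general software-pipelining pattern. The sketch below models it with plain Python threads; every function name here (`precompute`, `run_on_fpga`, `postcompute`) is a hypothetical stand-in, since the real program uses C threads and the SRC MAP. Pre-computation of the next batch and gcd post-computation of the previous batch are overlapped with the (simulated) hardware run of the current batch:

```python
import threading
from math import gcd

def precompute(batch):
    """Stand-in (hypothetical) for per-batch curve-parameter generation."""
    return [(n, n % 7 + 2) for n in batch]

def run_on_fpga(params):
    """Stand-in (hypothetical) for the main ECM execution on the MAP/FPGA."""
    return [(n, x * n + 6) for n, x in params]

def postcompute(results):
    """Stand-in for the final gcd post-computation on the microprocessor."""
    return [gcd(z, n) for n, z in results]

def run_sequential(batches):
    return [postcompute(run_on_fpga(precompute(b))) for b in batches]

def run_pipelined(batches):
    """Hide pre/post-computation behind the 'hardware' run of each batch."""
    outputs, prev, params = [], None, precompute(batches[0])
    for i in range(len(batches)):
        box = {}
        t = threading.Thread(target=lambda p=params: box.update(r=run_on_fpga(p)))
        t.start()                                      # "hardware" busy in background
        if i + 1 < len(batches):
            next_params = precompute(batches[i + 1])   # overlap: next pre-computation
        if prev is not None:
            outputs.append(postcompute(prev))          # overlap: previous gcd step
        t.join()
        prev = box['r']
        params = next_params if i + 1 < len(batches) else None
    outputs.append(postcompute(prev))                  # drain the last batch
    return outputs

batches = [[48432, 91, 143], [221, 1001, 8051]]
assert run_pipelined(batches) == run_sequential(batches)
```

The pipelined schedule produces exactly the sequential results; only the host-side work is moved into the shadow of the hardware run, which is why the measured pre/post-computation overhead nearly vanishes in Figure 35.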

[Figure 34: Block diagram of the top-level unit in SRC Scheme 1. Notation: MEM - memory; M1, M2 - multipliers 1 and 2; A/S - adder/subtractor]

Results

Table 4: Execution time of Phase 1 and Phase 2 using the SRC-6 reconfigurable computer holding 9 ECM units, for 198-bit numbers N; B1 = 960 (which implies the number of bits of k, kN = 1375), B2 = 57000, and D = 210, in SRC Scheme 1

                                                            Phase 1       Phase 2
One-time pre-computations
  Pre-computations by the microprocessor
  (common to all integers to be factored)
  Generation of the 9 integers to be factored: 7902 μs

ECM computations for one set of 9 integers to be factored
  Pre-computations by the microprocessor (specific
  to a given set of integers to be factored)                1,368 μs      0 μs
  Transfer of data-in (from the microprocessor memory
  to the on-board memory of the FPGA-based processor)          51 μs      0 μs
  Calculations performed by the FPGA-based processor       17,136 μs      19,152 μs
  Transfer of data-out (from the on-board memory of the
  FPGA-based processor to the microprocessor memory)            0 μs      19 μs
  Function call overhead (overhead associated with the
  transfer of control between the microprocessor
  and the FPGA)                                               252 μs (both phases)
  Post-computations by the microprocessor
  (final GCD computation)                                       0 μs      81 μs
  Total end-to-end execution time                          18,681 μs      19,378 μs
                                                           (both phases: 38,059 μs)

Percentage of the total end-to-end execution time
  Pre-computations and post-computations
  by the microprocessor                                     7.32 %        0.42 %    (both phases: 3.81 %)
  Function call and data transfer overheads                 0.95 %        0.75 %    (both phases: 0.85 %)
  FPGA board computations                                  91.73 %       98.83 %    (both phases: 95.35 %)

Table 5: Place & Route Summary in SRC Scheme 1

  Number of Slice Flip Flops     34,071 out of 67,584    50%
  Number of 4 input LUTs         41,478 out of 67,584    61%
  Number of occupied Slices      30,996 out of 33,792    91%
  Number of Block RAMs               12 out of 144        8%
  Frequency                      99 MHz
  Execution Time (both Phases)   38 ms

Table 6: Number of lines of code in SRC Scheme 1

  VHDL                           3975 lines
  MAP-C                           106 lines
  Estimated time to complete     3 students x 1 semester

[Figure 35: Overall results of execution time in SRC Scheme 1]


The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

Design Methodologies. Full-Custom Design

Design Methodologies. Full-Custom Design Design Methodologies Design styles Full-custom design Standard-cell design Programmable logic Gate arrays and field-programmable gate arrays (FPGAs) Sea of gates System-on-a-chip (embedded cores) Design

More information

Reconfigurable Computing - (RC)

Reconfigurable Computing - (RC) Reconfigurable Computing - (RC) Yogindra S Abhyankar Hardware Technology Development Group, C-DAC Outline Motivation Architecture Applications Performance Summary HPC Fastest Growing Sector HPC, the massive

More information

Early Models in Silicon with SystemC synthesis

Early Models in Silicon with SystemC synthesis Early Models in Silicon with SystemC synthesis Agility Compiler summary C-based design & synthesis for SystemC Pure, standard compliant SystemC/ C++ Most widely used C-synthesis technology Structural SystemC

More information

FPGA Based Digital Design Using Verilog HDL

FPGA Based Digital Design Using Verilog HDL FPGA Based Digital Design Using Course Designed by: IRFAN FAISAL MIR ( Verilog / FPGA Designer ) irfanfaisalmir@yahoo.com * Organized by Electronics Division Integrated Circuits Uses for digital IC technology

More information

Introduction to Field Programmable Gate Arrays

Introduction to Field Programmable Gate Arrays Introduction to Field Programmable Gate Arrays Lecture 1/3 CERN Accelerator School on Digital Signal Processing Sigtuna, Sweden, 31 May 9 June 2007 Javier Serrano, CERN AB-CO-HT Outline Historical introduction.

More information

Field Program mable Gate Arrays

Field Program mable Gate Arrays Field Program mable Gate Arrays M andakini Patil E H E P g r o u p D H E P T I F R SERC school NISER, Bhubaneshwar Nov 7-27 2017 Outline Digital electronics Short history of programmable logic devices

More information

FPGAs: FAST TRACK TO DSP

FPGAs: FAST TRACK TO DSP FPGAs: FAST TRACK TO DSP Revised February 2009 ABSRACT: Given the prevalence of digital signal processing in a variety of industry segments, several implementation solutions are available depending on

More information

Fast implementation and fair comparison of the final candidates for Advanced Encryption Standard using Field Programmable Gate Arrays

Fast implementation and fair comparison of the final candidates for Advanced Encryption Standard using Field Programmable Gate Arrays Kris Gaj and Pawel Chodowiec Electrical and Computer Engineering George Mason University Fast implementation and fair comparison of the final candidates for Advanced Encryption Standard using Field Programmable

More information

8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments

8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments 8. Best Practices for Incremental Compilation Partitions and Floorplan Assignments QII51017-9.0.0 Introduction The Quartus II incremental compilation feature allows you to partition a design, compile partitions

More information

S2C K7 Prodigy Logic Module Series

S2C K7 Prodigy Logic Module Series S2C K7 Prodigy Logic Module Series Low-Cost Fifth Generation Rapid FPGA-based Prototyping Hardware The S2C K7 Prodigy Logic Module is equipped with one Xilinx Kintex-7 XC7K410T or XC7K325T FPGA device

More information

Integrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective

Integrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective Integrating FPGAs in High Performance Computing A System, Architecture, and Implementation Perspective Nathan Woods XtremeData FPGA 2007 Outline Background Problem Statement Possible Solutions Description

More information

The Cray XD1. Technical Overview. Amar Shan, Senior Product Marketing Manager. Cray XD1. Cray Proprietary

The Cray XD1. Technical Overview. Amar Shan, Senior Product Marketing Manager. Cray XD1. Cray Proprietary The Cray XD1 Cray XD1 Technical Overview Amar Shan, Senior Product Marketing Manager Cray Proprietary The Cray XD1 Cray XD1 Built for price performance 30 times interconnect performance 2 times the density

More information

FPGA Provides Speedy Data Compression for Hyperspectral Imagery

FPGA Provides Speedy Data Compression for Hyperspectral Imagery FPGA Provides Speedy Data Compression for Hyperspectral Imagery Engineers implement the Fast Lossless compression algorithm on a Virtex-5 FPGA; this implementation provides the ability to keep up with

More information

Developing a Data Driven System for Computational Neuroscience

Developing a Data Driven System for Computational Neuroscience Developing a Data Driven System for Computational Neuroscience Ross Snider and Yongming Zhu Montana State University, Bozeman MT 59717, USA Abstract. A data driven system implies the need to integrate

More information

Employing Multi-FPGA Debug Techniques

Employing Multi-FPGA Debug Techniques Employing Multi-FPGA Debug Techniques White Paper Traditional FPGA Debugging Methods Debugging in FPGAs has been difficult since day one. Unlike simulation where designers can see any signal at any time,

More information

High-Performance Integer Factoring with Reconfigurable Devices

High-Performance Integer Factoring with Reconfigurable Devices FPL 2010, Milan, August 31st September 2nd, 2010 High-Performance Integer Factoring with Reconfigurable Devices Ralf Zimmermann, Tim Güneysu, Christof Paar Horst Görtz Institute for IT-Security Ruhr-University

More information

RTL Coding General Concepts

RTL Coding General Concepts RTL Coding General Concepts Typical Digital System 2 Components of a Digital System Printed circuit board (PCB) Embedded d software microprocessor microcontroller digital signal processor (DSP) ASIC Programmable

More information

Computer Organization

Computer Organization INF 101 Fundamental Information Technology Computer Organization Assistant Prof. Dr. Turgay ĐBRĐKÇĐ Course slides are adapted from slides provided by Addison-Wesley Computing Fundamentals of Information

More information

DESIGN STRATEGIES & TOOLS UTILIZED

DESIGN STRATEGIES & TOOLS UTILIZED CHAPTER 7 DESIGN STRATEGIES & TOOLS UTILIZED 7-1. Field Programmable Gate Array The internal architecture of an FPGA consist of several uncommitted logic blocks in which the design is to be encoded. The

More information

INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS)

INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS) INTRODUCTION TO FIELD PROGRAMMABLE GATE ARRAYS (FPGAS) Bill Jason P. Tomas Dept. of Electrical and Computer Engineering University of Nevada Las Vegas FIELD PROGRAMMABLE ARRAYS Dominant digital design

More information

FABRICATION TECHNOLOGIES

FABRICATION TECHNOLOGIES FABRICATION TECHNOLOGIES DSP Processor Design Approaches Full custom Standard cell** higher performance lower energy (power) lower per-part cost Gate array* FPGA* Programmable DSP Programmable general

More information

ALTERA FPGAs Architecture & Design

ALTERA FPGAs Architecture & Design ALTERA FPGAs Architecture & Design Course Description This course provides all theoretical and practical know-how to design programmable devices of ALTERA with QUARTUS-II design software. The course combines

More information

Memory and Programmable Logic

Memory and Programmable Logic Memory and Programmable Logic Memory units allow us to store and/or retrieve information Essentially look-up tables Good for storing data, not for function implementation Programmable logic device (PLD),

More information

System-on Solution from Altera and Xilinx

System-on Solution from Altera and Xilinx System-on on-a-programmable-chip Solution from Altera and Xilinx Xun Yang VLSI CAD Lab, Computer Science Department, UCLA FPGAs with Embedded Microprocessors Combination of embedded processors and programmable

More information

FPGA architecture and design technology

FPGA architecture and design technology CE 435 Embedded Systems Spring 2017 FPGA architecture and design technology Nikos Bellas Computer and Communications Engineering Department University of Thessaly 1 FPGA fabric A generic island-style FPGA

More information

FPGA Solutions: Modular Architecture for Peak Performance

FPGA Solutions: Modular Architecture for Peak Performance FPGA Solutions: Modular Architecture for Peak Performance Real Time & Embedded Computing Conference Houston, TX June 17, 2004 Andy Reddig President & CTO andyr@tekmicro.com Agenda Company Overview FPGA

More information

An Introduction to Programmable Logic

An Introduction to Programmable Logic Outline An Introduction to Programmable Logic 3 November 24 Transistors Logic Gates CPLD Architectures FPGA Architectures Device Considerations Soft Core Processors Design Example Quiz Semiconductors Semiconductor

More information

H100 Series FPGA Application Accelerators

H100 Series FPGA Application Accelerators 2 H100 Series FPGA Application Accelerators Products in the H100 Series PCI-X Mainstream IBM EBlade H101-PCIXM» HPC solution for optimal price/performance» PCI-X form factor» Single Xilinx Virtex 4 FPGA

More information

IMPLICIT+EXPLICIT Architecture

IMPLICIT+EXPLICIT Architecture IMPLICIT+EXPLICIT Architecture Fortran Carte Programming Environment C Implicitly Controlled Device Dense logic device Typically fixed logic µp, DSP, ASIC, etc. Implicit Device Explicit Device Explicitly

More information

A Model for Experiment Setups on FPGA Development Boards

A Model for Experiment Setups on FPGA Development Boards Bachelor Informatica Informatica Universiteit van Amsterdam A Model for Experiment Setups on FPGA Development Boards Matthijs Bos 2016-08-17 Supervisors: A. van Inge & T. Walstra, University of Amsterdam

More information

Reconfigurable Computing. Design and Implementation. Chapter 4.1

Reconfigurable Computing. Design and Implementation. Chapter 4.1 Design and Implementation Chapter 4.1 Prof. Dr.-Ing. Jürgen Teich Lehrstuhl für Hardware-Software-Co-Design In System Integration System Integration Rapid Prototyping Reconfigurable devices (RD) are usually

More information

Cray events. ! Cray User Group (CUG): ! Cray Technical Workshop Europe:

Cray events. ! Cray User Group (CUG): ! Cray Technical Workshop Europe: Cray events! Cray User Group (CUG):! When: May 16-19, 2005! Where: Albuquerque, New Mexico - USA! Registration: reserved to CUG members! Web site: http://www.cug.org! Cray Technical Workshop Europe:! When:

More information

VHX - Xilinx - FPGA Programming in VHDL

VHX - Xilinx - FPGA Programming in VHDL Training Xilinx - FPGA Programming in VHDL: This course explains how to design with VHDL on Xilinx FPGAs using ISE Design Suite - Programming: Logique Programmable VHX - Xilinx - FPGA Programming in VHDL

More information

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) D.Udhayasheela, pg student [Communication system],dept.ofece,,as-salam engineering and technology, N.MageshwariAssistant Professor

More information

Spartan-3E FPGA Design Guide for prototyping and production environment

Spartan-3E FPGA Design Guide for prototyping and production environment 3ème conférence Internationale des énergies renouvelables CIER-2015 Proceedings of Engineering and Technology - PET Spartan-3E FPGA Design Guide for prototyping and production environment Mohammed BOUDJEMA

More information

5. ReAl Systems on Silicon

5. ReAl Systems on Silicon THE REAL COMPUTER ARCHITECTURE PRELIMINARY DESCRIPTION 69 5. ReAl Systems on Silicon Programmable and application-specific integrated circuits This chapter illustrates how resource arrays can be incorporated

More information

Lecture 2 Hardware Description Language (HDL): VHSIC HDL (VHDL)

Lecture 2 Hardware Description Language (HDL): VHSIC HDL (VHDL) Lecture 2 Hardware Description Language (HDL): VHSIC HDL (VHDL) Pinit Kumhom VLSI Laboratory Dept. of Electronic and Telecommunication Engineering (KMUTT) Faculty of Engineering King Mongkut s University

More information

Hardware Implementation of TRaX Architecture

Hardware Implementation of TRaX Architecture Hardware Implementation of TRaX Architecture Thesis Project Proposal Tim George I. Project Summery The hardware ray tracing group at the University of Utah has designed an architecture for rendering graphics

More information

Universal Serial Bus Host Interface on an FPGA

Universal Serial Bus Host Interface on an FPGA Universal Serial Bus Host Interface on an FPGA Application Note For many years, designers have yearned for a general-purpose, high-performance serial communication protocol. The RS-232 and its derivatives

More information

Using FPGAs in Supercomputing Reconfigurable Supercomputing

Using FPGAs in Supercomputing Reconfigurable Supercomputing Using FPGAs in Supercomputing Reconfigurable Supercomputing Why FPGAs? FPGAs are 10 100x faster than a modern Itanium or Opteron Performance gap is likely to grow further in the future Several major vendors

More information

Chapter 2. FPGA and Dynamic Reconfiguration ...

Chapter 2. FPGA and Dynamic Reconfiguration ... Chapter 2 FPGA and Dynamic Reconfiguration... This chapter will introduce a family of silicon devices, FPGAs exploring their architecture. This work is based on these particular devices. The chapter will

More information

Hardware Oriented Security

Hardware Oriented Security 1 / 20 Hardware Oriented Security SRC-7 Programming Basics and Pipelining Miaoqing Huang University of Arkansas Fall 2014 2 / 20 Outline Basics of SRC-7 Programming Pipelining 3 / 20 Framework of Program

More information

Reconfigurable Computing. Introduction

Reconfigurable Computing. Introduction Reconfigurable Computing Tony Givargis and Nikil Dutt Introduction! Reconfigurable computing, a new paradigm for system design Post fabrication software personalization for hardware computation Traditionally

More information

Actel s SX Family of FPGAs: A New Architecture for High-Performance Designs

Actel s SX Family of FPGAs: A New Architecture for High-Performance Designs Actel s SX Family of FPGAs: A New Architecture for High-Performance Designs A Technology Backgrounder Actel Corporation 955 East Arques Avenue Sunnyvale, California 94086 April 20, 1998 Page 2 Actel Corporation

More information

Product Obsolete/Under Obsolescence

Product Obsolete/Under Obsolescence APPLICATION NOTE Adapting ASIC Designs for Use with Spartan FPGAs XAPP119 July 20, 1998 (Version 1.0) Application Note by Kim Goldblatt Summary Spartan FPGAs are an exciting, new alternative for implementing

More information

ECE 485/585 Microprocessor System Design

ECE 485/585 Microprocessor System Design Microprocessor System Design Lecture 4: Memory Hierarchy Memory Taxonomy SRAM Basics Memory Organization DRAM Basics Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering

More information

Teaching Computer Architecture with FPGA Soft Processors

Teaching Computer Architecture with FPGA Soft Processors Teaching Computer Architecture with FPGA Soft Processors Dr. Andrew Strelzoff 1 Abstract Computer Architecture has traditionally been taught to Computer Science students using simulation. Students develop

More information

6.9. Communicating to the Outside World: Cluster Networking

6.9. Communicating to the Outside World: Cluster Networking 6.9 Communicating to the Outside World: Cluster Networking This online section describes the networking hardware and software used to connect the nodes of cluster together. As there are whole books and

More information

Chapter 1: Introduction. Operating System Concepts 9 th Edit9on

Chapter 1: Introduction. Operating System Concepts 9 th Edit9on Chapter 1: Introduction Operating System Concepts 9 th Edit9on Silberschatz, Galvin and Gagne 2013 Objectives To describe the basic organization of computer systems To provide a grand tour of the major

More information

ECC1 Core. Elliptic Curve Point Multiply and Verify Core. General Description. Key Features. Applications. Symbol

ECC1 Core. Elliptic Curve Point Multiply and Verify Core. General Description. Key Features. Applications. Symbol General Description Key Features Elliptic Curve Cryptography (ECC) is a public-key cryptographic technology that uses the mathematics of so called elliptic curves and it is a part of the Suite B of cryptographic

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Reconfigurable Computing. Design and implementation. Chapter 4.1

Reconfigurable Computing. Design and implementation. Chapter 4.1 Reconfigurable Computing Design and implementation Chapter 4.1 Prof. Dr.-Ing. Jürgen Teich Lehrstuhl für Hardware-Software Software-Co-Design Reconfigurable Computing In System Integration Reconfigurable

More information

Chapter 9: Integration of Full ASIP and its FPGA Implementation

Chapter 9: Integration of Full ASIP and its FPGA Implementation Chapter 9: Integration of Full ASIP and its FPGA Implementation 9.1 Introduction A top-level module has been created for the ASIP in VHDL in which all the blocks have been instantiated at the Register

More information