Ultra Low Latency Optical Networks


O. Liboiron-Ladouceur, B. A. Small, W. Lu, and K. Bergman
Department of Electrical Engineering, Columbia University, 500 West 120th Street, New York, NY 10027

J. S. Davis, C. Hawkins, J. Park, D. S. Wills, D. C. Keezer, and K. P. Martin
School of Electrical and Computer Engineering and Microelectronics Research Center, Georgia Tech, Atlanta, GA 30332

G. D. Hughes
Laboratory for Physical Sciences, College Park, MD 20742

Abstract

In many supercomputing applications, high data reference locality (HDL) allows hardware and software designers to reduce the impact of long data access latency through caching and migration techniques. Other applications (e.g., cryptography, data mining) exhibit low data reference locality (LDL), forcing system designers to pursue minimum data access latency using non-traditional techniques. This paper presents research on an ultra low latency optical network whose delay is on the order of the minimum time-of-flight limit on data access latency. Employed techniques include a non-blocking, bufferless topology; single pulse packets incorporating wavelength division multiplexing (WDM) for both routing header and payload data; and ultra-fast electro-optical interfaces built from commercially available components. Early results from simulations and prototype system components show concept feasibility and the potential to impact both LDL and HDL supercomputer applications of the future.

1. Introduction

Supercomputer designs have always recognized the importance of interconnect delay. The Cray-1 was built in a cylinder so that cross-machine wiring would require minimal length. As the physical scale of supercomputers has grown from less than a meter to tens of meters (the NEC Earth Simulator is 50 meters in diameter), techniques to minimize interconnect latency have been sought to reduce intrinsic communication cost and maximize opportunities to exploit parallelism. When application access patterns support explicit or automatic identification of high data reference locality (HDL), a range of techniques can be employed to reduce the performance impact of communication latency. Caching and migration techniques can reposition data to a node or cluster where access latency is small. However, these methods are ineffective when applications exhibit low data reference locality (LDL), since no effective data repositioning is possible; efforts to reposition data often result in increased access latency. Such applications are common in cryptography and data mining, where searches rely on indirect data accesses. To effectively execute LDL applications, there is no substitute for low latency memory access. Given the importance of low access latency, the interconnection network for a physically large system must be rethought. Even the best supercomputers today have access latencies dominated by buffer, routing, and interface delay rather than unavoidable time of flight. In the Earth Simulator, for instance, the internode MPI communication latency is 8.6 µs [1], whereas the processor cycle time is 2 ns. This 4,300x differential significantly limits how parallel computation can be exploited. Light in a vacuum can cross the 50 m diameter of the Earth Simulator processor core in 166 ns (83x the cycle time). While this fundamental limit is impossible to reach, the roughly 50-fold improvement in communication latency (8.6 µs versus 166 ns) would substantially improve performance, especially in LDL applications.
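As a quick check of these figures, the sketch below reproduces the time-of-flight and cycle-count arithmetic; all inputs are the numbers quoted above (50 m core diameter, 2 ns cycle time, 8.6 µs measured MPI latency).

```python
# Reproduces the arithmetic quoted above; all inputs are figures from the text.
C_VACUUM = 299_792_458.0        # speed of light in vacuum, m/s

diameter_m = 50.0               # Earth Simulator processor core diameter
cycle_s = 2e-9                  # processor cycle time
mpi_latency_s = 8.6e-6          # measured internode MPI latency

crossing_s = diameter_m / C_VACUUM
print(f"one-way crossing: {crossing_s * 1e9:.1f} ns "
      f"({crossing_s / cycle_s:.0f}x the cycle time)")                # ~166.8 ns, 83x
print(f"MPI latency: {mpi_latency_s / cycle_s:.0f}x the cycle time")  # 4300x
print(f"gap to time of flight: {mpi_latency_s / crossing_s:.0f}x")    # ~52x
```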
In general, a large-scale supercomputer can be modeled as a circular network core surrounded by processing and memory nodes, as shown in Figure 1. A limit on communications latency is one round trip from a processor node to a memory node and back. This covers a distance of roughly four times the network physical radius (4l), yielding a minimum delay of 4l/c.

Figure 1: Modeling speed-of-light limits on data access latency.

While chording paths could reduce the traveled length, they significantly complicate the routing mechanism. The goal of this research is to approach this fundamental limit by exploring a feasible interconnection network (topology, routing, medium, packet format, physical switching, electro-optical conversion) in which delay is minimized. Our metric of success, the latency ratio, is defined as the delivered latency divided by the fundamental limit (T_delivered · c / 4l). This ratio for the Earth Simulator is roughly 50. The objective of this research is to develop techniques to reach a far lower ratio.

Software
While software issues are outside the scope of this work, it is unlikely that a traditional message passing software mechanism (MPI) will be compatible with the system requirements of an LDL system. Software overhead often adds many hundreds of instructions (and cycles) to the latency of a data access. A non-cached shared memory model with atomic raising and lowering operators to support synchronization is a lower overhead approach.

Link Medium
For a physically large system, optical interconnect provides the greatest opportunity for low latency ratio communications. While free space optics offers the lowest time-of-flight delay, guided wave (fiber) simplifies physical design (and reduces alignment tolerances) for a modest increase in propagation delay. Furthermore, the use of fiber optics enables the leveraging of commercial telecom technologies for this application. The switch elements employ Semiconductor Optical Amplifiers (SOAs) to provide both fast switching and amplification of multi-spectral optical pulses.

Data Vortex Topology
For optical communication, electro-optical conversions must be minimized to keep latency and cost low. Indirect networks better fit the assumed system architecture, allow centralized switching resources, and are less dependent on locality not found in LDL applications. A wide range of well-studied indirect topologies exists. However, optical buffering is problematic. Since intra-network blocking and output blocking require buffering, the ideal topology for an optical network would exhibit only input blocking. Here, electro-optical conversion can be delayed until network access is available, eliminating the need for optical buffering. It also offers backpressure to the processors, reducing the need for explicit traffic regulation.

Single Pulse Packets
Wavelength Division Multiplexing (WDM) is used to encode both header and payload to maximize the data transfer efficiency and maintain low latency. This single pulse format further simplifies the processing required within each switch element, as the decoding of the header bits is achieved by passive wavelength filtering.

Fast Electro-Optical Interfaces
Ultra-high-speed interfaces for the single pulse WDM packets are developed to provide variable and controllable electronic data sources that closely emulate parallel data signals of interest in computer systems. The transmitters accurately reproduce, and accurately control, a variety of timing and encoding methods for the transmitted data. The receiver electronic interface is designed to capture and check single pulse data packets converted from optical receivers, emulating possible data-capture techniques to deliver appropriate electronic signals to the receiving computer system.

Paper Outline
This paper is organized as follows.
First, a background section presents recent supercomputer system information. Then the Data Vortex is presented, followed by a discussion of single pulse packets and the corresponding fast electro-optical interfaces. The paper concludes with a summary and future work.

2. Background

This section reviews recent supercomputer systems with special attention to the processor-to-processor and processor-to-memory interfaces. Table 1 shows information for several systems. The following paragraphs highlight each system.

Table 1: Supercomputer summary with communication rates.

Year | Computer | Measured GF/s | Max GF/s | PEs | Link BW | Network CS BW | Network Topology
2002 | Earth Simulator, NEC | 35,860 | 40,960 | 5,120 | 12.3 GB/s | 8 TB/s | Crossbar
2001 | ASCI White, IBM SP Power 3 | 7,226 | 12,288 | 8,192 | 2 GB/s | - | Crossbar & SP
2000 | ASCI White, IBM SP Power 3 | 4,938 | 12,288 | 8,192 | 2 GB/s | - | Crossbar & SP
1999 | ASCI Red, Intel Xeon core | 2,379 | 3,207 | 9,632 | 800 MB/s | - | Mesh
1999 | ASCI Blue Pacific SST, IBM 604E | 2,144 | 3,868 | 5,856 | - | - | Switch
1997 | Intel ASCI Option Red, Pentium Pro | 1,338 | 1,830 | 9,152 | 800 MB/s | - | Mesh
1996 | Hitachi CP-PACS | 368 | 614 | 2,048 | 300 MB/s | - | Hyper Crossbar
1995 | Intel Paragon XP/S MP | - | - | 1,768 | 200 MB/s | - | 2D Mesh
1994 | Fujitsu VPP-500 | - | - | - | 800 MB/s | - | Crossbar

Earth Simulator
The Earth Simulator is a distributed memory system consisting of 640 processor nodes connected by a 640 x 640 single-stage crossbar switch. Each node is a shared memory cluster composed of eight arithmetic vector processors (AP), a shared memory system of 16 GBytes, a remote access control unit (RCU), and an I/O processor (IOP). The peak performance of each AP is 8 GFlops. Therefore the total number of processors is 5,120, and the total peak performance and main memory capacity are 40 TFlops and 10 TB. The internode data transfer rate is 12.3 GB/s x 2 [1][2].

ASCI White
ASCI White is the third generation (following ASCI Red and ASCI Blue) on the way to 100 TeraOPS. This supercomputer, designed by IBM, is located at Lawrence Livermore. The system consists of 512 IBM RS/6000 SP Nighthawk-2 nodes operating at 375 MHz. Each node includes 16 processors, memory, and a network interface. Internode communication is provided by the SP switch, which delivers 800 MB/s of bandwidth [3].

ASCI Red
ASCI Red, the first step in the ASCI Platforms Strategy, is a massively parallel, MIMD computer installed at Sandia National Laboratories. It was the world's first 1 TeraOPS supercomputer. Standard parallel programming interfaces simplify porting parallel applications to this system [4].

ASCI Blue Pacific SST
The ASCI Blue-Pacific supercomputer is partitioned into two major clusters: the open (public) cluster and the closed (private) cluster. The two supercomputers are SP-architecture MPP systems. The open system consists of 1,344 332 MHz PowerPC 604e processors, and the larger closed computer is a 3.8 TeraOp machine of 5,856 processors. The open system is the machine that appears on the Top 500 lists, with a peak quoted speed of 0.9 Tflop/s and a sustained Linpack benchmark of 0.46 Tflop/s.

In theory the closed machine has a peak speed of 3.8 Tflop/s, but these results have not yet been reported [8][9].

Intel ASCI Option Red
The design of the ASCI Option Red supercomputer is loosely based on the Intel Paragon supercomputer. The Paragon used a 2D mesh interconnection facility (ICF) that could move messages at a peak unidirectional bandwidth of 200 Mbytes per second. Each Paragon node held two (the GP node) or three (the MP node) Intel i860 XP processors [4][10].

Hitachi CP-PACS
The CP-PACS is an MIMD (Multiple Instruction-streams, Multiple Data-streams) parallel computer with a theoretical peak speed of 614 Gflops and a distributed memory of 128 Gbytes. The system consists of 2,048 processing units (PUs) for parallel floating point processing and 128 I/O units (IOUs) for distributed input/output processing. These units are connected in an 8x17x16 three-dimensional array by a Hyper Crossbar network. A well-balanced performance of CPU, network, and I/O devices supports the high capability of CP-PACS for massively parallel processing [5][7].

Intel Paragon XP/S MP
The Paragon is a commercialized offspring of the experimental Touchstone Delta system. The latter machine was built for the Concurrent Supercomputing Consortium at CalTech. The Delta system used i860 processors as computational elements in its nodes but, unlike its predecessor, the iPSC/860, the nodes were arranged not in a hypercube topology but in a 2-D grid. A speed of 10.9 GFlops/s was reported for the Delta on an order-20,000 full linear system. The Paragon's i860/XP has processor communication hardware on-chip to increase communication bandwidth [5].

Fujitsu VPP-500
The VPP500 vector parallel processor is a highly parallel, distributed memory supercomputer that has a performance range of 6.4 to 355 gigaflops and a main memory capacity from 1 to 222 gigabytes. The system supports between 4 and 222 processors interconnected by a high-bandwidth crossbar network. The VPP500 is built from a custom 1.6 gigaflops vector processor and provides 800 MB/sec of point-to-point bandwidth between nodes [5][6].

Trends
Beyond the obvious trend towards higher performance, there is also significant growth in internode communication network and memory bandwidth. Based on these trends, the targets for this research are shown in Table 2. Note that Gops/sec is targeted since the applications base is non-floating point. The research outlined in the following sections offers possible paths to significant improvements in network bandwidth and latency.

Table 2: Target network parameters for research.

Max PEs | Gops/s | Link BW | Network CS BW | Network Topology
- | 1,000,000 | 8 GB/s | 8 TB/s | Data Vortex

3. Data Vortex

The Data Vortex architecture [14,15] can be viewed as a collection of richly connected routing nodes on multiple fiber cylinders, as seen in Figure 2. The switch fabric size is characterized by two parameters (A,H), representing the number of switching nodes along the angle and height dimensions respectively. Parameter A is typically set to be a small odd number (<10) and is independent of the choice of H. The available number of input/output (I/O) ports is given by the product HxA. The number of cylinder levels (C) scales as C = log2(H) + 1. In Figure 2, a switch fabric of (A,H) = (5,4) is shown with a top view of the routing tours and with a side view of the interconnection patterns at each of the C=3 cylinders.
Each cross point shown is a routing node, which can be labeled uniquely by the coordinates (a,c,h), where 0 ≤ a < A, 0 ≤ c < C, and 0 ≤ h < H.
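The sizing and addressing rules just stated are compact enough to express in a few lines. The sketch below (ours, not from the paper) enumerates an (A,H) fabric and checks the cylinder count and port count for the (5,4) example of Figure 2.

```python
from math import log2

def cylinders(H: int) -> int:
    """Cylinder levels: C = log2(H) + 1 (H must be a power of two)."""
    assert H > 0 and H & (H - 1) == 0, "H must be a power of two"
    return int(log2(H)) + 1

def nodes(A: int, H: int) -> list[tuple[int, int, int]]:
    """All routing-node coordinates (a, c, h) of an (A, H) fabric."""
    C = cylinders(H)
    return [(a, c, h) for a in range(A) for c in range(C) for h in range(H)]

A, H = 5, 4                        # the Figure 2 example
assert cylinders(H) == 3           # C = 3 cylinders
assert len(nodes(A, H)) == A * cylinders(H) * H   # 60 routing nodes
print("I/O ports:", H * A)         # H x A = 20 input and 20 output ports
```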

Packets are processed synchronously in a highly parallel manner. Each packet is of a fixed length and is routed in a slotted manner. Within each clock cycle, every packet in the switch progresses one angle forward in the given direction, either along the solid line towards the same cylinder or along the dashed line towards the inner cylinder. The solid routing paths along the same cylinder are shown in Figure 2 from the side view of each cylinder. These connection patterns are carefully designed and repeat from angle to angle to minimize the packet deflection probability. The dashed-line paths between neighboring cylinders maintain the same height index h; they are used to forward the packets inward. As shown, packets are injected at the outermost cylinder (c=0) from the input ports and emerge at the innermost cylinder (c=log2(H)) towards the output ports. Each packet is self-routed in the fashion of binary-tree decoding as it propagates from the outer cylinder towards the inner cylinder: every cylinder progression fixes a specific bit within the binary header address. The innermost cylinder (c=log2(H)) also allows the packet to circulate when the output buffers are busy. To avoid packet contention, the switching architecture employs a synchronous and distributed control mechanism to properly schedule the neighboring packet flow. As a result, each node encounters at most one packet at a time, and no optical buffering is necessary within the Data Vortex switch fabric. This also greatly simplifies the routing procedure at each hop and facilitates the photonic implementation of the architecture. Although packet deflection occurs under certain traffic conditions, the probability of that event and its incurred penalty are minimized. This is achieved because packets are provided multiple paths to the destination and are always provided an open path, by staying on the same cylinder, if they are deflected. The angle dimension thus provides a virtual buffering mechanism for the deflected packets while eliminating potential packet conflict. Importantly, the hierarchical routing procedure allows the employment of single bit WDM packet encoding, by which the single-bit based routing is accomplished by wavelength filtering in the header retrieval process.

Traffic Flow Control and Routing
A key design element of the system is a distributed control signaling mechanism among routing nodes, which achieves buffer-less operation and simple routing logic. With the embedded synchronous timing, this scheme schedules the traffic flow of neighboring nodes so that packet conflict is eliminated. To implement the scheme, control lines are applied between any pair of nodes that have competitive output paths. To see this more clearly, a small group of nodes around node C (1,1,2) is shown in Figure 3, where each node is labeled by coordinate (a,c,h). A specific example of control signaling (vertical line) between node A (0,1,3) and node B (0,0,2) is shown, because both send packets to node C (1,1,2). The mechanism is very simple: a deflection control message is automatically triggered from node A to node B whenever A sends a packet to C. Since it takes a latency of d0 to deliver the control message, the packet at node A must be slightly earlier than the packet at node B for proper scheduling.

Figure 2: Data Vortex topology (A,H) = (5,4) with routing tours seen from the top and the side. Each node is labeled uniquely by the coordinate (a,c,h), where 0 ≤ a < A, 0 ≤ c < C, and 0 ≤ h < H.
Figure 3: Control signaling (vertical lines) between competing nodes within the neighborhood.
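A toy version of the per-hop decision, with the control message modeled as a boolean, might look as follows. The bit conventions and the perm() placeholder for the same-cylinder wiring pattern are our assumptions; the paper does not spell them out.

```python
def route_one_hop(a, c, h, dest, deflected, A, C):
    """One slotted hop for a packet at node (a, c, h) bound for height dest.

    deflected: True if a neighbouring node (the 'A' of Figure 3) has
    claimed the shared inward node with a control message.
    """
    a_next = (a + 1) % A                  # every packet advances one angle
    bits = C - 1                          # log2(H) address bits
    if c < C - 1:
        want = (dest >> (bits - 1 - c)) & 1   # header bit tested at cylinder c
        have = (h >> (bits - 1 - c)) & 1      # same bit of the current height
        if want == have and not deflected:
            return (a_next, c + 1, h)     # dashed inward link keeps the height
    return (a_next, c, perm(c, h))        # stay on cylinder (virtual buffer)

def perm(c, h):
    # Placeholder for the engineered angle-to-angle height permutation;
    # the innermost cylinder simply circulates until its output frees up.
    return h
```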

If both B and A have packets addressed to C, the control message prevents B's packet from progressing to C in the same packet slot. The deflected packet remains at its current cylinder, propagating to node D in Figure 3 instead. A virtual buffering mechanism is thus provided for the deflected packet, and only a slight latency penalty is introduced, because the deflected packet recovers its direction vector towards the target every other clock cycle (in two hops) by staying on the same cylinder. The system can be kept synchronous by properly designing the link latency. As shown in Figure 3, d1 and d2 represent the propagation latencies for the same-cylinder link (East out path) and the neighbor-cylinder link (South out path) respectively. In practical implementations, the control latency d0 takes only a small percentage of the packet period, because the competing nodes are physically located close to each other. Therefore, by simply making d2 + d0 = d1, the switch system is able to maintain synchronous operation as well as allow the correct setup of the control mechanism.

Performance
This section presents initial simulation results comparing the Data Vortex and butterfly topologies. A message in the simulation is assumed to be one packet long, and each link holds only one packet per cycle. Traffic load is calculated as the percentage of the input ports into which packet injections are attempted on each cycle. For example, if there are 512 input ports and the load is to be 50%, a packet injection is attempted on 256 randomly-chosen input ports in each cycle, with each packet having a randomly-generated output destination address. The acceptance rate is calculated as the percentage of attempted injections that are successful. The Data Vortex performs virtual buffering by allowing packets to propagate around the angles on each clock cycle in an always-moving fashion, while the butterfly performs single message input buffering at each of the inputs of its constituent 2x2 switches. The Data Vortex switch allows direct optical switching of WDM-packed pulses without need for OEO conversion. Because the butterfly network can block at each switch, it must perform OE conversion to allow electrical buffering, followed by EO conversion to reproduce the optical signal. An all-electrical butterfly eliminates the OE and EO conversions, but also eliminates the opportunity for WDM packing. Since the Data Vortex employs links as virtual buffers, its angle dimension is used in this comparison to provide buffering only, whereas the cylinder height determines the number of inputs. While this increases the number of cross-section links, it substantially reduces switch complexity and size relative to a butterfly switch for the aforementioned reasons. In this simulation, the angle is set to 4. Figure 4 shows packet acceptance versus offered load. The Data Vortex exhibits a higher acceptance rate for 1024 input ports. This is because the Data Vortex does not block packets from leaving the input nodes, owing to the lack of buffering and the "always-moving" nature of data handling (i.e., the data packets flow away from the input and into lower cylinders of the vortex rather than buffering near the inputs and blocking additional packets from entering the network). Packets in the butterfly network often block at the first level of the topology due to output contention in the 2x2 switches, preventing additional packets from entering the network.
Figure 4: Packet acceptance rate versus offered load for 1024 inputs (accepted traffic % versus offered traffic %, Data Vortex and butterfly).
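The load and acceptance bookkeeping used in these simulations can be stated compactly. The harness below is our sketch; the network model itself (try_inject) is stubbed out and would be replaced by a Data Vortex or butterfly model.

```python
import random

def acceptance_rate(n_ports: int, load: float, try_inject) -> float:
    """Attempt injections on a load-fraction of randomly-chosen ports for
    one cycle and return successful / attempted injections."""
    attempts = random.sample(range(n_ports), int(n_ports * load))
    ok = sum(1 for p in attempts
             if try_inject(p, dest=random.randrange(n_ports)))
    return ok / max(1, len(attempts))

# Example from the text: 512 ports at 50% load -> 256 attempts per cycle.
rate = acceptance_rate(512, 0.50, lambda p, dest: True)  # stub: all accepted
print(rate)  # 1.0 for the stub; a real model would refuse blocked inputs
```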

Figure 5 shows that for varying network I/O sizes and a fixed 100% network workload, the Data Vortex accepts about 73% of all injected packets that are attempted, whereas the butterfly accepts only 30-44%. The Data Vortex exhibits roughly the same average number of hops per packet from input to output as the butterfly, while carrying about twice as many packets.

Figure 5: Packet acceptance rate versus network size at maximum offered traffic (accepted traffic %, Data Vortex and butterfly).

4. Single Pulse WDM Packets and Ultrafast Electro-Optic Interface

System Description
We developed a single-bit WDM transmission test-bed that demonstrates the feasibility of this low latency optical link [15]. The optical packets are a single bit in duration and encoded along the wavelength domain. In this fashion, the latency associated with parallel-to-serial and serial-to-parallel conversion is eliminated. The DWDM optical link provides ultra-high capacity data transmission in a cost-effective manner by leveraging components from commercial telecom technologies. The optoelectronic digital interface to the optical link times and formats the signals, distributes the clocks, and captures and pre-processes incoming data. In the demonstrated test-bed, four-bit parallel data are NRZ modulated and transmitted synchronously during each clock cycle to DWDM transmitters of different channel wavelengths. An additional fifth presence bit is co-transmitted to flag the data as valid. Each bit is encoded along a different channel wavelength within the C-band (1525nm-1625nm). The five bits are then multiplexed in a DWDM arrayed waveguide grating and transmitted along a single fiber. The fiber link dispersion is carefully managed to assure precise bit timing and reduce the skew between channels. Incoming data from the fiber is de-multiplexed by another DWDM arrayed waveguide grating into the four-bit parallel word and the presence bit, and converted to electrical pulses through optical receivers. The four-bit parallel electrical pulses are sampled by high-speed PECL circuits and transmitted to the logic interface. The interface performs analysis on the data for pattern error detection. A block diagram of the test bed is shown in Figure 6. Four digitally-programmable delay PECL chips set variable pulse widths and delays. The delay range is 10 ns with a resolution of 10 ps. The four-bit data words are synchronized, while the presence bit is offset to signal when the data is valid. Four high-speed input channels are also provided, with an input for a high-speed clock. The receiver channels are designed to operate either independently or synchronously with the transmitter. The data is captured using a PECL register and returned to the high-speed interface, where data analysis is performed. The constructed high-speed electro-optical interface board is shown in Figure 7.

Figure 6: Bit-parallel interconnection test bed (transmitter block with four data channels λ1-λ4 plus a presence bit λ5, RF clock source, and DWDM MUX; fiber; receiver block with DWDM DEMUX; digital core block with USB interface).
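The single-pulse packet format maps each bit of the parallel word onto its own wavelength within the same time slot. A sketch of that mapping follows; the channel wavelengths are those later listed with Figure 10, while the bit-to-channel ordering and the presence-bit wavelength are hypothetical.

```python
# Data channels 1-4 (nm), as listed with Figure 10; bit ordering is assumed.
DATA_CHANNELS_NM = [1537.40, 1550.72, 1554.94, 1556.55]
PRESENCE_NM = 1558.0   # hypothetical: the presence-bit wavelength isn't given

def encode_word(word: int) -> list[float]:
    """Wavelengths pulsed during one slot for a 4-bit word (plus presence)."""
    assert 0 <= word < 16
    pulses = [nm for i, nm in enumerate(DATA_CHANNELS_NM) if (word >> i) & 1]
    pulses.append(PRESENCE_NM)   # always sent: marks the slot as valid data
    return pulses

print(encode_word(0b1010))   # channels 2 and 4 plus the presence bit
```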

Figure 7: Photograph of the high-speed electro-optic interface board (power connector, clock distribution, Tx delay generator, Tx data formatter, Tx outputs, digital test core, Rx logic, Rx inputs).

In Figure 8, input data signals consisting of two four-bit parallel words are shown. They are generated by the four digitally-programmable delay PECL chips, which can set variable pulse widths and delays. The current FPGA used in the interface electronics is limited to a maximum data rate of 622 Mbps; this sets the upper limit on the demonstrated bit-parallel word rate. Nevertheless, the obtained data pulse widths were approximately 300 ps, indicating that much faster word rates are achievable. A minimum width data pulse is shown in Figure 9.

Figure 8: Two four-bit parallel words on channels λ1-λ4 with the presence bit.

Figure 9: Minimum width pulse from the optoelectronic transmitter interface.
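To make the headroom claim above concrete: with ~300 ps pulses, back-to-back slots could in principle run several times faster than the current FPGA-limited word rate. A rough estimate (our arithmetic, not the paper's):

```python
pulse_width_s = 300e-12          # measured minimum pulse width, ~300 ps
fpga_limit_hz = 622e6            # current interface ceiling, 622 Mbps

max_word_rate_hz = 1.0 / pulse_width_s   # ~3.3 GHz if pulses tile the slot
print(f"potential word rate: {max_word_rate_hz / 1e9:.1f} Gword/s, "
      f"{max_word_rate_hz / fpga_limit_hz:.1f}x the FPGA limit")
```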

The four optical channels were selected to enable investigation of possible channel cross-talk by closely spacing two channels. The second pair of channels was set at the edge of the C-band to evaluate the gain bandwidth. The optical spectrum of the link is shown in Figure 10. The overall link latency is the sum of contributions from the interface, which includes the transmitter and receiver, and from the propagation time of the lightwave. In the current test-bed the overall measured latency is approximately 4.5 ns, as shown in Figure 11. We find that the dominant contribution to this latency is in the transmitter module.

Figure 10: Spectrum of the fiber link with the four-bit word and its presence bit (Ch. 1 λ = 1537.40 nm, Ch. 2 λ = 1550.72 nm, Ch. 3 λ = 1554.94 nm, Ch. 4 λ = 1556.55 nm).

Figure 11: Bit-parallel interconnection signals (electrical differential input data signal, optical TX output data signal, and electrical differential output data signal).

It is expected that future commercial optical transmitters will exhibit lower delays. Importantly, we note that in the DWDM bit-parallel system presented, the latency does not scale with the number of wavelength channels. The current physical setup is shown in Figure 12. With four data channels and a presence channel, all operating differentially, a total of 20 cables is used for the transmitter and receiver. This setup was originally planned to divide the optical components from the electrical components, since each was being developed by a separate research team. The next generation of this project will be a single board with all of the components integrated on it. This will not only simplify the physical aspect of connecting the two projects, but also minimize transmission line effects from the cabling. The next generation of the optoelectronic tester is currently in the design stage. It will operate up to 10 Gbps using new SiGe parts, and all components will be integrated into one circuit board. The final goal is to reach 128 digital channels combined using WDM into one fiber, producing an aggregate data rate of 1.28 Terabits per second over a single optical fiber.

Figure 12: Opto-electronic setup.
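The latency accounting above (interface delay plus time of flight, independent of channel count) fits in a one-line budget. The ~5 ns/m fiber figure is a standard silica value; the transmitter/receiver split below is illustrative, and only the ~4.5 ns total comes from the measurement.

```python
FIBER_NS_PER_M = 5.0    # ~5 ns/m in silica fibre (group index ~1.5)

def link_latency_ns(tx_ns: float, rx_ns: float, fiber_m: float) -> float:
    """Interface plus propagation delay; the channel count does not appear."""
    return tx_ns + rx_ns + FIBER_NS_PER_M * fiber_m

# Illustrative split reproducing the ~4.5 ns measured on the short test-bed
# link, with the transmitter dominating as observed in the text.
print(link_latency_ns(tx_ns=3.0, rx_ns=1.0, fiber_m=0.1))   # 4.5 ns
```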

6. Summary and Future Work

The techniques presented in this paper will enable significant reductions in data access latency in supercomputers executing LDL applications. The methodologies employed include:

[1] A fiber-based link medium that simplifies physical design (and leverages existing telecom technology).
[2] Wavelength Division Multiplexing (WDM) used to encode both header and payload, maximizing data transfer efficiency while maintaining low latency.
[3] Ultra-high-speed interfaces for the single pulse WDM packets that provide variable and controllable electronic data sources closely emulating parallel data signals of interest in computer systems.
[4] A novel switch fabric architecture specifically designed for optical implementation that solves the optical buffering issue by forcing network blocking to the inputs.
[5] A hierarchical routing procedure that allows the employment of single bit WDM packet encoding, by which the single-bit based routing is accomplished by wavelength filtering in the header retrieval process.

This multi-faceted approach combines the strengths of high-speed electronics and photonics technologies in synergy to enable scalable, low latency optical network implementations that will propel future supercomputers to new levels of performance. Our future work will include a complete demonstration of a multinode network test-bed based on the Data Vortex architecture, incorporating the various physical subsystems described.

References

[1] Earth Simulator:
[2] Tetsuya Sato, Shigemune Kitawaki, and Mitsuo Yokokawa, "Earth Simulator Running," 2002.
[3] ASCI White:
[4] ASCI Red:
[5] Parallel Computing Hardware:
[6] T. Utsumi, M. Ikeda, and M. Takamura, "Architecture of the VPP500 Parallel Supercomputer," Proceedings of the 1994 Conference on Supercomputing, December 1994, Washington, DC.
[7] Netlib Repository:
[8] ASCI Blue Pacific News:
[9] H. Franke, J. Jann, J. E. Moreira, P. Pattnaik, and M. A. Jette, "An Evaluation of Parallel Job Scheduling for ASCI Blue-Pacific," Proceedings of Supercomputing 1999, Portland, Oregon, November 1999.
[10] T. Mattson and G. Henry, "The ASCI Option Red Supercomputer," Proceedings of the Intel Supercomputer Users Group, Albuquerque, New Mexico, June 1997.
[11] IBM Redbooks, "Understanding and Using the SP Switch."
[12] Lawrence Livermore National Laboratory, ASCI program homepage.
[13] TOP500 Supercomputer site:
[14] C. Reed, "Multiple level minimum logic network," U.S. Patent 5,996,020, Nov. 30, 1999.
[15] Q. Yang, K. Bergman, G. D. Hughes, and F. G. Johnson, "WDM Packet Routing for High Capacity Data Networks," J. Lightwave Technology, Vol. 19, p. 1420 (2001).
[16] J. S. Davis, D. C. Keezer, O. Liboiron-Ladouceur, and K. Bergman, "Application Details for Embedded Digital Test Core: Optoelectronic Test Bed and Wafer-level Prober," Proc. of the Int'l Test Conf. (paper submitted).


MULTIPLEXER / DEMULTIPLEXER IMPLEMENTATION USING A CCSDS FORMAT

MULTIPLEXER / DEMULTIPLEXER IMPLEMENTATION USING A CCSDS FORMAT MULTIPLEXER / DEMULTIPLEXER IMPLEMENTATION USING A CCSDS FORMAT Item Type text; Proceedings Authors Grebe, David L. Publisher International Foundation for Telemetering Journal International Telemetering

More information

Communication Networks

Communication Networks Communication Networks Chapter 3 Multiplexing Frequency Division Multiplexing (FDM) Useful bandwidth of medium exceeds required bandwidth of channel Each signal is modulated to a different carrier frequency

More information

Network-on-chip (NOC) Topologies

Network-on-chip (NOC) Topologies Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance

More information

Algorithms and Architecture. William D. Gropp Mathematics and Computer Science

Algorithms and Architecture. William D. Gropp Mathematics and Computer Science Algorithms and Architecture William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp Algorithms What is an algorithm? A set of instructions to perform a task How do we evaluate an algorithm?

More information

TECHNOLOGY BRIEF. Compaq 8-Way Multiprocessing Architecture EXECUTIVE OVERVIEW CONTENTS

TECHNOLOGY BRIEF. Compaq 8-Way Multiprocessing Architecture EXECUTIVE OVERVIEW CONTENTS TECHNOLOGY BRIEF March 1999 Compaq Computer Corporation ISSD Technology Communications CONTENTS Executive Overview1 Notice2 Introduction 3 8-Way Architecture Overview 3 Processor and I/O Bus Design 4 Processor

More information

LAN Systems. Bus topology LANs

LAN Systems. Bus topology LANs Bus topology LANs LAN Systems Design problems: not only MAC algorithm, not only collision domain management, but at the Physical level the signal balancing problem (signal adjustment): Signal must be strong

More information

Initial Performance Evaluation of the Cray SeaStar Interconnect

Initial Performance Evaluation of the Cray SeaStar Interconnect Initial Performance Evaluation of the Cray SeaStar Interconnect Ron Brightwell Kevin Pedretti Keith Underwood Sandia National Laboratories Scalable Computing Systems Department 13 th IEEE Symposium on

More information

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University

Lecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University 18 447 Lecture 26: Interconnects James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L26 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Housekeeping Your goal today get an overview of parallel

More information

Vector an ordered series of scalar quantities a one-dimensional array. Vector Quantity Data Data Data Data Data Data Data Data

Vector an ordered series of scalar quantities a one-dimensional array. Vector Quantity Data Data Data Data Data Data Data Data Vector Processors A vector processor is a pipelined processor with special instructions designed to keep the (floating point) execution unit pipeline(s) full. These special instructions are vector instructions.

More information

AWG-based Optoelectronic Router with QoS Support

AWG-based Optoelectronic Router with QoS Support AWG-based Optoelectronic Router with QoS Support Annette Böhm, Magnus Jonsson, and Kristina Kunert School of Information Science, Computer and Electrical Engineering, Halmstad University Box 823, S-31

More information

I/O Choices for the ATLAS. Insertable B Layer (IBL) Abstract. Contact Person: A. Grillo

I/O Choices for the ATLAS. Insertable B Layer (IBL) Abstract. Contact Person: A. Grillo I/O Choices for the ATLAS Insertable B Layer (IBL) ATLAS Upgrade Document No: Institute Document No. Created: 14/12/2008 Page: 1 of 2 Modified: 8/01/2009 Rev. No.: 1.00 Abstract The ATLAS Pixel System

More information

6.9. Communicating to the Outside World: Cluster Networking

6.9. Communicating to the Outside World: Cluster Networking 6.9 Communicating to the Outside World: Cluster Networking This online section describes the networking hardware and software used to connect the nodes of cluster together. As there are whole books and

More information

Efficient Hybrid Multicast Routing Protocol for Ad-Hoc Wireless Networks

Efficient Hybrid Multicast Routing Protocol for Ad-Hoc Wireless Networks Efficient Hybrid Multicast Routing Protocol for Ad-Hoc Wireless Networks Jayanta Biswas and Mukti Barai and S. K. Nandy CAD Lab, Indian Institute of Science Bangalore, 56, India {jayanta@cadl, mbarai@cadl,

More information

A Survey of Techniques for Power Aware On-Chip Networks.

A Survey of Techniques for Power Aware On-Chip Networks. A Survey of Techniques for Power Aware On-Chip Networks. Samir Chopra Ji Young Park May 2, 2005 1. Introduction On-chip networks have been proposed as a solution for challenges from process technology

More information

HPCC Random Access Benchmark Excels on Data Vortex

HPCC Random Access Benchmark Excels on Data Vortex HPCC Random Access Benchmark Excels on Data Vortex Version 1.1 * June 7 2016 Abstract The Random Access 1 benchmark, as defined by the High Performance Computing Challenge (HPCC), tests how frequently

More information

Design For High Performance Flexray Protocol For Fpga Based System

Design For High Performance Flexray Protocol For Fpga Based System IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) e-issn: 2319 4200, p-issn No. : 2319 4197 PP 83-88 www.iosrjournals.org Design For High Performance Flexray Protocol For Fpga Based System E. Singaravelan

More information

FIBER OPTIC NETWORK TECHNOLOGY FOR DISTRIBUTED LONG BASELINE RADIO TELESCOPES

FIBER OPTIC NETWORK TECHNOLOGY FOR DISTRIBUTED LONG BASELINE RADIO TELESCOPES Experimental Astronomy (2004) 17: 213 220 C Springer 2005 FIBER OPTIC NETWORK TECHNOLOGY FOR DISTRIBUTED LONG BASELINE RADIO TELESCOPES D.H.P. MAAT and G.W. KANT ASTRON, P.O. Box 2, 7990 AA Dwingeloo,

More information