Functional Description of the Architecture of a Special-Purpose Processor for Orders-of-Magnitude Reduction in Run Time in Computational Electromagnetics

Tayfun Özdemir
Virtual EM Inc., Ann Arbor, Michigan, USA
tayfun@virtualem.com

SPI 2013, Paris, France, May 13, 2013
Organization of the Talk

1. Definition of the Problem
2. Algorithm
3. Processor Architecture
   - Chip and the User Interface
   - Scalable Run Time
   - Processor Nodes and Mapping of the Algorithm
4. Manufacturing of the Chip
1. Definition of the Problem

Uses of Computational Electromagnetics (CEM):
- Signal integrity during design
- EMI related to board design and packaging
- RF circuits
- Antennas

Slow adoption of CEM tools due to long run times:
- Sequential programming
- Use of general-purpose hardware

Recent progress:
- Massively parallel machines
- Graphics Processing Units (GPUs)
- Multi-core Central Processing Units (CPUs)

Giga Floating-Point Operations Per Second (Gflops) per dollar is still too low:
- MDGRAPE delivers small multiples of Gflops/$ but is expensive
What is Needed

An orders-of-magnitude increase in the Gflops/$ ratio: 100x or 1,000x.

What is the holdup?
- Current algorithms require sequential programming
- General-purpose hardware is used

The problem with current algorithms:
- Based on Galerkin or functional formulations
- Result in a linear system of equations
- Require sequential programming for matrix fill and solve
- Scale poorly on High-Performance Computing (HPC) platforms

Gflops/$ of GPUs and multi-core CPUs is increasing, but sequential programming is holding back the scaling. Economy of scale for GPUs and multi-core CPUs will not suffice on its own.

A new paradigm is needed: new algorithms implemented in the form of hardware designed specially for CEM (example: FFT chips).
2. Algorithm

The hardware and the algorithm are inseparable: the algorithm is implemented in the form of hardware.

What needs to happen to realize this?
- A new numerical algorithm that can be implemented in hardware in a scalable fashion
- A special-purpose processor built to implement that algorithm

Rule of thumb: electrical numerical science lags mechanical by 20 years. Our Computational Fluid Dynamics (CFD) colleagues have already done it!
- Abandoned the Navier-Stokes equations in favor of the Boltzmann equation
- Simulate the flow of a Newtonian fluid with collision models
- Simulate streaming and collision processes across a limited number of particles to realize viscous flow behavior across greater dimensions
Lattice Boltzmann Method (LBM)
- PowerFLOW software by Exa Corporation (Burlington, MA, USA)
- Fictitious particles perform consecutive propagation and collision processes over a discrete lattice
- Yields the Navier-Stokes equations in an asymptotic expansion
- The algorithm is highly scalable on HPC platforms

(After "Lattice Boltzmann Methods for Fluid Dynamics" by Steven Orszag, Yale University)
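To make the propagation-and-collision idea concrete, here is a minimal one-dimensional lattice Boltzmann sketch. This is the textbook D1Q2 diffusion scheme, not the PowerFLOW algorithm; the lattice size, relaxation rate, and initial condition are illustrative choices, not details from the talk:

```python
# Minimal D1Q2 lattice Boltzmann sketch: two particle populations per
# lattice site (right-moving and left-moving) relax toward a local
# equilibrium (collision), then hop to the neighboring site (streaming).
n = 64
omega = 1.0                     # relaxation rate toward equilibrium
f_right = [0.0] * n
f_left = [0.0] * n
f_right[n // 2] = f_left[n // 2] = 0.5   # unit mass at the center site

for _ in range(100):
    rho = [r + l for r, l in zip(f_right, f_left)]   # macroscopic density
    feq = [0.5 * d for d in rho]                     # equal-split equilibrium
    # Collision: each population relaxes toward the local equilibrium
    f_right = [fr + omega * (fe - fr) for fr, fe in zip(f_right, feq)]
    f_left = [fl + omega * (fe - fl) for fl, fe in zip(f_left, feq)]
    # Streaming (periodic): right-movers shift +1, left-movers shift -1
    f_right = f_right[-1:] + f_right[:-1]
    f_left = f_left[1:] + f_left[:1]

rho = [r + l for r, l in zip(f_right, f_left)]
# Total mass is conserved while the initial spike diffuses outward
```

Each update uses only a site's own populations and its immediate neighbors, which is exactly the locality that makes the method scale well on parallel hardware.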
Conventional CFD vs. LBM

[Comparison figure. After "Lattice Boltzmann Methods for Fluid Dynamics" by Steven Orszag, Yale University]
New Algorithm for CEM
- We are not proposing to replace Maxwell's equations.
- We are not proposing to abandon current efforts to develop algorithms focused on sequential programming; such efforts will continue and are essential to improving algorithmic efficiency.
- Rather, we propose replacing the Galerkin and functional methods used today to directly discretize Maxwell's equations.
- Perhaps the CEM analogue of the LBM of CFD: a multi-pole expansion of the field?
- The new algorithm must be Maxwellian.

Two options:
1. A new scalable algorithm for existing HPC platforms (as the CFD community did)
2. A new special-purpose processor with an accompanying new algorithm (what is presented here)
3. Processor Architecture

Inspired by the Finite Element Method (FEM): an FEM chip.

Expandable to:
- Finite Difference Time Domain (FDTD)
- Time Domain Finite Element Method (TDFEM)
- Method of Moments (MoM) requires a bit more thinking.
FEM Chip and User Interface

[Block diagram:]
- User interface on the PC: GUI, engine, and API
- Hardware: FEM chip on a PCI board
- Inputs: geometry, excitation, boundary conditions
- Output: unknowns (E- or H-field vector)
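The block diagram implies a host-side API that loads the problem description onto the PCI board and reads back the field solution. A hypothetical sketch of that call sequence; none of these class or method names come from the talk, and the solver is a stand-in:

```python
# Hypothetical host-side API implied by the GUI -> engine -> API ->
# PCI board -> FEM chip data path.  All names are illustrative.
class FemBoard:
    """Stand-in for a driver that would talk to the FEM PCI board."""

    def __init__(self):
        self.problem = {}

    def load(self, geometry, excitation, boundary_conditions):
        # A real driver would DMA the mesh onto the chip's node array.
        self.problem = {"geometry": geometry,
                        "excitation": excitation,
                        "bc": boundary_conditions}

    def solve(self):
        # The chip would iterate for ~N clock cycles; here we return a
        # placeholder field vector with one unknown per mesh node.
        n_unknowns = len(self.problem["geometry"])
        return [0.0] * n_unknowns


board = FemBoard()
board.load(geometry=["node1", "node2", "node3"],
           excitation="plane wave", boundary_conditions="PEC")
field = board.solve()   # E- or H-field vector, one entry per unknown
```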
Scalable Run Time ~ O(N)

- One iteration per clock cycle (250 MHz clock): ITER 1, ITER 2, ITER 3, ..., ITER N
- One clock cycle = (1/250) microseconds = 4 ns
- Solution time = N clock cycles = (N / 250) x 10^-6 seconds
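The slide's run-time model reduces to a one-line formula: one iteration per clock cycle at 250 MHz, so a problem with N unknowns finishes in N clock cycles. A quick check of the arithmetic:

```python
# Run-time model from the slide: one iteration per clock cycle at 250 MHz,
# so solution time = N clock cycles = N / 250e6 seconds.
CLOCK_HZ = 250e6          # 250 MHz clock


def solution_time_seconds(n_unknowns):
    """Solution time for a problem with n_unknowns unknowns."""
    return n_unknowns / CLOCK_HZ


cycle = solution_time_seconds(1)        # one clock cycle: 4e-09 s (4 ns)
t = solution_time_seconds(10**8)        # ~10^8 unknowns: 0.4 s
```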
Run Times

  Type of Problem (*)                  N        Run Time
  Resonant antenna                     ~10^3    1 msec
  Five-wavelength-long RF circuit      ~10^5    10 msec
  Small boat                           ~10^8    1 sec
  F-16 aircraft (**)                   ~10^9    1 min
  Large ship                           ~10^11   5 hrs

(*) One frequency point and/or one look angle at 10 GHz
(**) A U.S. Air Force challenge since the 1970s
Processor Nodes & Scalable Algorithm

- Multi-pole representation of the EM field (ongoing research)
- Map mesh nodes to computing nodes (ongoing research)

[Figure: mesh nodes P1-P6 mapped one-to-one onto computing nodes P1-P6]
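The mesh-to-node mapping amounts to giving each computing node a list of its mesh neighbors, so data sharing stays local. An illustrative sketch; the connectivity is chosen to match the update P1 = a*P1 + b*P2 + c*P3 + d*P5 on the "New Paradigm" slide, and the remaining edges are assumptions:

```python
# Each computing node stores only its own value and the IDs of its mesh
# neighbors, keeping all data exchange local.  P1's neighbors (P2, P3, P5)
# follow from the update rule in the talk; the other edges are illustrative.
mesh_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P5"),
              ("P2", "P4"), ("P3", "P4"), ("P4", "P6"), ("P5", "P6")]


def neighbor_table(edges):
    """Build the per-node neighbor sets a computing node would store."""
    table = {}
    for a, b in edges:
        table.setdefault(a, set()).add(b)
        table.setdefault(b, set()).add(a)
    return table


nodes = neighbor_table(mesh_edges)
# Node P1 exchanges data only with P2, P3, and P5.
```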
New Paradigm

- Nodes must perform simple computations
- Data sharing must be local
- Must converge in O(N) iterations, i.e., O(N) clock cycles
- At each iteration, P1 = a*P1 + b*P2 + c*P3 + d*P5 (ongoing research)

[Figure: computational unit for node P1 clocked with the rest of the array. It holds the coefficient/value pairs (b, P2), (c, P3), (d, P5) from its neighbors, a multiply-accumulate stage and RAM producing the updated P1, and in turn supplies (b, P1), (c, P1), (d, P1) to its neighbors.]
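The three rules above can be sketched as a synchronous local update: every node simultaneously replaces its value with a weighted sum of itself and its neighbors, one update per clock cycle. The graph, weights, and boundary value below are illustrative assumptions (the talk leaves the actual coefficients as ongoing research):

```python
# One hypothetical clock cycle: every node replaces its value with a
# weighted sum of itself and its neighbors, using only locally stored
# data -- no global matrix is ever assembled.
neighbors = {"P1": ["P2", "P3", "P5"], "P2": ["P1", "P4"],
             "P3": ["P1", "P4"], "P4": ["P2", "P3", "P6"],
             "P5": ["P1", "P6"], "P6": ["P4", "P5"]}


def tick(values, fixed):
    """Synchronous update: P_i <- 0.5*P_i + 0.5*mean(neighbors)."""
    new = {}
    for node, nbrs in neighbors.items():
        if node in fixed:
            new[node] = values[node]      # boundary node stays put
        else:
            avg = sum(values[n] for n in nbrs) / len(nbrs)
            new[node] = 0.5 * values[node] + 0.5 * avg
    return new


values = {n: 0.0 for n in neighbors}
values["P6"] = 1.0                        # excitation / boundary value
for _ in range(2000):                     # one iteration per clock cycle
    values = tick(values, fixed={"P6"})
# All free nodes relax toward the boundary value of 1.0
```

The simple per-node arithmetic (a few multiply-adds), purely local data flow, and the one-iteration-per-cycle loop are exactly the properties the slide demands of the hardware.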
4. Manufacturing of the Chip

- Chip design & manufacturing
  - Functionality (ongoing research)
  - HDL via Verilog (~$300K)
  - Low-volume prototype chip ($20K/unit via Asian foundries)
- PCI card design & manufacturing
- Application Programming Interface (API)
- PC system and benchmarking
- High-volume production
Manufacturing Challenges

The chip has to have as many nodes as the number of unknowns: 1K for a resonant antenna, but 100M for a small boat at 10 GHz.

A 3D chip is not possible today:
- The interconnects between the nodes form a 3D lattice
- With today's chip manufacturing technology, the chip has to be 2D

The parallel nature of the above paradigm (and therefore the run-time scaling with N) must be compromised:
- Introduce a level of sequential computational steps
- Subdivide the three-dimensional solution space into sections, each of which can be mapped to a two-dimensional grid

A reasonably high Gflops/$ ratio could still be achieved.
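One way to read the subdivision idea: slice the 3D unknown grid into slabs, map each slab onto the chip's 2D node array, and process the slabs sequentially. A minimal sketch, with the grid dimensions and slab thickness as made-up examples:

```python
# Illustrative only: split an nx x ny x nz grid of unknowns into z-slabs.
# Each slab fits the chip's 2D node array; the slabs run sequentially,
# trading some parallelism (and O(N) scaling) for manufacturability.
def z_slabs(nx, ny, nz, slab_thickness):
    """Return (z_start, z_end) index ranges covering the 3D grid."""
    slabs = []
    for z0 in range(0, nz, slab_thickness):
        slabs.append((z0, min(z0 + slab_thickness, nz)))
    return slabs


# A 100 x 100 x 35 grid on a chip that holds ten z-layers at a time:
plan = z_slabs(100, 100, 35, slab_thickness=10)
# plan == [(0, 10), (10, 20), (20, 30), (30, 35)]
# Run time grows by the number of slabs (4 sequential passes here)
# instead of staying a single fully parallel pass.
```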
GFlops/$ Wars

  Acceleware Corp.: GPUs                ?    Gflops/$1,000
  Impulse Technologies: FPGAs           ?    Gflops/$1,000
  Appro International: blade clusters   4    Gflops/$1,000 (*)
  IBM: BlueGene/L                       0.1  Gflops/$1,000
  Virtual EM: MDGRAPE machine           21   Gflops/$1,000
  Proposed scheme (estimated)           >100 Gflops/$1,000

(*) 2008 numbers
Next Steps

1. Confirm that the proposed architecture will provide
   a) an order-of-magnitude increase in Gflops/$
   b) O(N) scaling of run time
   via
   a) simulations
   b) limited prototyping using simple micro-controllers serving as simple computational nodes
2. Research on algorithms that are
   a) scalable
   b) implementable in hardware
3. Manufacturing of the processor
   a) 3D chip (not possible in the near future)
   b) 2D chip for 3D problems, with compromised scaling (most likely)
   c) 2D chip for 2D problems: true 2D problems and Body-of-Revolution (BoR) problems
Next Steps (cont.)

4. Improve scaling of current algorithms on today's hardware:
   - GPUs
   - Multi-core CPUs
   - FPGAs
   - ARM-based micro-controllers