A MIMD Multi Threaded Processor


A MIMD Multi Threaded Processor
Falk Lesser; V. Angelov, J. de Cuveland, V. Lindenstruth, C. Reichling, R. Schneider, M.W. Schulz
Kirchhoff Institute for Physics, University Heidelberg, Germany
Phone: +49 6221 54 4304
Email: ti@kip.uni-heidelberg.de
Email (speaker): lesser@kip.uni-heidelberg.de
WWW: http://www.ti.uni-hd.de
Outline: Introduction, Application for the MIMD processor, Electronics environment, The MIMD processor architecture, Summary

High Energy Physics
Study of the fundamental constituents of matter and the forces between them: quarks, electrons, neutrinos, the photon, Z⁰, W±, etc.
Higher and higher energies are required to delve deeper and deeper into matter, i.e. larger, more powerful and more expensive machines, no longer affordable by individual countries.

CERN
Conseil Européen pour la Recherche Nucléaire, located in the Geneva area, Switzerland: a high energy and nuclear physics research center.
Established in 1954 to kick-start research in Europe. Currently funded by 20 European countries (member states and participating non-member states). Users: ~6500 scientists from 500 institutes in ~80 countries.
Next main research program: the Large Hadron Collider. Start of operation: 2005, data taking for ~20 years.

Large Hadron Collider & The Alps
[Aerial view: the LHC ring near the Geneva international airport.] Superconducting accelerator, 27 km circumference, ~100 m deep, 4 interaction points. Collision energies: 7 TeV p-p, 5.5 TeV Pb-Pb.

Nucleons In Collision
8000 Pb-Pb collisions per second, the nuclei moving at ~ the speed of light. Each interaction generates >24,000 particles in the acceptance of the detector.
Task: find one specific particle pair out of 16,000 charged particles within 6 µs.
Within these 6 µs: digitize 1.2 million data channels @ 10 MHz/10 bit, process 29 MBytes, form a global decision.
The main goal is to create the quark-gluon plasma, a state of matter existing during the first few microseconds after the big bang.

A Particle Detector named ALICE
ITS: 2.62 million channels (1 bit); TPC: 570,000 channels (10 bit); TRD: 1.2 million channels (10 bit); 1.425 million ADCs.
Digitization rate: 10 MHz/10 bit. Peak data rate: 17.8 TB/sec. 75,000 MIMD processors, computing time of 6 µs, target clock rate: 120 MHz. Magnet: 0.4 Tesla.
The detector measures particle trajectories and momentum and provides particle identification.

TRD Electronics Overview
1.2 million analog channels @ 10 MHz; 16 channels per module; 75,000 sources; 1 trigger bit.
Track segment processing on the chamber; maximum latency 6 µs; ~20 space points per tracklet; 4-6 tracklets (layers) per track.
[Schematic: Multi-Chip-Module (MCM) with 16 detector channels, preamplifier and digital chip with µCPU, supplies AGND/DGND/APWR/DPWR, and trigger/data links to the neighbouring MCMs within the padrow.]
6 chips (96 channels); 24 padrows per chamber; 18 subdivisions in φ; 5 subdivisions in z.

Tracklet Fit Concept
Pipelined calculation over the ~20 time bins of an MCM (pads 0-15, neighbouring MCMs m and m+1).
During drift time, accumulate the fit sums: N (hit count), the position sums Σ y_i and Σ y_i², the time-bin sums Σ x_i and Σ x_i², and the mixed sum Σ x_i y_i.
After drift time, compute a (intercept), b (slope) and χ² (track quality), and merge track segments in the padrow:
a = (Σ x_i² · Σ y_i − Σ x_i · Σ x_i y_i) / (N · Σ x_i² − (Σ x_i)²)
b = (N · Σ x_i y_i − Σ x_i · Σ y_i) / (N · Σ x_i² − (Σ x_i)²)
[Pipeline stages per time bin: build the amplitude ratio of neighbouring pads (A_{P-1}, A_P, A_{P+1}) via a LUT, calculate the position, calculate the fit summands, accumulate the sums in the sum memory.]
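The accumulate-then-solve scheme can be sketched in C as below. This is a minimal floating-point model of the straight-line fit, not the actual MCM implementation, which works in fixed point with a LUT-based division; the toy data in main() is purely illustrative.

```c
/* Minimal sketch of the tracklet fit: accumulate sums during drift time,
 * then solve the straight-line fit y = a + b*x after drift time. */
#include <stdio.h>

typedef struct {
    int    n;                       /* hit count N                          */
    double sx, sy, sxy, sxx, syy;   /* running sums Σx, Σy, Σxy, Σx², Σy²   */
} fit_sums;

/* called once per time bin during drift time */
static void accumulate(fit_sums *s, double x /* time bin */, double y /* position */)
{
    s->n   += 1;
    s->sx  += x;      s->sy  += y;
    s->sxy += x * y;
    s->sxx += x * x;  s->syy += y * y;   /* Σy² would feed the χ² computation */
}

/* called once after drift time */
static void solve(const fit_sums *s, double *a, double *b)
{
    double det = s->n * s->sxx - s->sx * s->sx;    /* N·Σx² - (Σx)²  */
    *a = (s->sxx * s->sy - s->sx * s->sxy) / det;  /* intercept      */
    *b = (s->n * s->sxy - s->sx * s->sy) / det;    /* slope          */
}

int main(void)
{
    fit_sums s = {0};
    for (int t = 0; t < 20; ++t)          /* toy hits lying on y = 1 + 2x */
        accumulate(&s, t, 1.0 + 2.0 * t);
    double a, b;
    solve(&s, &a, &b);
    printf("intercept a = %.3f, slope b = %.3f\n", a, b);
    return 0;
}
```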

TRD Trigger Timing and Data Flow
Data reduction: 29 MByte → 4.35 MByte → 1.14 MByte → 86.4 kByte → 1 bit. Data bandwidth: 18 TByte → 1.85 TByte.
[Timing diagram on the LHC clock (40.079 MHz / 24.9501 ns): drift time over the visible TEC drift area, ADC pipeline latency and ADC output, calculation of the fit-relevant Σ in the pipeline, tracklet calculation, data shipping and global tracking; ADCs at 10 MHz, processing at 120 MHz.]
Highly I/O bound: 17.8 TB/sec. Very tight time budget: 1.8 + 1.5 µs.

Next Step: Compute and Combine
[Diagram: Readout Board (RB) topology with numbered MCM positions, left/right links (L.., R..) between neighbouring boards, and four φ columns (φ0-φ3).]

Computer Architecture Trends
[Figures after Culler and Patterson, UC Berkeley: transistor count per microprocessor from 1970 to 2005 (i4004 ... Pentium, R10000), with successive eras of bit-level, instruction-level and (?) thread-level parallelism; histogram of the fraction of total cycles vs. the number of instructions issued; processor-memory performance gap — µproc performance grows 60%/yr (2x/1.5 yr), DRAM 6%/yr (2x/15 yrs), so the gap grows ~50% per year.]

Why MIMD with Shared Memory?
1.5 µs computing time @ 120 MHz → 180 clock cycles. ~120 clock cycles are required to finish one task, and four tasks have to be done → ~480 cycles needed. Since 480 cycles do not fit into the 180-cycle budget, the processor must execute up to four tasks on different data objects in parallel.
Arithmetic operations should be executed in one clock cycle. Data has to be shared without overhead → quad ported SRAM (QPM) and a Global Register File (GRF).
Four tasks → four CPUs; same code → a quad port instruction memory saves chip area → MIMD.
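Written out, the budget argument is as follows; the final step from the arithmetic minimum to four CPUs comes from assigning one task to each CPU:

\[
1.5\,\mu\mathrm{s} \times 120\,\mathrm{MHz} = 180 \ \text{cycles available}, \qquad
4 \times 120 \ \text{cycles} = 480 \ \text{cycles needed}
\]
\[
\frac{480}{180} \approx 2.7 \;\Rightarrow\; \text{at least 3 parallel CPUs; one task per CPU} \;\Rightarrow\; 4 \ \text{CPUs}
\]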

Generic Processor Architecture
[Block diagram: four independent generic CPUs, each with its own sequencer and instruction memory, source bus, ALU, register file, load/store units, write-back bus, internal RAM, I/O and a preprocessor interface.]
To execute up to four tasks in real time, four CPUs are needed, but independent CPUs can't share data or program.

MIMD Architecture
[Block diagram: a shared instruction memory (I-Mem) feeds four sequencers; four ALUs with private register files PRF1-PRF4 and a Global Register File (GRF) sit on shared source and write-back data buses, together with load/store units, internal RAM, an arbiter, the FIT unit, the pre-processor interface (projection of pre-processor data inside the MIMD) and external I/O.]
Four Harvard CPUs coupled by a multi-port instruction memory, a multi-port data memory and the Global Register File (GRF). Register-based interface to the pre-processor. Two-stage pipeline (fetch/decode, execute/write back). Four ALUs; data is shared via the GRF and the D-Mem. The architecture can be adapted to any general purpose processor.

Some Features
16-bit data word. 16 private registers per node. 16 global registers common to all nodes. 24 KBytes of quad port memory.
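A toy C model of this register organization is sketched below. The numbering convention (registers 0-15 private, 16-31 mapping to the GRF) and the access functions are assumptions for illustration only, not the documented register encoding.

```c
/* Toy model of the 4-node register organization: 16 private registers per
 * node plus 16 global registers shared by all nodes (16-bit data word). */
#include <stdint.h>
#include <stdio.h>

#define NODES        4
#define PRIVATE_REGS 16
#define GLOBAL_REGS  16

typedef struct {
    uint16_t prf[NODES][PRIVATE_REGS]; /* private register file per node            */
    uint16_t grf[GLOBAL_REGS];         /* global register file, common to all nodes */
} regfile_t;

/* registers 0..15 -> private, 16..31 -> global (assumed convention) */
static uint16_t reg_read(const regfile_t *rf, int node, int r)
{
    return (r < PRIVATE_REGS) ? rf->prf[node][r] : rf->grf[r - PRIVATE_REGS];
}

static void reg_write(regfile_t *rf, int node, int r, uint16_t v)
{
    if (r < PRIVATE_REGS) rf->prf[node][r] = v;          /* visible to this node only */
    else                  rf->grf[r - PRIVATE_REGS] = v; /* visible to all nodes      */
}

int main(void)
{
    regfile_t rf = {0};
    reg_write(&rf, 0, 17, 0xBEEF);                               /* node 0 writes a global register */
    printf("node 3 reads r17 = 0x%04X\n", reg_read(&rf, 3, 17)); /* shared value: 0xBEEF            */
    return 0;
}
```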

Instruction Set and Format
24-bit fixed-length instruction word. 70 instructions in total: 22 ALU instructions, 26 branch instructions, 3 instructions for synchronization, 14 load/store instructions, 4 instructions to handle interrupts.
Instruction formats (fields from bit 23 down to bit 0):
Opcode | Source 1 | Source 2 | Destination
Opcode | Immediate | Source 2 | Destination
Opcode | Immediate | Destination
Opcode | Source 1 | Immediate
Opcode | Immediate | Branch
Opcode | Source 1 | Branch
Opcode | Immediate
The three-register format places the opcode in bits 23:17, source 1 in 16:11, source 2 in 10:5 and the destination in 4:0; the other formats replace register fields by immediates or branch targets of varying width.
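A small C sketch of encoding and decoding the three-register ALU format is given below. The field positions follow the first format row above; the opcode value and register numbers in the example are purely hypothetical.

```c
/* Encode/decode sketch for the three-register format of the 24-bit
 * instruction word (opcode 23:17, src1 16:11, src2 10:5, dst 4:0). */
#include <stdint.h>
#include <stdio.h>

static uint32_t encode_rrr(unsigned op, unsigned src1, unsigned src2, unsigned dst)
{
    return ((op   & 0x7Fu) << 17) |
           ((src1 & 0x3Fu) << 11) |
           ((src2 & 0x3Fu) <<  5) |
            (dst  & 0x1Fu);
}

static void decode_rrr(uint32_t insn, unsigned *op, unsigned *src1,
                       unsigned *src2, unsigned *dst)
{
    *op   = (insn >> 17) & 0x7Fu;
    *src1 = (insn >> 11) & 0x3Fu;
    *src2 = (insn >>  5) & 0x3Fu;
    *dst  =  insn        & 0x1Fu;
}

int main(void)
{
    /* hypothetical ADD r3 <- r1 + r17 (registers 16..31 assumed to map to the GRF) */
    uint32_t insn = encode_rrr(0x01, 1, 17, 3);
    unsigned op, s1, s2, d;
    decode_rrr(insn, &op, &s1, &s2, &d);
    printf("insn=0x%06X op=%u src1=%u src2=%u dst=%u\n", insn, op, s1, s2, d);
    return 0;
}
```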


Quad Port Memory
[Block diagram: four write-data, address and WE input registers feed four write units; address decoders drive the word lines of the memory matrix; the bit cells connect through bit lines to sense amplifiers that deliver the four read-data outputs; write time t_w and access time t_acc are indicated on the clock.]

A Closer Look Inside the MPM
Full custom design used for the quad port 16-bit data memory and the 24-bit instruction memory. Delivers/receives data to/from 4 CPUs simultaneously. Maximum access time is about 2 ns (0.18 µm); the needed access time is 6 ns. Organized in blocks of 64 lines; the line width is parametrizable.
[Schematics: precharge circuitry with four bit-line pairs (bitline1-4 / not_bitline1-4) per column, bit cells with four word lines (word1-word4), and per-port sense amplifiers producing Dout1-Dout4; layout of four MPM blocks with additional test logic in a 0.35 µm process.]

Synchronization
Three instructions for synchronization: SEM sets the synchronization mask, SYN suspends the PC, SYT copies the synchronization register. This allows the implementation of flexible synchronization patterns.
Synchronization is implemented as a side effect of access to the GRF: each quarter of the synchronization register is assigned to one of the four CPUs, and a write into the GRF region assigned to a CPU writes '0' into its part of the synchronization register. [Example on the slide: SYN 0000 1000 0000 0000.]
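The GRF-based synchronization can be modelled in software as below. The one-bit-per-CPU layout, the clear-on-write side effect and the SEM/SYN behaviour are illustrative assumptions for this sketch, not the documented semantics of the real instructions.

```c
/* Toy model of GRF-based synchronization: each CPU owns one bit of a
 * synchronization register, SEM selects which partners to wait for, a GRF
 * write by a partner clears its bit as a side effect, and SYN stalls the
 * program counter while any selected bit is still set. */
#include <stdint.h>
#include <stdio.h>

enum { N_CPU = 4 };

static uint16_t sync_reg = 0x000F;   /* bits 0..3 set = CPU 0..3 has not written yet */
static uint16_t sync_mask[N_CPU];    /* per-CPU wait mask, programmed by SEM         */

static void op_sem(int cpu, uint16_t mask) { sync_mask[cpu] = mask; }

/* model of a GRF write by 'cpu': side effect clears that CPU's sync bit */
static void grf_write(int cpu, uint16_t value)
{
    (void)value;
    sync_reg &= (uint16_t)~(1u << cpu);
}

/* SYN: would the program counter of 'cpu' still be suspended? */
static int syn_stalled(int cpu) { return (sync_reg & sync_mask[cpu]) != 0; }

int main(void)
{
    op_sem(0, 0x000E);                        /* CPU 0 waits for CPUs 1, 2 and 3 */
    printf("stalled: %d\n", syn_stalled(0));  /* 1: partners have not written    */
    grf_write(1, 42); grf_write(2, 7); grf_write(3, 0);
    printf("stalled: %d\n", syn_stalled(0));  /* 0: CPU 0 may proceed            */
    return 0;
}
```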

Data I/O
Direct input from the preprocessor via multi-ported registers (Fit Register File). Private I/O from each CPU to external links. A system bus for peripherals with global I/O ports P1..PN; a priority arbiter selects which CPU (address and data) is granted the bus.
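A priority arbiter of the kind sketched on the slide can be modelled in a few lines of C; the fixed lowest-index-wins priority order is an assumption for illustration.

```c
/* Minimal fixed-priority arbiter sketch: given a 4-bit request vector
 * (bit i = CPU i requests the system bus), grant the lowest-numbered
 * requesting CPU. */
#include <stdio.h>

static int arbitrate(unsigned req)   /* returns granted CPU, or -1 if no request */
{
    for (int cpu = 0; cpu < 4; ++cpu)
        if (req & (1u << cpu))
            return cpu;
    return -1;
}

int main(void)
{
    printf("%d\n", arbitrate(0xC));  /* CPUs 2 and 3 request -> CPU 2 wins */
    return 0;
}
```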


Summary
High Energy Physics presents very interesting challenges in computer science (high throughput, low latency, low overhead, massively parallel processing).
The use of Multi Ported Memories (MPM) enables the integration of multiple generic processor cores as a MIMD unit.
The MPM as global register file allows zero-overhead, asynchronous multi-processor communication (semaphores, locks, etc.).
The MPM operating as buffer memory provides scalable, independent access to shared data structures.
The MPM operating as buffered crossbar switch allows tight integration of I/O for network and I/O processors.
For further information contact www.ti.uni-hd.de. US/PCT Patent Pending: 6,067,595.