A MIMD Multi Threaded Processor


A MIMD Multi Threaded Processor
Falk Lesser; V. Angelov, J. de Cuveland, V. Lindenstruth, C. Reichling, R. Schneider, M.W. Schulz
Kirchhoff Institute for Physics, University Heidelberg, Germany
Phone: +49 6221 54 4304
Email: ti@kip.uni-heidelberg.de
Email (speaker): lesser@kip.uni-heidelberg.de
WWW: http://www.ti.uni-hd.de
Outline: Introduction, Application for the MIMD processor, Electronics environment, The MIMD processor architecture, Summary

High Energy Physics
Study of the fundamental constituents of matter and the forces between them: quarks, electrons, neutrinos, the photon, Z⁰, W±, etc.
Higher and higher energies are required to delve deeper and deeper into matter, i.e. larger, more powerful and more expensive machines, no longer affordable by individual countries.

CERN
Conseil Européen pour la Recherche Nucléaire, located in the Geneva area, Switzerland: a high energy and nuclear physics research center.
Established in 1954 to kick-start research in Europe. Currently funded by 20 European countries (member states and participating non-member states). Users: ~6500 scientists from 500 institutes in ~80 countries.
Next main research program: the Large Hadron Collider. Start of operation: 2005, data taking for ~20 years.

Large Hadron Collider & The Alps
[Aerial view: the LHC ring near the Geneva international airport.] Superconducting accelerator, 27 km circumference, ~100 m deep, 4 interaction points. Collision energies: 7 TeV p-p, 5.5 TeV Pb-Pb.

Nucleons In Collision
8000 Pb-Pb collisions per second, the nuclei moving at ~ the speed of light. Each interaction generates >24,000 particles in the acceptance of the detector.
Task: find one specific particle pair out of 16,000 charged particles within 6 µs.
Within these 6 µs: digitize 1.2 million data channels @ 10 MHz/10 bit, process 29 MBytes, form a global decision.
The main goal is to create the quark-gluon plasma, a state of matter existing during the first few microseconds after the big bang.

A Particle Detector named ALICE
ITS: 2.62 million channels (1 bit); TPC: 570,000 channels (10 bit); TRD: 1.2 million channels (10 bit); 1.425 million ADCs.
Digitization rate: 10 MHz/10 bit. Peak data rate: 17.8 TB/sec. 75,000 MIMD processors, computing time of 6 µs, target clock rate: 120 MHz. Magnet: 0.4 Tesla.
The detector measures particle trajectories and momentum and provides particle identification.

TRD Electronics Overview
1.2 million analog channels @ 10 MHz; 16 channels per module; 75,000 sources; 1 trigger bit.
Track segment processing on the chamber; maximum latency 6 µs; ~20 space points per tracklet; 4-6 tracklets (layers) per track.
[Schematic: Multi-Chip-Module (MCM) with 16 detector channels, preamplifier and digital chip with µCPU, supplies AGND/DGND/APWR/DPWR, and trigger/data links to the neighbouring MCMs within the padrow.]
6 chips (96 channels); 24 padrows per chamber; 18 subdivisions in φ; 5 subdivisions in z.

Tracklet Fit Concept
Pipelined calculation over the ~20 time bins of an MCM (pads 0-15, neighbouring MCMs m and m+1).
During drift time, accumulate the fit sums: N (hit count), the position sums Σ y_i and Σ y_i², the time-bin sums Σ x_i and Σ x_i², and the mixed sum Σ x_i y_i.
After drift time, compute a (intercept), b (slope) and χ² (track quality), and merge track segments in the padrow:
a = (Σ x_i² · Σ y_i − Σ x_i · Σ x_i y_i) / (N · Σ x_i² − (Σ x_i)²)
b = (N · Σ x_i y_i − Σ x_i · Σ y_i) / (N · Σ x_i² − (Σ x_i)²)
[Pipeline stages per time bin: build the amplitude ratio of neighbouring pads (A_{P-1}, A_P, A_{P+1}) via a LUT, calculate the position, calculate the fit summands, accumulate the sums in the sum memory.]
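The accumulate-then-solve scheme can be sketched in C as below. This is a minimal floating-point model of the straight-line fit, not the actual MCM implementation, which works in fixed point with a LUT-based division; the toy data in main() is purely illustrative.

```c
/* Minimal sketch of the tracklet fit: accumulate sums during drift time,
 * then solve the straight-line fit y = a + b*x after drift time. */
#include <stdio.h>

typedef struct {
    int    n;                       /* hit count N                          */
    double sx, sy, sxy, sxx, syy;   /* running sums Σx, Σy, Σxy, Σx², Σy²   */
} fit_sums;

/* called once per time bin during drift time */
static void accumulate(fit_sums *s, double x /* time bin */, double y /* position */)
{
    s->n   += 1;
    s->sx  += x;      s->sy  += y;
    s->sxy += x * y;
    s->sxx += x * x;  s->syy += y * y;   /* Σy² would feed the χ² computation */
}

/* called once after drift time */
static void solve(const fit_sums *s, double *a, double *b)
{
    double det = s->n * s->sxx - s->sx * s->sx;    /* N·Σx² - (Σx)²  */
    *a = (s->sxx * s->sy - s->sx * s->sxy) / det;  /* intercept      */
    *b = (s->n * s->sxy - s->sx * s->sy) / det;    /* slope          */
}

int main(void)
{
    fit_sums s = {0};
    for (int t = 0; t < 20; ++t)          /* toy hits lying on y = 1 + 2x */
        accumulate(&s, t, 1.0 + 2.0 * t);
    double a, b;
    solve(&s, &a, &b);
    printf("intercept a = %.3f, slope b = %.3f\n", a, b);
    return 0;
}
```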

TRD Trigger Timing and Data Flow
Data reduction: 29 MByte → 4.35 MByte → 1.14 MByte → 86.4 kByte → 1 bit. Data bandwidth: 18 TByte → 1.85 TByte.
[Timing diagram on the LHC clock (40.079 MHz / 24.9501 ns): drift time over the visible TEC drift area, ADC pipeline latency and ADC output, calculation of the fit-relevant Σ in the pipeline, tracklet calculation, data shipping and global tracking; ADCs at 10 MHz, processing at 120 MHz.]
Highly I/O bound: 17.8 TB/sec. Very tight time budget: 1.8 + 1.5 µs.

Next Step: Compute and Combine
[Diagram: Readout Board (RB) topology with numbered MCM positions, left/right links (L.., R..) between neighbouring boards, and four φ columns (φ0-φ3).]

Computer Architecture Trends
[Figures after Culler and Patterson, UC Berkeley: transistor count per microprocessor from 1970 to 2005 (i4004 ... Pentium, R10000), with successive eras of bit-level, instruction-level and (?) thread-level parallelism; histogram of the fraction of total cycles vs. the number of instructions issued; processor-memory performance gap — µproc performance grows 60%/yr (2x/1.5 yr), DRAM 6%/yr (2x/15 yrs), so the gap grows ~50% per year.]

Why MIMD with Shared Memory?
1.5 µs computing time @ 120 MHz → 180 clock cycles. ~120 clock cycles are required to finish one task, and four tasks have to be done → ~480 cycles needed. Since 480 cycles do not fit into the 180-cycle budget, the processor must execute up to four tasks on different data objects in parallel.
Arithmetic operations should be executed in one clock cycle. Data has to be shared without overhead → quad ported SRAM (QPM) and a Global Register File (GRF).
Four tasks → four CPUs; same code → a quad port instruction memory saves chip area → MIMD.
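Written out, the budget argument is as follows; the final step from the arithmetic minimum to four CPUs comes from assigning one task to each CPU:

\[
1.5\,\mu\mathrm{s} \times 120\,\mathrm{MHz} = 180 \ \text{cycles available}, \qquad
4 \times 120 \ \text{cycles} = 480 \ \text{cycles needed}
\]
\[
\frac{480}{180} \approx 2.7 \;\Rightarrow\; \text{at least 3 parallel CPUs; one task per CPU} \;\Rightarrow\; 4 \ \text{CPUs}
\]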

Generic Processor Architecture
[Block diagram: four independent generic CPUs, each with its own sequencer and instruction memory, source bus, ALU, register file, load/store units, write-back bus, internal RAM, I/O and a preprocessor interface.]
To execute up to four tasks in real time, four CPUs are needed, but independent CPUs can't share data or program.

MIMD Architecture
[Block diagram: a shared instruction memory (I-Mem) feeds four sequencers; four ALUs with private register files PRF1-PRF4 and a Global Register File (GRF) sit on shared source and write-back data buses, together with load/store units, internal RAM, an arbiter, the FIT unit, the pre-processor interface (projection of pre-processor data inside the MIMD) and external I/O.]
Four Harvard CPUs coupled by a multi-port instruction memory, a multi-port data memory and the Global Register File (GRF). Register-based interface to the pre-processor. Two-stage pipeline (fetch/decode, execute/write back). Four ALUs; data is shared via the GRF and the D-Mem. The architecture can be adapted to any general purpose processor.

Some Features
16-bit data word. 16 private registers per node. 16 global registers common to all nodes. 24 KBytes of quad port memory.
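A toy C model of this register organization is sketched below. The numbering convention (registers 0-15 private, 16-31 mapping to the GRF) and the access functions are assumptions for illustration only, not the documented register encoding.

```c
/* Toy model of the 4-node register organization: 16 private registers per
 * node plus 16 global registers shared by all nodes (16-bit data word). */
#include <stdint.h>
#include <stdio.h>

#define NODES        4
#define PRIVATE_REGS 16
#define GLOBAL_REGS  16

typedef struct {
    uint16_t prf[NODES][PRIVATE_REGS]; /* private register file per node            */
    uint16_t grf[GLOBAL_REGS];         /* global register file, common to all nodes */
} regfile_t;

/* registers 0..15 -> private, 16..31 -> global (assumed convention) */
static uint16_t reg_read(const regfile_t *rf, int node, int r)
{
    return (r < PRIVATE_REGS) ? rf->prf[node][r] : rf->grf[r - PRIVATE_REGS];
}

static void reg_write(regfile_t *rf, int node, int r, uint16_t v)
{
    if (r < PRIVATE_REGS) rf->prf[node][r] = v;          /* visible to this node only */
    else                  rf->grf[r - PRIVATE_REGS] = v; /* visible to all nodes      */
}

int main(void)
{
    regfile_t rf = {0};
    reg_write(&rf, 0, 17, 0xBEEF);                               /* node 0 writes a global register */
    printf("node 3 reads r17 = 0x%04X\n", reg_read(&rf, 3, 17)); /* shared value: 0xBEEF            */
    return 0;
}
```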

Instruction Set and Format
24-bit fixed-length instruction word. 70 instructions in total: 22 ALU instructions, 26 branch instructions, 3 instructions for synchronization, 14 load/store instructions, 4 instructions to handle interrupts.
Instruction formats (fields from bit 23 down to bit 0):
Opcode | Source 1 | Source 2 | Destination
Opcode | Immediate | Source 2 | Destination
Opcode | Immediate | Destination
Opcode | Source 1 | Immediate
Opcode | Immediate | Branch
Opcode | Source 1 | Branch
Opcode | Immediate
The three-register format places the opcode in bits 23:17, source 1 in 16:11, source 2 in 10:5 and the destination in 4:0; the other formats replace register fields by immediates or branch targets of varying width.
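A small C sketch of encoding and decoding the three-register ALU format is given below. The field positions follow the first format row above; the opcode value and register numbers in the example are purely hypothetical.

```c
/* Encode/decode sketch for the three-register format of the 24-bit
 * instruction word (opcode 23:17, src1 16:11, src2 10:5, dst 4:0). */
#include <stdint.h>
#include <stdio.h>

static uint32_t encode_rrr(unsigned op, unsigned src1, unsigned src2, unsigned dst)
{
    return ((op   & 0x7Fu) << 17) |
           ((src1 & 0x3Fu) << 11) |
           ((src2 & 0x3Fu) <<  5) |
            (dst  & 0x1Fu);
}

static void decode_rrr(uint32_t insn, unsigned *op, unsigned *src1,
                       unsigned *src2, unsigned *dst)
{
    *op   = (insn >> 17) & 0x7Fu;
    *src1 = (insn >> 11) & 0x3Fu;
    *src2 = (insn >>  5) & 0x3Fu;
    *dst  =  insn        & 0x1Fu;
}

int main(void)
{
    /* hypothetical ADD r3 <- r1 + r17 (registers 16..31 assumed to map to the GRF) */
    uint32_t insn = encode_rrr(0x01, 1, 17, 3);
    unsigned op, s1, s2, d;
    decode_rrr(insn, &op, &s1, &s2, &d);
    printf("insn=0x%06X op=%u src1=%u src2=%u dst=%u\n", insn, op, s1, s2, d);
    return 0;
}
```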


Quad Port Memory
[Block diagram: four write-data, address and WE input registers feed four write units; address decoders drive the word lines of the memory matrix; the bit cells connect through bit lines to sense amplifiers that deliver the four read-data outputs; write time t_w and access time t_acc are indicated on the clock.]

A Closer Look Inside the MPM
Full custom design used for the quad port 16-bit data memory and the 24-bit instruction memory. Delivers/receives data to/from 4 CPUs simultaneously. Maximum access time is about 2 ns (0.18 µm); the needed access time is 6 ns. Organized in blocks of 64 lines; the line width is parametrizable.
[Schematics: precharge circuitry with four bit-line pairs (bitline1-4 / not_bitline1-4) per column, bit cells with four word lines (word1-word4), and per-port sense amplifiers producing Dout1-Dout4; layout of four MPM blocks with additional test logic in a 0.35 µm process.]

Synchronization
Three instructions for synchronization: SEM sets the synchronization mask, SYN suspends the PC, SYT copies the synchronization register. This allows the implementation of flexible synchronization patterns.
Synchronization is implemented as a side effect of access to the GRF: each quarter of the synchronization register is assigned to one of the four CPUs, and a write into the GRF region assigned to a CPU writes '0' into its part of the synchronization register. [Example on the slide: SYN 0000 1000 0000 0000.]
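The GRF-based synchronization can be modelled in software as below. The one-bit-per-CPU layout, the clear-on-write side effect and the SEM/SYN behaviour are illustrative assumptions for this sketch, not the documented semantics of the real instructions.

```c
/* Toy model of GRF-based synchronization: each CPU owns one bit of a
 * synchronization register, SEM selects which partners to wait for, a GRF
 * write by a partner clears its bit as a side effect, and SYN stalls the
 * program counter while any selected bit is still set. */
#include <stdint.h>
#include <stdio.h>

enum { N_CPU = 4 };

static uint16_t sync_reg = 0x000F;   /* bits 0..3 set = CPU 0..3 has not written yet */
static uint16_t sync_mask[N_CPU];    /* per-CPU wait mask, programmed by SEM         */

static void op_sem(int cpu, uint16_t mask) { sync_mask[cpu] = mask; }

/* model of a GRF write by 'cpu': side effect clears that CPU's sync bit */
static void grf_write(int cpu, uint16_t value)
{
    (void)value;
    sync_reg &= (uint16_t)~(1u << cpu);
}

/* SYN: would the program counter of 'cpu' still be suspended? */
static int syn_stalled(int cpu) { return (sync_reg & sync_mask[cpu]) != 0; }

int main(void)
{
    op_sem(0, 0x000E);                        /* CPU 0 waits for CPUs 1, 2 and 3 */
    printf("stalled: %d\n", syn_stalled(0));  /* 1: partners have not written    */
    grf_write(1, 42); grf_write(2, 7); grf_write(3, 0);
    printf("stalled: %d\n", syn_stalled(0));  /* 0: CPU 0 may proceed            */
    return 0;
}
```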

Data I/O
Direct input from the preprocessor via multi-ported registers (Fit Register File). Private I/O from each CPU to external links. A system bus for peripherals with global I/O ports P1..PN; a priority arbiter selects which CPU (address and data) is granted the bus.
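A priority arbiter of the kind sketched on the slide can be modelled in a few lines of C; the fixed lowest-index-wins priority order is an assumption for illustration.

```c
/* Minimal fixed-priority arbiter sketch: given a 4-bit request vector
 * (bit i = CPU i requests the system bus), grant the lowest-numbered
 * requesting CPU. */
#include <stdio.h>

static int arbitrate(unsigned req)   /* returns granted CPU, or -1 if no request */
{
    for (int cpu = 0; cpu < 4; ++cpu)
        if (req & (1u << cpu))
            return cpu;
    return -1;
}

int main(void)
{
    printf("%d\n", arbitrate(0xC));  /* CPUs 2 and 3 request -> CPU 2 wins */
    return 0;
}
```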


Summary
High Energy Physics presents very interesting challenges in computer science (high throughput, low latency, low overhead, massively parallel processing).
The use of Multi Ported Memories (MPM) enables the integration of multiple generic processor cores as a MIMD unit.
The MPM as global register file allows zero-overhead, asynchronous multi-processor communication (semaphores, locks, etc.).
The MPM operating as buffer memory provides scalable, independent access to shared data structures.
The MPM operating as buffered crossbar switch allows tight integration of I/O for network and I/O processors.
For further information contact www.ti.uni-hd.de. US/PCT Patent Pending: 6,067,595.