EECS 570 Lecture 25 Genomics and Hardware Multi-threading

1 Lecture 25: Genomics and Hardware Multi-threading. Winter 2018. Prof. Satish Narayanasamy. Slides developed in part by Profs. Adve, Falsafi, Hill, Lebeck, Martin, Narayanasamy, Nowatzyk, Reinhardt, Singh, Smith, Torrellas and Wenisch. Special acknowledgement to Prof. Jerger of U. Toronto.

2 GenAX: A genome sequencing accelerator. Fujiki et al., ISCA
- Custom cloud for genomics

3 Genomics is set to disrupt medicine, agriculture, bio-tech, and more
- Pharmacogenomics
- Rare disease
- Cancer
- Disease control
- Food safety
- Genetic engineering

4 Genome sequencing is going to be as cheap as routine tests

5 Portable sequencers enable genome analysis in the field

6 Informatics is now the bottleneck
- To sequence 1 human genome:
  - 1.5 billion characters (A G T C)
  - ~300 GB
  - 1000 CPU hours
  - $70 on Amazon AWS

7 Genomics around the world
- $20 billion digital market in 2020; 10% CAGR
- THE WORLD'S LARGEST genetics research center isn't at Harvard or Stanford or even the NIH. It's

8 Big Data
- 300 GB per human genome
- 300 petabytes for just one million humans (> Facebook data)

9 Moore's law coming to an end?

10 Customized processors: Neural processors

11 Customized processors: Search Microsoft Catapult for Bing

12 Genome sequencing accelerator can reduce time, cost, and size
- $70, 30 hrs -> ~10x -> $7, 10 min -> ~100x? -> 1 min

13 Genome accelerator as a service on FPGA cloud

14 ASIC for Genome sequencing +

15 Goals
- Use hardware accelerators for improving efficiency by two orders of magnitude, and ensure privacy
  - First step: whole human genome secondary analysis in a few minutes (down from 1000 CPU hours)
  - Build privacy-preserving analysis using Intel SGX
- Enable low-cost accurate hand-held sequence analysis
  - First target: Detecting pathogens in the field by adding analysis logic to Oxford Nanopore's MinION
- Enable efficient and effective tertiary analysis
  - First target: Use machine learning for pharmacogenomics (collaboration with Brian Athey)
  - Liquid biopsy?
- Future potential
  - Gene editing (CRISPR)
  - Micro-biome
  - Agronomics

16 Initial goals
- Accelerate key kernels in GATK best practices pipeline (FASTQ->VCF)
- Source: Intel

17 Focus Kernels
- Approximate string matching
- Indexing
- Hidden Markov Model (HMM)
- Compression
- Approximate string matching could help

18 Approximate string matching using ULA

19 Accelerator efficiency vs GPU: An estimate
- Approximate string matching of two strings of length n for a maximum edit distance of k
                    GenAx accelerator   GPU
  Time complexity   O(n)                O(n^2)
  Cycles per step   1                   ~100
  Cycles for n = * * 100 * 100
- Space efficiency
  - GPU Volta has 21 billion transistors. Can fit 50, edit state machines in the same area.
  - 10x more cores.
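
To make "approximate string matching for a maximum edit distance of k" concrete, here is a minimal software sketch in Python. It is a banded dynamic-programming check, not the GenAx hardware design: the function name `edit_distance_within` is hypothetical, and the band trick (only cells within k of the diagonal can matter) is a standard software technique swapped in for illustration.

```python
from typing import Optional


def edit_distance_within(a: str, b: str, k: int) -> Optional[int]:
    """Return the edit distance between a and b if it is <= k, else None.

    Only the 2k+1 cells around the diagonal of each DP row are evaluated,
    because any cell further out already implies more than k edits.
    """
    if abs(len(a) - len(b)) > k:
        return None
    inf = k + 1  # any value above k means "too far"; cap there
    # prev[j] holds the distance between a[:i-1] and b[:j], capped at inf.
    prev = [j if j <= k else inf for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        if i <= k:
            cur[0] = i  # delete the first i characters of a
        lo = max(1, i - k)
        hi = min(len(b), i + k)
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitute
                         prev[j] + 1,         # delete from a
                         cur[j - 1] + 1,      # insert into a
                         inf)
        prev = cur
    return prev[len(b)] if prev[len(b)] <= k else None


if __name__ == "__main__":
    print(edit_distance_within("kitten", "sitting", 3))
    print(edit_distance_within("kitten", "sitting", 2))
```

For small fixed k the number of evaluated cells grows linearly with n, which is the software analogue of the linear-time behaviour the slide attributes to the accelerator.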

20 Quantum computing is no longer science fiction

21 Wenisch 2009 Hardware Multithreading Slide 21

22 Performance And Utilization
- Performance (IPC) important
- Utilization (actual IPC / peak IPC) important too
- Even moderate superscalars (e.g., 4-way) not fully utilized
  - Average sustained IPC: <50% utilization
  - Mis-predicted branches
  - Cache misses, especially L2
  - Data dependences
- Multi-threading (MT)
  - Improve utilization by multiplexing multiple threads on a single CPU
  - One thread cannot fully utilize CPU? Maybe 2, 4 (or 100) can

23 Latency vs Throughput
- MT trades (single-thread) latency for throughput
  - Sharing processor degrades latency of individual threads
  + But improves aggregate latency of both threads
  + Improves utilization
- Example
  - Thread A: individual latency=10s, latency with thread B=15s
  - Thread B: individual latency=20s, latency with thread A=25s
  - Sequential latency (first A then B or vice versa): 30s
  - Parallel latency (A and B simultaneously): 25s
  - MT slows each thread by 5s
  + But improves total latency by 5s
- Different workloads have different parallelism
  - SpecFP has lots of ILP (can use an 8-wide machine)
  - Server workloads have TLP (can use multiple threads)
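
The arithmetic in the thread A / thread B example can be checked with a few lines of Python. This is only a sketch of the bookkeeping; the helper name `mt_tradeoff` and the dictionary layout are assumptions made for illustration.

```python
def mt_tradeoff(alone, shared):
    """Compare running threads back to back against running them together.

    alone[t]  = latency of thread t when it has the core to itself
    shared[t] = latency of thread t when it shares the core
    """
    sequential = sum(alone.values())      # first A, then B (or vice versa)
    parallel = max(shared.values())       # both start together; last one done
    slowdown = {t: shared[t] - alone[t] for t in alone}
    return sequential, parallel, slowdown


if __name__ == "__main__":
    alone = {"A": 10, "B": 20}
    shared = {"A": 15, "B": 25}
    sequential, parallel, slowdown = mt_tradeoff(alone, shared)
    print(sequential, parallel, slowdown)  # 30 25 {'A': 5, 'B': 5}
```

Each thread loses 5s individually, while the pair finishes 5s sooner than the sequential schedule.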

24 Core Sharing
- Time sharing:
  - Run one thread
  - On a long-latency operation (e.g., cache miss), switch
  - Also known as switch-on-miss multithreading
  - E.g., Niagara (UltraSPARC T1/T2)
- Space sharing:
  - Across pipeline depth
    - Fetch and issue each cycle from a different thread
  - Both across pipeline width and depth
    - Fetch and issue each cycle from multiple threads
    - Policy to decide which to fetch gets complicated
    - Also known as simultaneous multithreading
    - E.g., Alpha 21464, IBM POWER5

25 Instruction Issue
[Figure; axis: Time]
- Reduced function unit utilization due to dependencies

26 Superscalar Issue
[Figure; axis: Time]
- Superscalar leads to more performance, but lower utilization

27 Predicated Issue
[Figure; axis: Time]
- Adds to function unit utilization, but results are thrown away

28 Chip Multiprocessor
[Figure; axis: Time]
- Limited utilization when only running one thread

29 Coarse-grain Multithreading
[Figure; axis: Time]
- Preserves single-thread performance, but can only hide long latencies (i.e., main memory accesses)

30 Coarse-Grain Multithreading (CGMT)
+ Sacrifices very little single thread performance (of one thread)
- Tolerates only long latencies (e.g., L2 misses)
- Thread scheduling policy
  - Designate a preferred thread (e.g., thread A)
  - Switch to thread B on thread A L2 miss
  - Switch back to A when A L2 miss returns
- Pipeline partitioning
  - None, flush on switch
  - Can't tolerate latencies shorter than twice pipeline depth
  - Need short in-order pipeline for good performance
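
The switch-on-miss policy can be sketched as a toy cycle counter in Python. Everything here is a simplifying assumption for illustration: threads are strings where `c` is a one-cycle operation and `m` is an operation that misses in the L2, the lowest-numbered runnable thread is the preferred one, and every switch costs a fixed `switch_penalty` standing in for the pipeline flush.

```python
def run_cgmt(threads, miss_latency, switch_penalty):
    """Return the cycles needed to finish all threads under switch-on-miss."""
    n = len(threads)
    pc = [0] * n          # next operation index per thread
    ready_at = [0] * n    # cycle at which each thread's pending miss returns
    cycle = 0
    current = 0           # thread currently occupying the pipeline
    while any(pc[t] < len(threads[t]) for t in range(n)):
        unfinished = [t for t in range(n) if pc[t] < len(threads[t])]
        runnable = [t for t in unfinished if ready_at[t] <= cycle]
        if not runnable:
            # Every unfinished thread is waiting on a miss: idle until one returns.
            cycle = min(ready_at[t] for t in unfinished)
            continue
        preferred = runnable[0]          # lowest index is the preferred thread
        if preferred != current:
            current = preferred
            cycle += switch_penalty      # flush and refill on a switch
        op = threads[current][pc[current]]
        pc[current] += 1
        cycle += 1
        if op == "m":
            ready_at[current] = cycle + miss_latency
    return cycle


if __name__ == "__main__":
    together = run_cgmt(["cmc", "cccccccc"], miss_latency=10, switch_penalty=2)
    alone = (run_cgmt(["cmc"], 10, 2) + run_cgmt(["cccccccc"], 10, 2))
    print(together, alone)  # 15 21
```

With a 10-cycle miss, thread B's work fits inside thread A's stall. If `miss_latency` is made comparable to `switch_penalty`, the switches cost more than they hide, which is the point of the "twice pipeline depth" bullet.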

31 CGMT
[Figure: baseline pipeline (I$, BP, regfile, D$) next to the CGMT pipeline, which adds a thread scheduler and a second regfile and switches on "L2 miss?"]

32 Fine Grained Multithreading
[Figure; axis: Time]
- Saturated workload -> Lots of threads
- Unsaturated workload -> Lots of stalls
- Intra-thread dependencies still limit performance

33 Fine-Grain Multithreading (FGMT)
- Sacrifices significant single thread performance
+ Tolerates all latencies (e.g., L2 misses, mispredicted branches, etc.)
- Thread scheduling policy
  - Switch threads every cycle (round-robin), L2 miss or no
- Pipeline partitioning
  - Dynamic, no flushing
  - Length of pipeline doesn't matter
- Need a lot of threads
  - Many threads -> many register files
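
A small Python model shows why FGMT needs a lot of threads. The model is an assumption made for illustration: one issue slot per cycle, round-robin selection that skips threads that are not ready, and every instruction depending on the previous one in its own thread with a fixed `latency`.

```python
def run_fgmt(num_threads, insns_per_thread, latency):
    """Return the cycles needed to issue every instruction of every thread."""
    remaining = [insns_per_thread] * num_threads
    ready_at = [0] * num_threads   # earliest cycle each thread may issue again
    cycle = 0
    nxt = 0                        # round-robin pointer
    while any(remaining):
        for offset in range(num_threads):
            t = (nxt + offset) % num_threads
            if remaining[t] and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + latency  # next insn waits for this one
                nxt = (t + 1) % num_threads
                break
        cycle += 1                 # the slot is wasted if nobody was ready
    return cycle


if __name__ == "__main__":
    for threads in (1, 2, 4):
        cycles = run_fgmt(threads, insns_per_thread=4, latency=4)
        print(threads, "threads:", threads * 4, "insns in", cycles, "cycles")
```

With a 4-cycle dependence latency, four threads keep the issue slot busy every cycle, while one or two threads leave most slots empty.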

34 Fine-Grain Multithreading
- FGMT
  - (Many) more threads
  - Multiple threads in pipeline at once
[Figure: pipeline (I$, BP, D$) with a thread scheduler and four regfiles]

35 Simultaneous Multithreading
[Figure; axis: Time]
- Maximum utilization of function units by independent operations

36 Simultaneous Multithreading (SMT)
- Can we multithread an out-of-order machine?
  - Don't want to give up performance benefits
  - Don't want to give up natural tolerance of D$ (L1) miss latency
- Simultaneous multithreading (SMT)
  + Tolerates all latencies (e.g., L2 misses, mispredicted branches)
  ± Sacrifices some single thread performance
  - Thread scheduling policy
    - Round-robin (just like FGMT)
  - Pipeline partitioning
    - Dynamic, hmmm
  - Example: Pentium4 (hyper-threading): 5-way issue, 2 threads
  - Another example: Alpha 21464: 8-way issue, 4 threads (canceled)

37 Simultaneous Multithreading (SMT)
- SMT
  - Replicate map table, share physical register file
[Figure: baseline out-of-order pipeline (I$, BP, map table, regfile, D$) next to the SMT pipeline with a thread scheduler, per-thread map tables, and a shared regfile]

38 Issues for SMT
- Cache interference
  - General concern for all MT variants
  - Can the working sets of multiple threads fit in the caches?
  - Shared memory SPMD threads help here
    + Same insns -> share I$
    + Shared data -> less D$ contention
    - MT is good for server workloads
  - To keep miss rates low, SMT might need a larger L2 (which is OK)
    - Out-of-order tolerates L1 misses
- Large map table and physical register file
  - #mt-entries = (#threads * #arch-regs)
  - #phys-regs = (#threads * #arch-regs) + #in-flight insns
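
The two sizing formulas on this slide are easy to evaluate for a concrete configuration. The sketch below only restates them in Python; the function name `smt_rename_sizes` and the example numbers (2 threads, 32 architectural registers, 128 in-flight instructions) are hypothetical and not taken from any particular processor.

```python
def smt_rename_sizes(threads, arch_regs, in_flight_insns):
    """Return (map-table entries, physical registers) for an SMT core.

    Each thread needs its own mapping for every architectural register,
    and each in-flight instruction may need one extra physical register.
    """
    mt_entries = threads * arch_regs
    phys_regs = threads * arch_regs + in_flight_insns
    return mt_entries, phys_regs


if __name__ == "__main__":
    print(smt_rename_sizes(threads=2, arch_regs=32, in_flight_insns=128))
    # (64, 192)
```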

39 SMT Resource Partitioning
- How are ROB/MOB, RS partitioned in SMT?
  - Depends on what you want to achieve
- Static partitioning
  - Divide ROB/MOB, RS into T static equal-sized partitions
  + Ensures that low-IPC threads don't starve high-IPC ones
    - Low-IPC threads stall and occupy ROB/MOB, RS slots
  - Low utilization
- Dynamic partitioning
  - Divide ROB/MOB, RS into dynamically resizing partitions
  - Let threads fight amongst themselves
  + High utilization
  - Possible starvation
  - ICOUNT: fetch policy prefers thread with fewest in-flight insns
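
The ICOUNT rule from the last bullet can be written as a one-line selection in Python. This is a sketch of the stated policy only; the name `icount_pick`, the thread labels, and the tie-break by thread name are assumptions added for illustration.

```python
def icount_pick(in_flight):
    """Pick the thread to fetch from: the one with the fewest in-flight insns.

    in_flight maps a thread name to its current count of fetched but not yet
    retired instructions. Ties are broken by thread name (an assumption).
    """
    return min(in_flight, key=lambda t: (in_flight[t], t))


if __name__ == "__main__":
    counts = {"T0": 12, "T1": 5, "T2": 9}
    print(icount_pick(counts))  # T1
```

A thread stalled on a long miss keeps a high count and therefore stops receiving fetch slots, which limits how many shared ROB/RS entries it can hold while dynamic partitioning is in use.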

40 SMT vs. CMP
- If you wanted to run multiple threads would you build a
  - Chip multiprocessor (CMP): multiple separate pipelines?
  - A multithreaded processor (SMT): a single larger pipeline?
- Both will get you throughput on multiple threads
  - CMP will be simpler, possibly faster clock
  - SMT will get you better performance (IPC) on a single thread
    - SMT is basically an ILP engine that converts TLP to ILP
    - CMP is mainly a TLP engine
- Again, do both
  - Sun's Niagara (UltraSPARC T1)
  - 8 processors, each with 4-threads (coarse-grained threading)
  - 1 GHz clock, in-order, short pipeline (6 stages or so)
  - Designed for power-efficient throughput computing

41 Final Review. Winter 2018. Prof. Satish Narayanasamy

42 Motivations for parallelism; Amdahl's Law; Power; Stuff from the first half
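
As a refresher on the Amdahl's Law item, the usual statement is that if a fraction p of the work can be spread over n processors, the speedup is 1 / ((1 - p) + p / n). A minimal Python sketch, with the function name `amdahl_speedup` chosen here for illustration:

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup when only parallel_fraction of the work scales with processors."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

With 90% of the work parallel, the speedup approaches but never reaches 10x, no matter how many processors are added.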

43 Directory Coherence
- Directory representations
  - Limited pointer, coarse vector, linked list
- 3-hop vs. 4-hop protocols
- Transient states
- Sharing patterns
  - Migratory
  - Producer-consumer
- Coherence predictors

44 Memory Consistency
- Language-level consistency
  - DRF (data-races and programmer annotations)
  - End-to-end SC
- Store atomicity
- Relaxing Store-to-Load ordering
  - Processor consistency / TSO
  - What kind of optimizations does this allow?
- Relaxing all ordering
  - Enforce order via fences (RMO, Alpha, PowerPC, ARM)
  - Enforce order via annotations (Weak ordering, Release consistency)
- What does the HW look like?

45 Speculative Memory Consistency
- Why does speculation work?
- How to detect ordering violations?
- How does the hardware work?
- What are the advantages of continuous speculation?
- How does this relate to transactional memory?

46 Network-on-chip
- Bus vs. point-to-point
- Advantage of NoCs over irregular point-to-point networks
- Topologies
  - Understand trade-offs at a high level

47 Routing & flow control
- Deterministic, oblivious, adaptive routing
- How to avoid deadlocks?
- Circuit vs. packet switching
- Flit-level flow control & wormhole routing
- Virtual channels
- Credit-based and on-off flow control
- How to determine required buffers to avoid flow-control stalls?

48 Router Microarchitecture

49 Baseline router pipeline
- BW RC VA SA ST LT
- Canonical 5-stage (+link) pipeline
  - BW: Buffer Write
  - RC: Routing computation
  - VA: Virtual Channel Allocation
  - SA: Switch Allocation
  - ST: Switch Traversal
  - LT: Link Traversal
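
A rough way to reason about this pipeline is to count cycles for an uncontended packet. The Python sketch below rests on assumptions that are not stated on the slide: each stage (including link traversal) takes one cycle, there is no contention, and body flits follow the head flit one cycle apart. The function name `zero_load_latency` is hypothetical.

```python
STAGES = ["BW", "RC", "VA", "SA", "ST", "LT"]  # 5 router stages + link


def zero_load_latency(hops, flits_per_packet):
    """Cycles until the tail flit arrives, assuming no contention.

    The head flit spends one cycle in every stage at every hop; the
    remaining flits stream behind it, one per cycle.
    """
    head_latency = hops * len(STAGES)
    serialization = flits_per_packet - 1
    return head_latency + serialization


if __name__ == "__main__":
    print(zero_load_latency(hops=3, flits_per_packet=5))  # 22
```

Under these assumptions the per-hop cost is dominated by the pipeline depth, so a shorter router pipeline directly reduces the head latency.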
