Core Fusion: Accommodating Software Diversity in Chip Multiprocessors


1 Core Fusion: Accommodating Software Diversity in Chip Multiprocessors
Authors: Engin Ipek, Meyrem Kırman, Nevin Kırman, and Jose F. Martinez
Navreet Virk, Dept. of Computer & Information Sciences, University of Delaware

2 Outline
- Motivation for Core Fusion
- Architecture
- Dynamic Reconfiguration
- Experimental Setup
- Evaluation
- Conclusion

3 Introduction
- When running sequential workloads, a few powerful cores yield high utilization
- Sequential codes are not sufficient to sustain long-term performance scalability
- To fully exploit chip multiprocessors (CMPs), parallel programming is necessary

4 Motivation
- Programmers parallelize code incrementally
- A diverse array of software is likely
- Applications exert different demands on hardware across phases of the same run
- The composition of a CMP cannot be changed after fabrication
- Solution?

5 Motivation
- A CMP that can dynamically synthesize the right composition
- Core Fusion: a reconfigurable CMP architecture
  - Groups of fundamentally independent cores can dynamically morph into a larger CPU
  - The cores can still be used as distinct processing elements, as needed at run time by applications

6 Why Core Fusion
Benefits to CMP design:
- Support for software diversity
  - Fine-grain parallelism: many lean cores
  - Coarse-grain parallelism: fusing many cores into fewer, more powerful CPUs
  - Sequential code: executing on one fused core
- Support for smoother software evolution
  - Supports incremental parallelization by dynamically providing the optimal configuration for sequential and parallel regions

7 Why Core Fusion
- Single-design solution
  - Multiple modular structures can be tiled
- Optimized for parallel code
  - Good isolation across threads in parallel runs
- Design-bug and hard-fault resilience
  - A bug or hard fault in one core still allows operation of the 3 fault-free cores

8 Challenges for Core Fusion
- Should not increase software complexity
- Should work around the independent nature of the base cores
- Dynamic reconfiguration should be efficient

9 Architecture
- Builds on a substrate of identical, two-issue out-of-order cores
- A bus connects the private L1 i-caches and d-caches
- The other side of the bus has the L2 cache and the memory controller

10 Architecture
Figure: an 8-core CMP configured as 2 independent cores, a two-core fused group, and a four-core fused group.

11 Architecture - Front End
- Collective fetch
  - The Fetch Management Unit (FMU) receives and re-sends fetch information across cores
- Fetch mechanism and instruction cache
  - Each core fetches 2 instructions from its own i-cache
  - Fetch is aligned with core zero
  - On a taken branch, lower-order cores (those before the core that holds the target) skip fetch
  - On an i-cache miss, an 8-word block is delivered
    - to the requesting core, if the cores operate independently
    - to all 4 cores, if they are fused together

12 Architecture - Front End
Figure: i-cache organization.
(a) Independent cores: 4 sub-blocks and one tag within each i-cache constitute a cache block.
(b) Fused cores: a cache block spans 4 i-caches; each i-cache holds one sub-block and a tag replica.

13 Architecture - Front End
- Branches and subroutine calls
- Prediction
  - Each core accesses its own branch predictor and Branch Target Buffer (BTB)
  - To maximize utilization, the branch predictor and BTB are indexed with i address bits; t bits are used for tagging (tags are only meaningful in the BTB)

14 Architecture - Front End
- Cores squash overfetched instructions
- The new target starts at Core 1
- Core 0 skips the first fetch cycle
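The alignment rule from the collective-fetch slide, and the case shown here (target at Core 1, Core 0 skipping a cycle), can be sketched in a few lines. This is only an illustration: the 4-byte instruction size, the function name `fetch_plan`, and the choice to drop slots that precede the target inside the starting core are assumptions, not details taken from the slides.

```python
CORES = 4          # cores in the fused group (from the slides)
FETCH_WIDTH = 2    # instructions fetched per core per cycle (from the slides)
INSN_BYTES = 4     # assumption: fixed 4-byte instructions


def fetch_plan(target_pc):
    """Return {core: [PCs fetched]} for the first cycle after a redirect.

    Fetch is aligned with core zero, so the 8-instruction fetch block is
    split into consecutive 2-instruction slices, one per core.
    """
    block_bytes = CORES * FETCH_WIDTH * INSN_BYTES
    block_base = target_pc - (target_pc % block_bytes)
    target_slot = (target_pc - block_base) // INSN_BYTES
    first_core = target_slot // FETCH_WIDTH  # core whose slice holds the target

    plan = {}
    for core in range(CORES):
        if core < first_core:
            plan[core] = []  # lower-order core skips this fetch cycle
            continue
        slots = range(core * FETCH_WIDTH, (core + 1) * FETCH_WIDTH)
        # Assumption: slots before the target inside the first core are dropped.
        plan[core] = [block_base + s * INSN_BYTES for s in slots if s >= target_slot]
    return plan
```

For a target of `0x1008`, the third instruction of its aligned block, `fetch_plan` leaves Core 0 idle and starts fetch at Core 1, which matches the situation on this slide.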

15 Architecture - Front End
- Collective decode/rename
  - Instructions in a fetch group need to be renamed and steered
  - Coordinated by the Steering Management Unit (SMU)
- The SMU consists of:
  - a global steering table that tracks the mapping of architectural registers to any core
  - 4 free lists for register allocation
  - 4 rename maps
  - steering/renaming logic
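A toy model can make the roles of the SMU structures concrete. The sketch below is not the hardware design: the class `SteeringUnit`, its method names, the tuple format of the emitted operations, and the policy of copying from the lowest-numbered holder are all hypothetical. It only shows how a global steering table, per-core free lists, and per-core rename maps could cooperate, and where a cross-core copy would be requested. Registers are never freed here, which a real design would have to handle.

```python
from collections import deque

NUM_CORES = 4


class SteeringUnit:
    """Toy SMU: global steering table + one free list and rename map per core."""

    def __init__(self, phys_regs_per_core=8):
        # Architectural register -> set of cores that hold a valid copy.
        self.steering = {}
        self.rename = [dict() for _ in range(NUM_CORES)]
        self.free = [deque(range(phys_regs_per_core)) for _ in range(NUM_CORES)]

    def _alloc(self, core, areg):
        # Take a physical register from this core's free list.
        preg = self.free[core].popleft()
        self.rename[core][areg] = preg
        return preg

    def rename_and_steer(self, core, dest, srcs):
        """Rename one instruction steered to `core`; return operations to dispatch."""
        ops = []
        for src in srcs:
            holders = self.steering.get(src, set())
            if holders and core not in holders:
                # The value lives only on other cores: request a copy.
                producer = min(holders)  # hypothetical choice of source core
                preg = self._alloc(core, src)
                ops.append(("copy", producer, core, src, preg))
                holders.add(core)
        preg = None
        if dest is not None:
            preg = self._alloc(core, dest)
            # A new definition makes this core the only valid holder.
            self.steering[dest] = {core}
        ops.append(("exec", core, dest, preg))
        return ops
```

Steering a consumer of `r1` to a core that does not hold `r1` yields a `copy` operation followed by the `exec`; a second consumer on the same core needs no further copy because the steering table now lists that core as a holder.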

16 Architecture - Front End: SMU organization

17 Architecture - Back End
- Operand crossbar
  - Supports operand communication
- Wake-up and selection
  - Copy instructions are placed in a FIFO queue
  - Once issued, a copy wakes up dependent instructions and updates registers
- Reorder buffer and commit support
  - 4 ROBs commit in lockstep, up to 8 instructions per cycle
- Load/store queue
  - A banked-by-address load-store queue (LSQ) implementation keeps data coherent
  - Doesn't require cache flushes after dynamic reconfiguration
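Banking by address means that every access to a given location is handled by the same LSQ bank, so a load only needs to search one bank for an older matching store. The following sketch illustrates that property. The 32-byte interleaving granularity, the names `lsq_bank` and `BankedLSQ`, and the dictionary used as backing memory are assumptions for illustration; the slides do not say which address bits select the bank.

```python
NUM_BANKS = 4      # one LSQ bank per core in a four-core fused group
LINE_BYTES = 32    # assumption: banks are interleaved at 32-byte granularity


def lsq_bank(addr):
    """Pick the bank from the address alone, so it never depends on the issuing core."""
    return (addr // LINE_BYTES) % NUM_BANKS


class BankedLSQ:
    def __init__(self):
        # Each bank keeps its pending stores as (addr, value) in program order.
        self.banks = [[] for _ in range(NUM_BANKS)]

    def store(self, addr, value):
        self.banks[lsq_bank(addr)].append((addr, value))

    def load(self, addr, memory):
        # Search only the owning bank, youngest store first.
        for a, v in reversed(self.banks[lsq_bank(addr)]):
            if a == addr:
                return v
        return memory.get(addr, 0)
```

Because the bank is a pure function of the address, a store and a later load to the same location always meet in the same bank regardless of which core executed them.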

18 Dynamic Reconfiguration
- How do cores dynamically fuse or separate?
- Dynamic run-time reconfiguration is enabled through a simple application interface
- The application requests fusion/split actions through a pair of ISA instructions, FUSE and SPLIT
- FUSE and SPLIT are executed conditionally by hardware, based on the value of an OS-visible control register that indicates which cores are eligible for fusion

19 Dynamic Reconfiguration
- FUSE operation
  - The application may request that cores be fused to execute a sequential region
  - Instructions following the FUSE are flushed, as are the i-caches
  - The FMU, SMU, and i-caches are reconfigured
- SPLIT operation
  - The application advises the fused group of a parallel region using SPLIT
  - In-flight instructions are allowed to drain; the FMU and SMU are reconfigured
  - Core zero starts fetching from the instruction that follows the SPLIT in program order
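The conditional nature of FUSE and SPLIT can be modeled as a small state machine. In this sketch the control register is a bitmask with one bit per core, and a request that cannot be honored returns an empty action list. The class name `FusionController`, the bitmask encoding, and the behavior on a refused request are assumptions; the slides state only that execution depends on an OS-visible control register.

```python
class FusionController:
    """Toy model of conditional FUSE/SPLIT handling for one fusion group."""

    def __init__(self, eligible_mask):
        # Assumption: bit i set means the OS allows core i to take part in fusion.
        self.eligible_mask = eligible_mask
        self.fused = False

    def fuse(self, group_mask):
        """Return the reconfiguration steps, or [] if the request is not honored."""
        if self.fused or (group_mask & ~self.eligible_mask):
            return []  # assumption: an ineligible request has no effect
        self.fused = True
        return [
            "flush instructions after FUSE",
            "flush i-caches",
            "reconfigure FMU, SMU, i-caches",
        ]

    def split(self):
        if not self.fused:
            return []
        self.fused = False
        return [
            "drain in-flight instructions",
            "reconfigure FMU and SMU",
            "core zero resumes fetch after SPLIT",
        ]
```

With `FusionController(0b0011)`, a request to fuse all four cores (`0b1111`) is refused because cores 2 and 3 are not marked eligible, and the application simply keeps running on independent cores.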

20 Experimental Setup
- Compared Core Fusion against 5 static homogeneous and asymmetric CMP architectures
- Asymmetric chip multiprocessors (ACMPs) comprise cores of varying sizes and computational capabilities
- Two-, four-, and six-issue out-of-order cores were used as building blocks
- Estimated the area overhead and the latency of cross-core wiring for the experiments

21 Configuration of two-issue core

22 Experimental Setup Number and type of cores used

23 Experimental Setup
Applications: parallel, evolving parallel, and sequential workloads

24 Evaluation - Sequential Application Performance
Speedup over FineGrain-2i when executing SPECINT

25 Evaluation - Sequential Application Performance
Speedup over FineGrain-2i when executing SPECFP
- The six-issue monolithic core obtains average speedups of 73% and 47% on the floating-point and integer benchmarks, respectively

26 Evaluation - Parallel Application Performance
Speedup over a single-thread run on FineGrain-2i when executing parallel applications

27 Evaluation - Evolving Parallel Application Performance
Speedup over the stage-zero run on FineGrain-2i

28 Evaluation - Evolving Parallel Application Performance
- Performance differences between the best and worst architectures are high
- Core Fusion consistently performs the best or rides close to the best configuration

29 Conclusion
- Core Fusion allows relatively simple CMP cores to dynamically fuse into larger, more powerful processors
- The goal is to accommodate software diversity and to adapt dynamically to the changing demands of workloads
- The result is a flexible CMP architecture that adapts to diverse software applications
- Rewards incremental parallelization with higher performance along the development curve
- Requires no specialized compiler support, customized ISA, or higher software complexity
