Reconfigurable and Self-optimizing Multicore Architectures Presented by: Naveen Sundarraj 1 11/9/2012
OUTLINE Introduction Motivation Reconfiguration Performance evaluation (reconfiguration) Self-optimization Performance evaluation (self-optimization) Applications of RL in computer systems Conclusion
Motivation Transistor counts double roughly every two years (Moore's Law). Chip Multiprocessors (CMPs) are an attractive alternative to monolithic processors for translating transistor budgets into performance improvements. CMPs have performance limitations: software overhead limits exploiting the full potential of these chips, and software must expose exponentially increasing levels of TLP.
Introduction To meet the challenges created by the adoption and scaling of multicore architectures, we explore versatile CMP architectures. Solution: A reconfigurable CMP substrate that can accommodate software at different stages of parallelization by allowing the granularity of the architecture to be changed at runtime. A self-optimizing memory controller that learns to optimize its scheduling policy on the fly, and adapts to changing memory reference streams and workload demands via runtime interaction with the system.
Reconfiguration Reconfiguration is achieved through a novel mechanism called core fusion. Core fusion: an architectural technique that empowers groups of relatively small, independent CMP cores with the ability to fuse into one large CPU on demand. Benefits: support for software diversity; support for smoother software evolution; a single-design solution optimized for parallel code; design-bug and hard-fault resilience.
Core Fusion - Design Challenges Increase in software complexity; restructuring the base cores; effective dynamic reconfiguration. Hardware solutions: reconfigurable, distributed front-end and i-cache; effective remote wake-up mechanism; reconfigurable, distributed load/store queue and d-cache; reconfigurable, distributed ROB organization.
Core Fusion - Architecture A bus connects the L1 i- and d-caches and provides data coherence. The on-chip memory controller resides on the other side of the bus. Cores can execute independently if desired, and groups of two or four cores can be fused to constitute larger cores.
Modifications to achieve core fusion Front end Fetch mechanism and Instruction cache Branch prediction Return Address Stack Global History Registers Handling Fetch Stalls Collective Decode/Rename
Fetch Mechanism and Instruction Cache Collective fetch: a small coordinating unit called the Fetch Management Unit (FMU) facilitates collective fetch. Each core fetches two instructions from its own i-cache every cycle, for a total of eight instructions in fused mode. On an i-cache miss, an eight-word block is (a) delivered to the requesting core if it is operating independently, or (b) distributed across all four cores in a fused configuration to permit collective fetch. To support this mechanism, the i-caches are made reconfigurable.
Reconfigurable i-cache Each i-cache has enough tags to organize its data in two-word sub-blocks. When running independently, four such sub-blocks and one tag make up a cache block. When fused, cache blocks span all four i-caches, with each i-cache holding one sub-block and a replica of the cache block's tag.
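The fused-mode layout above can be sketched as a small address-mapping function. This is an illustrative model only, assuming 2-word sub-blocks, four cores, and 8-word cache blocks as described on the slide; the function names are hypothetical.

```python
# Illustrative sketch of fused-mode i-cache sub-block mapping, assuming
# 8-word cache blocks split into four 2-word sub-blocks, one per core.
WORDS_PER_SUBBLOCK = 2
NUM_CORES = 4
BLOCK_WORDS = WORDS_PER_SUBBLOCK * NUM_CORES  # 8-word block

def fused_location(word_addr):
    """Return (core, offset) holding a given instruction word in fused mode."""
    offset_in_block = word_addr % BLOCK_WORDS
    core = offset_in_block // WORDS_PER_SUBBLOCK   # which core's i-cache
    offset = offset_in_block % WORDS_PER_SUBBLOCK  # offset within sub-block
    return core, offset
```

Each core thus holds a contiguous two-word slice of every fused block, which is what lets all four cores fetch their slices in parallel each cycle.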
Branch and subroutine call prediction Each core accesses its own branch predictor and BTB. The branch predictor and BTB are indexed so as to achieve maximum utilization while retaining simplicity; the indexing scheme incurs no loss in prediction accuracy.
Branch prediction mechanism In each cycle, every core that predicts a taken branch, as well as any core that detects a branch misprediction, sends the new target PC to the FMU. The FMU selects the correct PC by giving priority to the oldest misprediction-redirect PC first and the youngest branch-prediction PC last. On a misprediction, misspeculated instructions are squashed in all cores.
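The FMU's arbitration rule can be sketched as a small selection function. The message format and names below are illustrative assumptions; the priority order (mispredictions beat predictions, older beats younger in program order) follows the slide.

```python
# Illustrative sketch of FMU target-PC arbitration. Each message is assumed
# to carry its program-order age, its kind, and the new target PC.
def select_redirect(messages):
    """Pick the winning redirect PC for this cycle: any misprediction
    outranks a taken-branch prediction; among equals, the oldest
    (smallest program-order age) wins."""
    mispredicts = [m for m in messages if m["kind"] == "mispredict"]
    candidates = mispredicts if mispredicts else messages
    if not candidates:
        return None  # no redirect this cycle
    return min(candidates, key=lambda m: m["age"])["target_pc"]
```

For example, a taken-branch prediction from one core is overridden whenever any other core reports a misprediction, regardless of relative age.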
Branch prediction mechanism Core 2 predicts branch B to be taken. After two cycles, all cores receive this prediction; they squash overfetched instructions and adjust their PCs.
Global History Register (GHR) Independent, uncoordinated history registers on each core may make it impossible for the branch predictor to learn branch correlations. Solution: the GHR is replicated across all cores and updates are coordinated through the FMU.
Return Address Stack The target PC of a subroutine call is sent to all cores by the FMU. Core zero pushes the return address onto its RAS. When a return instruction is encountered and communicated to the FMU, core zero pops its RAS and communicates the return address back through the FMU.
Handling Fetch Stalls To preserve correct fetch alignment, all fetch engines must stall when a fetch stall is encountered by any one core. To accomplish this, cores communicate stalls to the FMU, which in turn informs the other cores. Once all cores have been informed, they simultaneously discard any overfetched instructions, and fetching resumes in sync from the correct PC.
Collective Decode/Rename After fetch, each core pre-decodes its instructions independently. A Steering Management Unit (SMU) is used to rename all instructions in the fetch group. The SMU contains a global steering table that tracks the mapping of architectural registers to cores.
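The global steering table's role can be sketched as a minimal mapping structure. This is a simplified model under assumed names; the real SMU also performs renaming and instruction steering heuristics not shown here.

```python
# Minimal sketch of an SMU-style global steering table: it records which
# core produces each architectural register so later consumers can locate
# their producers. Class and method names are illustrative assumptions.
class SteeringTable:
    def __init__(self):
        self.reg_to_core = {}  # architectural register -> producing core

    def steer(self, instr, core):
        """Record that `core` will produce `instr`'s destination register."""
        if instr.get("dst") is not None:
            self.reg_to_core[instr["dst"]] = core

    def locate(self, reg):
        """Which core holds the latest value of `reg` (None if unmapped)."""
        return self.reg_to_core.get(reg)
```

A consumer instruction steered to a different core than its producer is what triggers the copy-out/copy-in operand traffic described on the back-end slides.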
Back-end modifications to achieve core fusion Wake-up and selection Reorder buffer and commit support Load/Store queue organization
Wake-up and selection To support operand communication, copy-out and copy-in queues are added to each core. When copy instructions reach the consumer core, they are placed in a FIFO copy-in queue. Every cycle, the scheduler considers the two copy instructions at the head, along with instructions in the conventional issue queue. Once issued, copies wake up their dependent instructions and update the physical register file.
Reorder buffer and commit support If ROB 1's head instruction pair is not ready to commit, this is communicated to the other ROBs. Pre-commit and conventional heads are spaced so that the message arrives just in time. Upon completion of ROB 1's head instruction pair, a similar message is propagated, again arriving just in time to retire all four head instruction pairs in sync.
Load/Store queue organization In fused mode, a banked-by-address load/store queue (LSQ) implementation is adopted. This keeps data coherent without requiring cache flushes, and supports store forwarding and speculative loads. For loads, if a bank misprediction is detected, the load queue entry is recycled and the load is sent to the correct bank.
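The banked-by-address indexing and the misprediction recovery path can be sketched as follows. The bank count and address granularity are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch of a banked-by-address LSQ: the bank is selected by
# low-order effective-address bits above the byte offset, so all accesses
# to the same address always meet in the same bank (enabling forwarding).
NUM_BANKS = 4     # assumed: one LSQ bank per fused core
LINE_BYTES = 8    # assumed effective-address granularity

def lsq_bank(addr):
    """Bank that must hold the queue entry for this effective address."""
    return (addr // LINE_BYTES) % NUM_BANKS

def dispatch_load(predicted_bank, addr):
    """On a bank misprediction, recycle the entry and re-send the load
    to the correct bank; otherwise issue in place."""
    actual = lsq_bank(addr)
    if predicted_bank != actual:
        return ("recycle_and_resend", actual)
    return ("issue", actual)
```

Because the bank is a pure function of the address, two cores can never hold conflicting entries for the same location, which is why no cache flushes are needed on reconfiguration of the LSQ mapping.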
Dynamic Reconfiguration CMP support for dynamic reconfiguration to respond to software changes (e.g., dynamic multiprogrammed environments, or serial/parallel regions in a partially parallelized application) can greatly improve versatility, and thus performance. FUSE and SPLIT ISA instructions are used. FUSE: the application requests that cores be fused to execute a sequential region after executing parallel regions. SPLIT: in-flight instructions are allowed to drain and enough copy instructions are generated.
Performance Evaluation Simulations were performed on sequential, parallel, and evolving (partially parallelized) workloads.
Performance Analysis
Parallel application performance
Why self-optimization? Efficient utilization of off-chip DRAM bandwidth is a critical issue in designing cost-effective, high-performance CMP platforms. Conventional memory controllers deliver relatively low performance because they often employ fixed, rigid access scheduling policies designed for average-case application behavior. As a result, they cannot learn and optimize the long-term performance impact of their scheduling decisions, and cannot adapt their scheduling policies to dynamic workload behavior.
Reinforcement Learning (RL) Reinforcement learning is a field of machine learning that studies how autonomous agents situated in a stochastic environment can learn optimal control policies through interaction with their environment. RL provides a general framework for high-performance, self-optimizing memory controller design. The memory controller is designed as an RL agent whose goal is to learn an optimal memory scheduling policy automatically via interaction with the rest of the system.
Advantages of an RL-based memory controller An RL-based memory controller takes parts of the system state as input and considers the long-term performance impact of each action it can take. It anticipates the long-term consequences of its scheduling decisions, and continuously optimizes its scheduling policy based on this anticipation. It utilizes experience learned in previous system states to make good scheduling decisions in new, previously unobserved states. It adapts to dynamically changing workload demands and memory reference streams.
RL-based DRAM schedulers Each DRAM cycle, the scheduler examines valid transaction queue entries and maximizes DRAM utilization by choosing the command with the highest expected long-term performance benefit. The scheduler first derives a state-action pair for each candidate command under the current system state and uses this information to calculate the corresponding Q-values. It implements its control policy by scheduling the command with the highest Q-value each DRAM cycle.
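The scheduling loop above can be sketched as tabular Q-learning. This is a simplified illustration: the state encoding, reward, and hyperparameters below are assumptions, and the actual controller uses hardware-friendly function approximation rather than a full table.

```python
# Illustrative sketch of an RL-based DRAM command scheduler: each cycle,
# pick the legal command with the highest Q-value for the current state,
# then nudge Q toward the observed reward (a standard Q-learning step).
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.95   # learning rate and discount factor (assumed)

Q = defaultdict(float)     # (state, command) -> estimated long-term value

def schedule(state, legal_commands):
    """Greedy policy: issue the command with the highest Q-value."""
    return max(legal_commands, key=lambda cmd: Q[(state, cmd)])

def update(state, cmd, reward, next_state, next_legal):
    """One Q-learning step after observing the command's outcome."""
    best_next = max((Q[(next_state, c)] for c in next_legal), default=0.0)
    Q[(state, cmd)] += ALPHA * (reward + GAMMA * best_next - Q[(state, cmd)])
```

The key property mirrored here is that the update propagates long-term consequences (via the discounted `best_next` term) back into the value of earlier scheduling decisions, rather than scoring each command in isolation.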
Performance Evaluation Performance comparison of in-order, FR-FCFS, RL-based, and optimistic memory controllers.
DRAM bandwidth utilization evaluation Comparison of DRAM bandwidth utilization of in-order, FR-FCFS, RL-based, and optimistic controllers.
Applications of RL in computer systems Autonomic resource allocation decisions in data centers. Autonomous navigation and flight, helicopter control. Dynamic channel assignment in cellular networks. Processor and memory allocation in data centers. Routing in ad-hoc networks.
Performance Review For a 4-core CMP with a single-channel DDR2-800 memory subsystem (6.4 GB/s peak bandwidth), the RL-based memory controller improves the performance of a set of parallel applications by 19% and DRAM bandwidth utilization by 22% over a state-of-the-art FR-FCFS scheduler. For a dual-channel subsystem, the RL-based scheduler delivers an additional 14% performance improvement, reducing the performance gap between the single-channel configuration and a dual-channel DDR2-800 subsystem with twice the peak bandwidth.
Conclusions Core fusion allows relatively simple CMP cores to dynamically fuse into larger, more powerful processors. It accommodates software diversity gracefully and dynamically adapts to changing workload demands. Core fusion adopts complexity-effective solutions for fetch, rename, execution, cache access, and commit. The RL-based, self-optimizing memory controller continuously and automatically adapts its DRAM scheduling policy based on its interaction with the system to optimize performance, and efficiently utilizes the DRAM memory bandwidth available in a CMP.
Questions?
Thank you!