FPGA-Accelerated Instrumentation

1 PROTOFLEX: FPGA-Accelerated Instrumentation
Michael K. Papamichael, Eric S. Chung, James C. Hoe, Babak Falsafi, Ken Mai
{echung, jhoe, babak,
PROTOFLEX Computer Architecture Lab at
19-Aug-2008
Our work in this area has been supported in part by NSF, IBM, Intel, and Xilinx.

2 The Simulation Bottleneck
Performance Simulation via Simulation Sampling
- Perf. measurements by sampling small segments of execution
- Execution timeline: Warming (functional simulation), Checkpoints (i.e. system state snapshots), Detailed Warm-up (cycle-accurate simulation), Measurement (cycle-accurate simulation)
- Warming: Long & NOT Parallelizable
- Detailed Warm-up and Measurement: Short & Parallelizable!
- Speed of cycle-accurate simulator inconsequential
Warming is the Real Bottleneck
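To make the argument on this slide concrete, here is a minimal sketch of the wall-clock cost of a sampled simulation. It assumes warming runs once and serially, while the per-checkpoint detailed work is spread evenly over parallel hosts. The function name and all numbers are hypothetical and do not come from the talk.

```python
def sampled_sim_time(warming_s, detailed_s_per_sample, samples, hosts):
    """Estimate wall-clock seconds for one sampled simulation run.

    Assumption (not from the slides): warming is a single serial pass that
    produces checkpoints; each checkpoint's detailed warm-up plus measurement
    takes detailed_s_per_sample seconds and runs on any free host.
    """
    rounds = -(-samples // hosts)  # ceiling division: batches of parallel samples
    return warming_s + rounds * detailed_s_per_sample


# Hypothetical numbers, for illustration only.
serial = sampled_sim_time(1000, 10, 100, hosts=1)
parallel = sampled_sim_time(1000, 10, 100, hosts=100)
print(serial, parallel)
```

With enough hosts the detailed term shrinks toward a single sample, so the serial warming term is what remains to be accelerated.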

3 Faster Simulation w/ FPGAs
Warming requires
- Full-system functional simulator (e.g. Simics)
- Instrumentation (e.g. functional cache model)
SW -> HW?
- SW-based: Simics (16-cpu) with CMP Cache Model and Branch Predictor Model
- HW-based: BlueSPARC (16-cpu) with CMP Cache Model and Branch Predictor Model
Instrumented HW Simulator -> Fast Warming

4 HW vs. SW Simulation Performance
[Chart: simulation performance of BlueSPARC, Simics-fast, BlueSPARC w/ instrumentation, and Simics w/ instrumentation]
Speedup from SW to HW WITH instrumentation: 37x

5 Outline
- BlueSPARC Simulator (1-slide review)
- FPGA-Accelerated Instrumentation
  - CMP Cache Simulator
  - Branch Predictor Simulator
- Design Experiences & Future Work
[Diagram: BlueSPARC with CMP Cache Model and Branch Predictor Model]

7 BlueSPARC Simulator
Full-system HW-based Simulator
- Models 16-cpu UltraSPARC III server
- Can boot OS, run commercial apps
Virtualization Techniques
1. Hybrid Full-System Simulation
2. Multiprocessor Host Interleaving
[Diagram labels: CPU, Memory, Devices, 4-way, Common-case behaviors, Uncommon behaviors]

9 CMP Cache Model
Piranha-like CMP Cache Hierarchy
- Private L1 I&D Caches
- Single Shared L2 Cache (Victim Cache)
- L1 coherence maintained through directory in L2
Target Cache Model
- Multiple concurrent memory refs
- Directory for coherence
Virtualized Cache Model
- Memory refs serialized
- Parallel L1 accesses for coherence
[Diagram: target model with L1s, Shared L2, and Directory; virtualized model with L1s and Shared L2]

10 Architecture
FPGA-Accelerated CMP Cache Simulator
[Block diagram labels: Memory Refs; L1 I&D Caches (Instruction Caches, Data Caches, 2-way L1 caches); L2 Cache (8 ways, 8-way pseudo-LRU, 8-way L2 cache); Cache Contents; Statistics]

11 Implementation Details
- 100MHz on BEE2 board
- 2500 lines of fully parameterized Verilog
  - Parameters: # CPUs, L1/L2 dimensions, # ways, etc.
- Purely functional model
  - No timing info
  - Only tags + status bits stored and updated
FPGA Resource Usage (Virtex-II Pro 70)
  Configuration          LUTs         BRAMs
  64KB L1s - 4MB L2      7483 (11%)   134 (40%)
  128KB L1s - 16MB L2    7277 (11%)   292 (89%)
Limitations
- FPGA resource usage dominated by on-chip memory
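The "only tags + status bits" idea can be illustrated in software with a minimal sketch of a functional set-associative cache. This is not the Verilog design from the talk: the class name, the 64-byte line size, and the use of true LRU (instead of the pseudo-LRU mentioned for the L2) are assumptions made for illustration.

```python
from collections import OrderedDict


class TagOnlyCache:
    """Functional set-associative cache: tags and hit/miss counts, no data, no timing."""

    def __init__(self, size_bytes, ways, line_bytes=64):  # line size is an assumption
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (ways * line_bytes)
        # One OrderedDict per set: key = tag, insertion order = recency (oldest first).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Look up addr, update replacement state and statistics; return True on a hit."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return True
        if len(entries) >= self.ways:
            entries.popitem(last=False)   # evict the least recently used tag
        entries[tag] = True               # stand-in for the status bits
        self.misses += 1
        return False


# A 64KB, 2-way L1 as in the smaller configuration on the slide.
l1 = TagOnlyCache(64 * 1024, ways=2)
for addr in (0, 32768, 0, 65536, 32768):  # all five addresses map to set 0
    l1.access(addr)
print(l1.hits, l1.misses)
```

Because no data array is kept, the storage cost scales with the number of lines rather than the cache capacity in bytes, which is consistent with the slide's note that on-chip memory dominates resource usage.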

13 Branch Predictor Model
Typical 2-level Branch Predictor
- Meta predictor selects Bimodal or Gshare predictor
- 8-way Branch Target Buffer
- 16 BTBs (one per cpu) too large for BEE2 FPGA
Target BP Model
- One BTB per CPU
Virtualized BP Model
- Single Shared BTB for all CPUs
[Diagram: per-CPU Meta/Bimodal/Gshare predictors, each with its own BTB (target) vs. a Single Shared BTB (virtualized)]

14 Multiple BTBs vs. Single BTB
OK to use single BTB?
- Generally no, but OK for warming of homogeneous workloads
[Chart: Separate BTBs vs. Single BTB (16K-entry, 8-way); y-axis: Overall Prediction Accuracy (%); workloads: db2, oracle, apache, dss, em3d, ocean; series: Separate BTBs, Single BTB]
Single BTB achieves same accuracy as multiple BTBs

15 Implementation Details
- 100MHz on BEE2 board
- 700 lines of fully parameterized Bluespec
  - Parameters: # CPUs, Predictor Sizes, BTB Size/Associativity
Realistic Prototype Configuration
- 16 CPUs
- 8K-entry Meta, 32K-entry Bimodal, 8K-entry Gshare
- Single shared 16K-entry 8-way BTB
FPGA Resource Usage (Virtex-II Pro 70)
- LUTs: 3938 (5%)
- BRAMs: 193 (58%)
Limitations
- Single shared BTB may not perform accurately for all workloads
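As a rough illustration of the single shared BTB, the sketch below models a 16K-entry, 8-way buffer that every CPU reads and writes. The LRU replacement policy, the PC-to-set mapping, and the names are assumptions; the slides do not specify them.

```python
from collections import OrderedDict


class SharedBTB:
    """Set-associative branch target buffer shared by all simulated CPUs."""

    def __init__(self, entries=16 * 1024, ways=8):
        self.ways = ways
        self.num_sets = entries // ways
        # One OrderedDict per set: key = tag, value = target, order = recency.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def _locate(self, pc):
        word = pc >> 2                      # assumes 4-byte aligned instructions
        return self.sets[word % self.num_sets], word // self.num_sets

    def lookup(self, pc):
        """Return the predicted target for pc, or None on a BTB miss."""
        entries, tag = self._locate(pc)
        if tag in entries:
            entries.move_to_end(tag)        # refresh recency on a hit
            return entries[tag]
        return None

    def update(self, pc, target):
        """Record the resolved target of a taken branch at pc."""
        entries, tag = self._locate(pc)
        if tag not in entries and len(entries) >= self.ways:
            entries.popitem(last=False)     # assumed policy: evict least recently used
        entries[tag] = target
        entries.move_to_end(tag)


btb = SharedBTB()
btb.update(0x4000, 0x5000)                  # learned while simulating one CPU
print(btb.lookup(0x4000))                   # visible to every other CPU
```

When all CPUs run the same code, an entry learned for one CPU serves the others, which fits the slides' claim that sharing is acceptable for warming homogeneous workloads. CPUs running different code would compete for the same sets.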

17 Design Experiences
- Identify opportunities for simpler designs
  - Virtualization reduces resource requirements/complexity
  - Less-constrained functional simulation environment
  - Think about specific requirements of application
- Efficient mapping to FPGA resources is crucial
  - Reorganizing the cache modules allowed for 2x larger designs
- Existence of SW reference design is important
  - Reduces design time
  - Simplifies verification
- Bluespec reduces design complexity

18 Future Work
Other Instrumentation Applications
- Software Monitoring/Analysis, e.g. debugging, performance tuning, instruction set profiling
- Rapid Exploration of new Architectures
  - Simple functional models for first-order perf. results
  - Detailed cycle-accurate models for high-fidelity simulation
- SW Developer/Educational Tool
  - Real-time viewing of system state and statistics (Check out our DEMO)
Future Directions
- Scale number of CPUs
- Augment simulation models with timing extensions

19 Demo
Web-based Real-time Viewing of Statistics

20 Thanks! Any questions?
Acknowledgements: We would like to thank our colleagues in the RAMP and TRUSS projects.
