RTLCheck: Verifying the Memory Consistency of RTL Designs

RTLCheck: Verifying the Memory Consistency of RTL Designs ABSTRACT Yatin A. Manerkar Daniel Lustig Margaret Martonosi Michael Pellauer Princeton University NVIDIA {manerkar,mrm}@princeton.edu {dlustig,mpellauer}@nvidia.com Paramount to the viability of a parallel architecture is the correct implementation of its memory consistency model (MCM). Although tools exist for verifying consistency models at several design levels, a problematic verification gap exists between checking an abstract microarchitectural specification of a consistency model and verifying that the actual processor RTL implements it correctly. This paper presents RTLCheck, a methodology and tool for narrowing the microarchitecture/rtl MCM verification gap. Given a set of microarchitectural axioms about MCM behavior, an RTL design, and user-provided mappings to assist in connecting the two, RTLCheck automatically generates the SystemVerilog Assertions (SVA) needed to verify that the implementation satisfies the microarchitectural specification for a given litmus test program. When combined with existing automated MCM verification tools, RTLCheck enables test-based full-stack MCM verification from high-level languages to RTL. We evaluate RTLCheck on a multicore version of the RISC-V V-scale processor, and discover a bug in its memory implementation. Once the bug is fixed, we verify that the multicore V-scale implementation satisfies sequential consistency across 56 litmus tests. The JasperGold property verifier finds complete proofs for 89% of our properties, and can find bounded proofs for the remaining properties. CCS CONCEPTS Computer systems organization Parallel architectures; Hardware Functional verification; Assertion checking; KEYWORDS Memory consistency models, automated verification, RTL, SVA ACM Reference format: Yatin A. Manerkar, Daniel Lustig, Margaret Martonosi, and Michael Pellauer. 2017. RTLCheck: Verifying the Memory Consistency of RTL Designs. In Proceedings of MICRO-50, Cambridge, MA, USA, October 14 18, 2017, 14 pages. https://doi.org/10.1145/3123939.3124536 1 INTRODUCTION Memory consistency models (MCMs) specify the values that can be legally returned by load instructions in a parallel program, and Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. MICRO-50, October 14 18, 2017, Cambridge, MA, USA 2017 Association for Computing Machinery. ACM ISBN 978-1-4503-4952-9/17/10... $15.00 https://doi.org/10.1145/3123939.3124536 hence they form a critical and fundamental part of the specification of any shared-memory architecture. An ISA-level MCM serves as both a target for compilers and a specification for hardware to implement. Weaker ISA-level MCMs allow a variety of reorderings of memory operations in an effort to improve performance [6, 25], while implementations of stronger MCMs may use complicated speculation to retain performance while still enforcing stronger ordering [9, 15]. Regardless of the choice of ISA-level MCM, a hardware implementation of a multicore architecture must satisfy its MCM in all possible cases, or the correct execution of parallel programs on that architecture cannot be guaranteed. With concurrent systems becoming both more prevalent and more complex, MCM-related bugs are more common than ever. Intel processors have experienced two transactional memory bugs in recent years [23, 52]. In another case, ambiguity about the ARM ISA-level MCM specification led to a bug where neither hardware nor software was enforcing certain orderings [4, 5]. With no cost-effective hardware fix available, ARM had to resort to a low-performance compiler fix instead. MCM-related issues have also surfaced in concurrent GPU programs [3, 45]. Finally, the computer security community has recognized that MCM bugs may lend themselves to security exploits [22]. Verifying an MCM s correct specification and implementation is increasingly critical given its fundamental importance. System verification is a general challenge, with verification costs now dominating total hardware design cost [20]. MCM verification is particularly challenging since it requires analysis across many possible interleavings of events. Testing all possible combinations of inputs to a system is infeasible, and dynamic testing of a design in simulation will by definition be incomplete and not capture all possible interleavings, even for the tested programs. True completeness in MCM verification would entail spanning the stack from high-level languages (HLLs), through compiler mappings, the ISA, and ultimately to underlying hardware implementations. Prior work has enabled tractable automated memory model verification for provided litmus tests 1 from HLLs to axiomatic microarchitectural specifications [4, 31, 32, 35, 49], but not deeper down to RTL. Meanwhile, more generic non-mcm RTL verification is also maturing [41, 47, 48], often based on languages for specifying temporal assertions such as Property Specification Language (PSL) [26] and SystemVerilog Assertions (SVA) [27]. There remains, however, a gap between early-stage microarchitectural MCM verification [31] and lower-level RTL verification. If axiomatic microarchitectural MCM verification could be linked to the temporal verification approaches of RTL, this link would allow full-stack HLL to RTL checking of a hardware and software system s memory ordering behavior. It 1 Small programs used to test MCM implementations.

MICRO-50, October 14 18, 2017, Cambridge, MA, USA would also encourage the use of early-stage (e.g., microarchitecturelevel) MCM verification techniques by enabling and facilitating their subsequent testing against RTL designs when available. To fill the MCM verification gap between microarchitecture and RTL, this paper proposes RTLCheck, a methodology and tool for checking that microarchitecture-level axioms are upheld by underlying RTL across a suite of litmus tests. Given an axiomatic microarchitectural specification, an RTL design, and a mapping between the axiomatic representation s abstract primitives and the corresponding RTL signals and values, RTLCheck automatically generates SystemVerilog Assertions (on a per-test basis) and inserts them into the RTL design. These assertions lower the abstract microarchitectural axioms into concrete temporal assertions and assumptions that help establish that the design s memory ordering behavior meets the microarchitectural specification. RTLCheck then uses JasperGold [12] (a commercial RTL verification tool from Cadence Design Systems, Inc.) to check these assertions. The results of this verification say whether the asserted properties have been proven for a given test, whether they have been proven for the test up to a bounded number of cycles, or if a counterexample (execution trace that does not satisfy the property) has been found. If a counterexample is found, a discrepancy exists between the microarchitectural specification and RTL, likely due to a bug in the implementation. RTLCheck s approach is incomplete in the sense that it only ensures that the litmus tests verified actually run correctly, rather than checking all programs. In other words, it is possible for a bug to still exist in the design even if the design has been verified across a suite of litmus tests. Nonetheless, automated litmus test-based approaches like RTLCheck have the benefit of concentrating verification effort on the scenarios most likely to exhibit bugs, and are viewed as effective and efficient tools commonly included in MCM verification approaches [4, 24, 31]. This paper demonstrates RTLCheck s usage on a multicore version of the RISC-V V-scale open-source processor design 2 [42]. In doing so, we discover a bug in the V-scale processor s memory implementation. After fixing the bug, we use RTLCheck to show that our multicore V-scale RTL satisfies a set of microarchitectural axioms that are sufficient to guarantee sequential consistency for 56 litmus tests. Overall, the contributions of our work are as follows: We present an automated flow from axiomatic microarchitectural ordering specifications to temporal assertions about RTL design operation for a variety of litmus tests. The automated axiomatic-to-temporal translation is complicated by the inherent differences between the two approaches. The generated assertions are inserted into RTL to support MCM verification of the RTL, narrowing the verification gap between microarchitecture-level MCM checkers and underlying RTL. Our work demonstrates that using RTLCheck s automatically generated assertions in RTL verification can be efficient and tractable. In particular, RTLCheck s assertion and assumption generation phase takes just seconds. The subsequent property verifier discovers complete proofs (i.e. true for all possible traces of a given litmus test) for 89% of the 2 V-scale was deprecated in the time between this work s submission and publication [34], but remains an interesting case study. Y. A. Manerkar et al. Core 0 Core 1 Core 2 Core 3 IF DX WB IF DX WB Arbiter Memory IF DX WB IF DX WB Figure 1: The Multi-V-scale processor: a simple multicore processor with four three-stage in-order pipelines. The arbiter allows only one core to access memory at a time. Core 0 Core 1 (i1) [x] 1 (i3) r1 [y] (i2) [y] 1 (i4) r2 [x] Under SC: Forbid r1=1, r2=0 Figure 2: Code for litmus test mp generated SVA properties in 11 hours of runtime, and can generate bounded proofs (i.e. true for all possible test traces up to a certain number of cycles) for the remaining properties. RTLCheck can also be used for iterative verification, allowing implementers to refine their design and its specification with respect to meeting MCM requirements. Our method is the first technique for RTL MCM checking that applies generally to an arbitrary Verilog design. Other work has formally proven the MCM correctness of specific RTL designs [51] for all programs. Our work is complementary: it proves correctness for litmus tests rather than all programs, but does not impose restrictions on the RTL design to be verified. Furthermore, it supports arbitrary ISA-level MCMs, including ones as sophisticated as x86-tso, ARM, and IBM Power [6, 25, 28]. With the link from microarchitecture to RTL covered by RTLCheck, the Check suite [33] can now support MCM verification from HLLs (C11, etc.) through compiler mappings, the OS, ISA, and microarchitecture, all the way down to RTL. 2 MOTIVATING EXAMPLE 2.1 Microarchitectural Verification Background Figure 1 shows the Multi-V-scale processor, a simple multicore where each core has a three-stage in-order pipeline. Instructions in these pipelines first go through the Fetch (IF) stage, then a combined Decode-Execute (DX) stage, and finally a Writeback (WB) stage where data is returned from memory (for loads) or sent to memory (by stores). An arbiter enforces that only one core can access data memory at any time. The read-only instruction memory

RTLCheck: Verifying the Memory Consistency of RTL Designs MICRO-50, October 14 18, 2017, Cambridge, MA, USA (i1) St [x], 1 Core 0 Core 1 (i2) St [y], 1 Fetch DecodeExecute Writeback (i3) Ld [y] = 1 (i4) Ld [x] = 0 (a) µhb graph for the SC-forbidden outcome of Figure 2 s mp litmus test on Figure 1 s processor. The cycle in this graph shows that this scenario is correctly unobservable at the microarchitecture level. Axiom "WB_FIFO": forall microops "a1", "a2", (OnCore c a1 /\ OnCore c a2 /\ ~SameMicroop a1 a2 /\ ProgramOrder a1 a2) => EdgeExists((a1,DX)), (a2,dx))) => AddEdge((a1,WB)), (a2,wb))). (b) Axiom expressing that the WB stage should be FIFO. always @(posedge clk) begin if (reset (stall_dx & ~stall_wb)) begin // Pipeline bubble PC_WB <= `XPR_LEN'b0; store_data_wb <= `XPR_LEN'b0; alu_out_wb <= `XPR_LEN'b0; end else if (~stall_wb) begin //Update WB pipeline registers PC_WB <= PC_DX; store_data_wb <= rs2_data_bypassed; alu_out_wb <= alu_out; csr_rdata_wb <= csr_rdata; dmem_type_wb <= dmem_type; end end (c) Verilog RTL responsible for updating WB pipeline registers. Figure 3: Illustration of the verification gap between microarchitectural axioms and underlying RTL. The axiom in (b) states that the processor WB stage should be FIFO. Sets of such axioms can be used to enumerate families of µhb graphs such as the one in (a). RTLCheck translates axioms such as (b) to equivalent temporal properties at RTL. These properties can be verified to ensure that Verilog such as that in (c) upholds the microarchitectural axioms for a given test. (not shown) is concurrently accessed by all cores. This processor is simple enough that it appears to implement sequential consistency 3 (SC) [29]. As a case study in this paper, our goal is to check that its 3 In SC, execution results are consistent with there being a total order on all memory operations where each load returns the value of the last store to the same address in this total order. RTL (Verilog) correctly satisfies SC for the litmus tests we examine. (Section 5 provides further details about the Multi-V-scale design.) One litmus test to check for SC is the message-passing (mp) litmus test shown in Figure 2. A multiprocessor that implements SC will forbid the outcome of r1 = 1, r2 = 0 for this code, as there is no SC order on memory operations that allows this outcome. Our approach needs to check that a provided RTL design will never produce executions exhibiting this forbidden outcome. The Check suite [33] has developed methods for conducting microarchitectural MCM verification by exhaustively enumerating and checking microarchitectural happens-before (µhb) graphs that represent all possible executions for a given litmus test [31, 32, 35, 49]. RTLCheck extends the microarchitectural verification of µhb graphs to the level of RTL. Figure 3a shows an example µhb graph for the mp litmus test running on two cores, each with the three-stage pipeline mentioned above. Each column of nodes represents a different instruction in the litmus test. The left two columns correspond to i1 and i2 executing on core 0. The right two columns correspond to i3 and i4 executing on core 1. Nodes in µhb graphs represent the individual microarchitectural events or pipeline stages in a particular instruction s execution on that microarchitecture. For example, the node in the bottom-right of the graph represents instruction i4 at its Writeback stage, while the left-most node in the second row represents instruction i1 at its Decode-Execute stage. Edges between µhb nodes represent known happens-before relationships between those nodes, and are added based on ordering axioms that the designer specifies a correct microarchitecture should respect. As each edge represents a happens-before relationship, a cycle in the graph indicates that the depicted scenario cannot occur. This is because a cycle would imply that an event must happen before itself, which is impossible. Thus, the strategy with verification based on µhb graphs is to consider and cycle-check all possible scenarios. If the specification indicates that a particular outcome should be forbidden, then all µhb graphs for that test outcome on that microarchitecture must be cyclic. Many µhb graphs are typically possible for each litmus test and microarchitecture, since parallel programs generally have many possible executions. Ordering rules specifying which edges are added and when are described in terms of axioms such as the one in Figure 3b. This axiom is written in µspec, a first-order logic-based modeling language used by the Check suite [33]. This axiom states that if the DX stage of an instruction a1 happens before the DX stage of an instruction a2 that is later in program order on the same core, then the WB stage of a1 must also happen before the WB stage of a2. For mp, under SC, an outcome of r1 = 1, r2 = 0 should be forbidden on the processor from Figure 1. The microarchitectural analysis to verify this is performed as follows. First, for this processor s in-order pipelines, instructions proceed through the DX and WB stages in the order in which they were fetched. This behavior is represented by edges in the graph, including those between the WB stages of i1 and i2, as well as between the WB stages of i3 and i4 (both added by the WB_FIFO axiom in Figure 3b). Other axioms can specify other ordering relationships that the designer stipulates for the implementation. For example, in order for i4 to return a value of 0 for its load of x, it must complete its Writeback

MICRO-50, October 14 18, 2017, Cambridge, MA, USA Y. A. Manerkar et al. stage before the store of x in i1 reaches its Writeback stage and goes to memory, thus overwriting the old value of 0. This ordering is represented by the red edge from i4 s Writeback node to i1 s Writeback node in Figure 3a. Likewise, in order for i3 s load of y to return a value of 1, it must occur after the store of y in i2 reaches the Writeback stage, shown by another red edge between those two nodes. The combination of the four thicker red and blue edges creates a cycle in the graph. Thus, this execution of mp is (correctly) not observable. 2.2 From Microarchitectural Models to RTL The microarchitectural verification conducted by tools such as the Check suite is only valid if each of the individual ordering axioms is actually upheld by the underlying RTL. Thus, there is a need to verify that these axioms are respected by the RTL of the design. If they are not, then any microarchitectural verification assumes incorrect orderings and is invalid. The key contribution of RTLCheck lies in verifying (on a per-test basis) that a given RTL design actually upholds the microarchitecturelevel ordering axioms as specified. This verification is difficult due to the semantic mismatches between axiomatic µspec microarchitecture definitions and the temporal assertions used to verify RTL. Section 3 describes this semantic mismatch in greater detail, and then Section 4 discusses our approaches for overcoming these challenges. A key contribution of this paper is to identify an approach to writing µspec that is synthesizable to SVA, much as previous work has spent effort to identify subsets of Verilog that are synthesizable to actual circuits. We expect that in the future, µspec axioms will become more tailored to restrict themselves more carefully to rules that are in fact synthesizable to SVA. Returning to our example, Figure 3c shows the portion of the processor s Verilog RTL that updates the WB pipeline registers from the DX pipeline registers. The Verilog makes it appear as though instructions do indeed move to WB in the order in which they entered DX, but in general, it is challenging to tell from inspection whether this is always the case. For instance, what if an interrupt occurs between the two instructions? The axioms to be verified can be notably more complex than the one in Figure 3b. In such cases, without automated verification like RTLCheck, it is harder, if not impossible, to tell whether they are always upheld for a given test. 3 THE SEMANTIC GAP BETWEEN µhb GRAPHS AND TEMPORAL LOGIC Axiomatic and temporal models can both be used to verify that systems correctly satisfy MCMs, but they do so in starkly different ways. These differences lead to challenges when translating between the two, which RTLCheck needs to do in its translation of µspec to SVA. This section highlights the challenges faced when translating between the axiomatic world of µspec and the temporal world of SVA. We begin by providing some background on the differences between axiomatic and temporal approaches. 3.1 Axiomatic/Temporal Differences Consider an abstract machine atomic_mach which performs instructions atomically and in program order. Figure 4 shows the verification of the mp litmus test from Figure 2 on atomic_mach using both axiomatic (Figure 4a) and temporal (Figure 4b) models of the machine. This verification aims to check whether mp s forbidden outcome (r1=1,r2=0) is possible on atomic_mach. Axiomatic verification conceptually generates all possible executions and then checks each one for correctness. In the case of mp, there are four possible executions, each with a different outcome, as shown in Figure 4a. To check an execution for correctness, each axiom is applied to the execution as a whole. An axiom has omniscience of the execution it is being applied to; in other words, it can see all events in the execution. The axiom for SC on atomic_mach checks that an execution does not contain cycles in the combination of the po (program order), rf (reads-from), fr (from-reads), and co (coherence order) relations. Figure 4a shows such a cycle in the execution at its bottom-right, violating the axiom. The cycle indicates that this execution is impossible on atomic_mach. Figure 4a depicts this by a blue strike through the execution. Since axiomatic approaches examine executions as a whole, they can also efficiently reason about the outcome of an execution. This makes it simple to exclude executions on the basis of their outcome. The remaining three executions in Figure 4a have outcomes distinct from the forbidden outcome under test (r1=1,r2=0), so they can be excluded from consideration when checking for the presence of the forbidden outcome. This exclusion is shown in Figure 4a by the dashed red strikes through these executions. There are no remaining executions that satisfy both the axiom for SC and the outcome under test. This indicates that the forbidden outcome of mp is correctly unobservable on atomic_mach. Temporal approaches, meanwhile, conceptually generate and check executions step-by-step. As Figure 4b shows, temporal verification generates all possible first steps from a start state (the black dot). From each valid first step that is generated, the verifier generates all possible next steps, and so on, generating a tree of executions. For atomic_mach in Figure 4b, a step constitutes performing a single instruction atomically, while at the level of RTL, a single step is a clock cycle. Executions correspond to paths from the start state through the tree. These paths can represent partial executions of the test (such as execution p in Figure 4b), which perform only some of the instructions of the test. They can also represent full executions of the test which perform all test instructions (such as execution f in Figure 4b). The temporal properties required for SC on atomic_mach are listed in Figure 4b, and are noticeably different from the corresponding axiomatic description. Together, the three properties enforce that instructions perform in program order, and that loads read the value of the last store to their address. Temporal properties are specified in terms of steps rather than executions, and are thus applicable to both partial and full executions. If a step generated by the execution tree contravenes one of the properties of the model, that branch of the tree is eliminated as being impossible on atomic_mach. For instance, performing i2 from the start state violates property 1 in Figure 4b, as indicated by the corresponding blue strike-through. Unlike in the axiomatic case, filtering outcomes is quite difficult for temporal methods. Temporal assumptions can be used to constrain the explorations of a temporal verifier to the outcome

RTLCheck: Verifying the Memory Consistency of RTL Designs MICRO-50, October 14 18, 2017, Cambridge, MA, USA Core 0 Core 1 (i1) [x] 1 (i2) [y] 1 (i3) r1 [y] (i4) r2 [x] Outcome: r1 = 0, r2 = 0 Core 0 Core 1 (i1) [x] 1 (i2) [y] 1 Axiom for SC: acyclic(po U rf U co U fr) (i3) r1 [y] (i4) r2 [x] Outcome: r1 = 1, r2 = 1 Core 0 Core 1 (i1) [x] 1 (i2) [y] 1 (i3) r1 [y] (i4) r2 [x] Outcome: r1 = 0, r2 = 1 Core 0 Core 1 (i1) [x] 1 (i3) r1 [y] po po (i2) [y] 1 rf fr(i4) r2 [x] Outcome: r1 = 1, r2 = 0 (a) Axiomatic verification of mp on the abstract machine atomic_mach. All possible executions are generated, and then each is checked as a whole. Executions with an outcome not under test are eliminated (dashed red strikes), as are executions that violate axioms (blue strikes). Temporal Properties for SC: 1. @step (Perform instr <i> -> instrs po-before i performed) 2. @step (Perform Ld <addr> -> Ld returns Mem[addr]) 3. @step (Perform St <addr> <val> -> Mem[addr] = val) (i2) [y] 1 (i3) r1 [y] = 0 (i1) [x] 1 (i2) [y] 1 (i3) r1 [y] = 1 (i4) r2 [x] = 1 Step 1 Step 2 Step 3 Step 4 (i4) r2 [x] = 0 Partial execution p Full execution f (b) Temporal verification of mp on the abstract machine atomic_mach. Executions are generated and checked step-by-step as a tree. Steps exhibiting an outcome not under test are eliminated (dashed red strikes), as are steps that violate properties (blue strikes). -> denotes temporal implication. Figure 4: Illustration of the differences between axiomatic and temporal approaches when used to check for the forbidden outcome of mp (r1=1,r2=0) on the abstract machine atomic_mach, which performs instructions atomically and in program order. The forbidden outcome is found to be unobservable in both cases. under test, but the step-by-step generation of executions makes it difficult for the verifier to check whether a given step causes the future violation of an assumption. Consider a temporal assumption for mp which enforces that the load of x returns 0 as per the outcome under test. Performing the store to x in i1 as step 1 makes it impossible for the load of x in i4 to return 0 as required by the assumption. (Doing so would now violate property 2.) However, the only way the temporal verifier can (conceptually) deduce this is to examine all possible paths from i1 to the end of the execution, and check if any of them contain a step where i4 returns 0. This check is computationally intensive in the general case. Due to this cost, many SVA verifiers (including JasperGold, the verifier we use in this paper) do not check for the future violation of assumptions in this manner, contrary to actual SVA requirements 4. Instead, they only guarantee that assumptions are not violated up to the present step. Returning to atomic_mach, the lack of a check for future violation of assumptions means that executions cannot be filtered by test outcome. Instead, branches of the execution tree that violate assumptions are only removed from consideration after the offending event actually occurs. For example, in the execution tree of Figure 4b, the branch where the load of x in i4 returns 1 is only eliminated when i4 actually occurs at step 4 (as indicated by the corresponding dashed red strike-through), even though this outcome is enforced by the prior occurrence of the store of 1 to x at step 1. Meanwhile, the step where i4 returns 0 in the displayed branch is impossible as it violates property 2 from Figure 4b. Since there is no complete execution that satisfies both the properties of the model and the outcome under test, the temporal verification deduces that the forbidden outcome of mp is correctly unobservable on atomic_mach. However, any properties of the model must still match partial executions like p, despite the fact that p cannot be 4 Checking for future violation of SV assumptions requires complicated liveness checks even for simpler safety properties. See Cerny et al. [14] for further details. extended to a full execution that satisfies both the model and the outcome under test. The rest of this section explains the specific semantic mismatches between axiomatic µspec and temporal SVA that are relevant to RTLCheck, and the following section explains how RTLCheck surmounts these challenges. 3.2 µspec Omniscience and Load Values Figure 5 shows a µspec axiom (and related macros) from the Multi- V-scale microarchitecture definition. This axiom splits the checking of load values into two categories: those which read from some write in the execution, and those which read from the initial value of the address in memory. The axiom is litmus-test-independent: it applies equally to any program that runs on the microarchitecture being modeled. However, for efficiency, the microarchitectural verification of the Check suite [31] uses omniscience about a proposed litmus test outcome to prune out all logical branches of the axiom which do not result in that outcome. Consider the application of this axiom to the load of x from mp, which returns 0 in the outcome under test. In µspec, the SameData w i predicate evaluates to true if instructions w and i have the same data in the litmus test outcome, while DataFromInitialStateAtPA i returns true if i reads the initial value of a memory location. For mp, the Check framework considers the entire execution at once (like the axiomatic verifier in Figure 4a), and evaluates all data-related predicates (highlighted in red in Figure 5) according to the outcome specified by the litmus test. For the load of x, the Check suite evaluates the SameData w i predicate in the NoInterveningWrite part of the axiom to false (as there is no write which stores 0 to x in mp). This causes the NoInterveningWrite /\ BeforeOrAfterEveryWrite portion of the Read_Values axiom to evaluate to false. Thus, the body of the Read_Values axiom is reduced to BeforeAllWrites, and Check knows from the start that it must add an edge indicating that Ld x @WB hb St x @WB (shown as one of the red edges in Figure 3a).

MICRO-50, October 14 18, 2017, Cambridge, MA, USA DefineMacro "BeforeAllWrites": DataFromInitialStateAtPA i /\ forall microop "w", ( (IsAnyWrite w /\ SameAddress w i /\ ~SameMicroop i w) => AddEdge ((i, Writeback), (w, Writeback), "fr", "red")). DefineMacro "NoInterveningWrite": exists microop "w", ( IsAnyWrite w /\ SameAddress w i /\ SameData w i /\ EdgeExists ((w, Writeback), (i, Writeback)) /\ ~(exists microop "w'", IsAnyWrite w' /\ SameAddress i w' /\ ~SameMicroop w w' /\ EdgesExist [((w, Writeback), (w', Writeback), ""); ((w', Writeback), (i, Writeback), "")])). Y. A. Manerkar et al. DefineMacro "BeforeOrAfterEveryWrite": forall microop "w", ( (IsAnyWrite w /\ SameAddress w i) => (AddEdge ((w, DecodeExecute), (i, DecodeExecute)) \/ AddEdge ((i, DecodeExecute), (w, DecodeExecute)))). Axiom "Read_Values": forall microops "i", OnCore c i => IsAnyRead i => ( ExpandMacro BeforeAllWrites \/ (ExpandMacro NoInterveningWrite /\ ExpandMacro BeforeOrAfterEveryWrite)). Figure 5: µspec axiom enforcing orderings and value requirements for loads on the Multi-V-scale processor. Predicates relevant to load values are highlighted in red. The edge referred to in Section 3.3 is highlighted in blue. The use of omniscience poses the first major difficulty when translating µspec to SVA. As explained in Section 3.1, SVA verifiers do not check for future violation of assumptions. A temporal assumption stipulating that the load of x returns 0 for mp would only take effect from the cycle where the load of x receives its value. Thus, nothing would stop the temporal verification from creating a partial execution where the store of x reached its WB stage before the load of x did so (similar to partial execution p from Figure 4b). Since temporal properties must match all partial execution traces, the translation of the Read_Values axiom from Figure 5 must match this partial execution. If the Check suite s omniscience-based axiom simplification is conducted before translation to SVA, the equivalent SVA property would simply be a translation of Before- AllWrites, which requires Ld x @WB hb St x @WB. Such a property would fail to match the partial execution, which has St x @WB hb Ld x @WB. This naive property would incorrectly report an RTL bug despite the design actually respecting microarchitectural orderings! In short, the limitations of SVA verifiers prevent RTLCheck from enforcing the specified outcome of a litmus test at RTL. Thus, RTL verifiers will check (partial) executions corresponding to all possible outcomes of the litmus test. To overcome this challenge, properties generated by RTLCheck that involve checking the value of a load need to account for all outcomes of the litmus test, not just its specified outcome. Section 4.2 describes how RTLCheck generates such properties from µspec axioms. 3.3 Concretizing Abstract µhb Edges An ordering edge src hb dest at the level of a µhb graph is a statement that src happens before dest for the execution in question. It says nothing about when src and dest occur in relation to other events in the execution, nor does it specify the duration of the delay between the occurrences of src and dest. A naive translation of the axiomatic edge to the temporal semantics of SVA would be to use the standard SVA mechanism of unbounded ranges (Section 4.3 discusses the mapping of the src and dest nodes themselves): ##[0:$] <src> ##[1:$] <dest> This SVA sequence (specification of signal values over multiple clock cycles) allows an initial delay of 0 or more cycles (##[0:$]) before the occurrence of src, since src may not occur at the first Core[0].DX Core[0].WB Core[1].DX Core[1].WB clk Core[1].LData St x St y St x St y Core[0].SData 0x1 0x1 2 3 4 5 6 7 Ld y Ld x Ld y 0x1 Ld x Figure 6: Example execution trace of mp on Multi-V-scale where Ld y returns 1 and Ld x returns 1. Signals relevant to the events St x @WB and Ld x @WB are underlined and in red. cycle in the execution. It also includes an intermediate delay of 1 or more cycles (##[1:$]) between src and dest, as the duration of the delay between src and dest is not specified by the microarchitectural model. Unfortunately, this standard mechanism is insufficient for checking that src does indeed happen before dest for all executions examined. Consider in isolation the edge in blue from Figure 5 in BeforeAllWrites enforcing that Ld x @WB hb St x @WB, with a constraint that the load of x must return 0. At the same time, consider the execution trace of mp in Figure 6 which reflects the outcome where St x @WB hb Ld x @WB and the load returns 1. (The relevant signal values are underlined and in red in Figure 6.) Since Figure 6 s execution has the events occurring in the opposite order and the load values are different, Figure 6 should serve as a counterexample to the property checking the edge from Before- AllWrites. However, if one simply uses the straightforward mapping above (i.e. ##[0:$] <Ld x=0 @WB> ##[1:$] <St x@wb>), Figure 6 is not a counterexample for the property! The reason that Figure 6 is not a counterexample is that the unbounded ranges can match any clock cycle, including those which 0x1

RTLCheck: Verifying the Memory Consistency of RTL Designs contain events of interest like the source and destination nodes of the edge. For this particular example, the initial ##[0:$] can match the execution up to cycle 5. At cycle 6, since the load of x returns 1, Ld x=0 @WB does not occur, so the execution cannot match that portion of the sequence. However, nothing stops the initial delay ##[0:$] from being extended another cycle and matching cycles 0-6. Indeed, even the entire execution can match ##[0:$], thus satisfying the property 5. Since Figure 6 s execution never violates the sequence, it is not a counterexample to the property. To address this problem of incorrect delay cycle matches, the conditions on the initial and intermediate delays must be stricter to stipulate that they are in fact repetitions of clock cycles where no events of interest occur. Section 4.3 describes the specific SVA constructs RTLCheck uses to represent µhb edges in this manner. 3.4 Fire-Once vs. Fire-Always Assertions An SV assertion or assumption has a clock signal which dictates what it considers a cycle. On every cycle of this clock, any properties move temporally forward one cycle. For example, if checking the following assertion with respect to the execution in Fig. 6 6 : assert property (@(posedge clk) ##2 <St x@wb>); one would expect the property to return true, as the WB stage of the store to x indeed occurs two cycles after the beginning of the execution. (Figure 6 elides the execution s first cycle for brevity.) However, SVA semantics are not that simple. In addition to dictating when the assertion should move forward one cycle, each cycle of clk also instantiates a unique copy of the assertion which checks the entire property from beginning to end, but starting from the current clock cycle. In other words, one match attempt of the property begins at cycle 1 and checks for St x @WB at cycle 3. Another match attempt begins at cycle 2 and checks for St x @WB at cycle 4, and so on for every cycle in the execution. If any of these match attempts fail, the entire property is considered to have failed. Here, the match attempt that begins at cycle 2 will fail, as St x will not be in WB at cycle 4 as the property requires. The conceptual mismatch here is that the SVA notion of an assertion (taken as a whole) is something that holds true starting from every cycle in an execution, whereas a microarchitecture-level happens-before axiom is merely enforced once with respect to an execution as a whole. To bridge this gap, the properties generated by RTLCheck must explicitly filter out match attempts that do not start at the beginning of the execution. Section 4.4 describes our mechanism for doing so. 4 TRANSLATING µspec AXIOMS TO SVA Figure 7 shows the high-level flow of RTLCheck. Three of the primary inputs to RTLCheck are the RTL design to be verified, a µspec model of the microarchitecture, and a suite of litmus tests to verify the design against. The other inputs to RTLCheck are the program and node mapping functions (described in Sections 4.1 and 4.3 respectively). Program and node mapping functions translate litmus tests and µhb nodes to initial/final state assumptions and equivalent RTL expressions respectively. RTLCheck has two main 5 At an intuitive level, some readers may find this logic similar to regular expression matching. 6 ##<i> specifies a delay of i cycles Program Mapping Function RTL Design MICRO-50, October 14 18, 2017, Cambridge, MA, USA RTLCheck Litmus Test Assumption Generator -Memory Initialization -Register Initialization -Value Assumptions Temporal SV Assumptions µspec Microarch. Axioms RTL Verifier (e.g. JasperGold) Result: Property Proven? Counterexample provided? Node Mapping Function Assertion Generator -Outcome-Aware Translation -Edge Mapping -Filtering Match Attempts Temporal SV Assertions Figure 7: Overall flow diagram of RTLCheck. components. The Assumption Generator (Section 4.1) generates SV assumptions constraining the executions examined by a verifier to those of the litmus test being verified. The Assertion Generator (Section 4.2) generates SV assertions that check the individual axioms specified in the µspec model for the specific litmus test while surmounting the axiomatic/temporal mismatches discussed in the previous section. While this section focuses on µspec microarchitectural axioms and temporal SV assertions at RTL, the flow would be similar for other microarchitectural axiom formats and other temporal property languages. 4.1 Assumption Generation RTLCheck s generated properties are litmus test-specific, so the executions examined by an RTL verifier for these properties need to be restricted to the executions of the litmus test in question. As seen in Figure 7, the Assumption Generator performs this task using a program mapping function provided by the user. Program mapping functions link a litmus test s instructions, initial conditions, and final values of loads and memory to RTL expressions representing these constraints on the implementation to be verified. The parameters provided to a program mapping function are the litmus test instructions, context information such as the base instruction ID for each core, and the initial and final conditions of the litmus test. The assumptions generated for a given litmus test must accomplish three tasks: (1) Initialize data and instruction memories to the litmus test s initial values and instructions respectively. (2) Initialize registers used by test instructions to appropriate address and data values. (3) Enforce that the values of loads and the final state of memory respect test requirements in generated RTL executions. Figure 8 shows a subset of the assumptions that must be generated for the mp litmus test from Figure 2 for Multi-V-scale. Memory Initialization: The first assumption in Figure 8 is an example of data memory initialization. It sets x to its initial value of 0 as required by mp. The assumption uses SVA implication ( ->) triggered on the first signal being 1. An SVA implication a -> b first checks a. If a evaluates to true, the implication then checks

MICRO-50, October 14 18, 2017, Cambridge, MA, USA Y. A. Manerkar et al. assume property (@(posedge clk) first -> mem[21] == {32'd0}); assume property (@(posedge clk) first -> mem[1] == {7'b0,5'd2,5'd1,3'd2,5'b0,`RV32_STORE}); assume property (@(posedge clk) (((core[1].pc_wb == 32'd24 && ~(core[1].stall_wb)) -> (core[1].pc_wb == 32'd24 && ~(core[1].stall_wb) && core[1].load_data_wb == 32'd1)) and ((core[1].pc_wb == 32'd28 && ~(core[1].stall_wb)) -> (core[1].pc_wb == 32'd28 && ~(core[1].stall_wb) && core[1].load_data_wb == 32'd0)))); assume property (@(posedge clk) (((core[0].halted == 1'b1 && ~(core[0].stall_wb)) && (core[1].halted == 1'b1 && ~(core[1].stall_wb)) && (core[2].halted == 1'b1 && ~(core[2].stall_wb)) && (core[3].halted == 1'b1 && ~(core[3].stall_wb))) -> (1))); Figure 8: A subset of the SV assumptions RTLCheck generates for mp. Some signal structure is omitted for brevity. or enforces b beginning that same cycle. The first signal is autogenerated by the Assumption Generator, and is set up so that it is 1 in the first cycle after reset and 0 on every subsequent cycle. Thus, the assumption only enforces that the value of the address in memory is equivalent to the initial condition of the litmus test at the beginning of the execution. This distinction is necessary as the verification needs to allow the address to change value as an execution progresses. (The first signal is also used to filter match attempts as Section 4.4 describes.) The second assumption is an instruction initialization assumption. It enforces that core 0 s first instruction is the store that is (i1) in Figure 2 s mp code. Register Initialization: The assembly encoding of litmus test instructions uses registers for addresses and data. Register initialization assumptions set these registers to the correct addresses and data values at the start of execution. They are similar in structure to memory initialization assumptions. Value Assumptions: Load value assumptions cannot be used to enforce an execution outcome (see Section 3.1), but they can still be used to guide the verifier and reduce the number of executions it needs to consider. The third assumption in Figure 8 contains two such implications. Each one checks for the occurrence of one of the loads in mp and enforces that it returns the value in the test outcome (0 and 1 for x and y respectively) when it occurs. The last assumption in Figure 8 is a final value assumption. It contains an implication whose antecedent is the condition that all cores have halted their Fetch stages and all test instructions have completed their execution. The consequent of the implication stipulates any final values of memory locations that are required by the litmus test. Since mp does not enforce any such requirements, the consequent of the implication is merely a 1. A pleasantly surprising side effect of assumption generation is that for certain tests, assumptions alone turn out to be sufficient in practice to verify the RTL. For most assumptions, JasperGold (the commercial RTL verifier used by RTLCheck) can find covering traces, which are traces where the assumption condition occurs and can be enforced. For instance, a covering trace for an assumption enforcing that the load of y returns 1 would be a partial execution where the load of y returns 1 in the last cycle. A covering trace for a final value assumption in particular would by definition contain the execution of all instructions in the test. The covering trace must also obey any constraints on instruction execution stipulated by other assumptions, including load value assumptions. As such, a covering trace for mp s final value assumption is an execution where the load of y returns 1 and the load of x returns 0. A search for such a trace is equivalent to finding an execution where the entire forbidden outcome of mp occurs. If JasperGold can prove that a covering trace for an assumption does not exist, it will label the assumption as unreachable. An unreachable final value assumption means that there are no executions satisfying the test outcome. This result verifies the RTL for that litmus test without checking the generated assertions. Thus, a final value assumption forces JasperGold to try and find a covering trace of the litmus test outcome, possibly leading to quicker verification. As such, final value assumptions are beneficial even when the test does not specify final values of memory, but we expect this benefit to be largest in relatively small designs. Section 7.2 quantifies our results. 4.2 Outcome-Aware Assertion Generation As Figure 7 shows, the Assertion Generator is responsible for translating µspec axioms to SV assertions checking the corresponding properties at RTL. The Assertion Generator translates µspec primitives like /\ and \/ to their SVA counterparts and and or. It reuses most of the Check suite s µspec parsing engine and axiom simplification, but its translation must account for the axiomatic-temporal mismatch as outlined in Section 3. Section 3.2 shows that the assertions generated by RTLCheck must account for all possible outcomes of the litmus test, not just its specified outcome. For example, the assertion generated for the axiom in Figure 5 for mp s load of x must account for both the case where the load of x returns 1 and the case where it returns 0. (The load of x cannot return any other values in an execution of mp.) If the load of x returns the initial value of 0, this corresponds to the BeforeAllWrites portion of the axiom. The SVA property for this case must check that Ld x @WB hb St x @WB and that the load returns 0. Similarly, if the load returns 1, this corresponds to the part of the axiom comprising NoInterveningWrite and BeforeOrAfterEveryWrite. The equivalent SVA property here must check that St x @WB hb Ld x @WB and that the load returns 1. To accomplish this outcome-aware translation, the mapping onto RTL of any µhb edge containing nodes of load instructions must account for any constraints the edge s axiom enforces on the values of those loads. (These constraints are henceforth referred to as load value constraints.) For mp s load of x, the µhb edge in Figure 5 s BeforeAllWrites macro requires DataFromInitialStateAtPA i to be true, which in turn requires that the load returns the initial value 0. This requirement on the load s value is stipulated as a load value constraint when mapping the edge from BeforeAllWrites. Similarly, the µhb edges in Figure 5 s NoInterveningWrite and

RTLCheck: Verifying the Memory Consistency of RTL Designs fun mapnode(node, context, load_constr) := let pc := getpc(node.instr, context) in let core := node.core in match node.stage with IF => return "core[" + core + "].PC_IF == " + pc + " && ~stall_if" DX => return "core[" + core + "].PC_DX == " + pc + " && ~stall_dx" WB => let lc := get_lc(node, load_constr) in let str := "core[" + core + "].PC_WB == " + pc + " && ~stall_wb" in if lc!= None then str += (" && load_data_wb == " + lc.value) return str Figure 9: Multi-V-scale node mapping function pseudocode. BeforeOrAfterEveryWrite macros require SameData w i to be true. In the case of mp s load of x, this predicate enforces that the load returns 1, and this requirement is also encoded as a load value constraint when mapping the edge. The mapped edges of the two cases are combined with an SVA or (translated from the µspec \/), allowing the property to cater to both the executions where the load of x returns 0 and the executions where it returns 1, as required by RTL verifiers. In µspec, the DataFromFinalStateAtPA i predicate returns true if i stores a value equivalent to the final value of its address in the litmus test. Accounting for this predicate at RTL requires ensuring that a given write is the last write to a particular address in an execution. However, the inability of SVA verifiers to check for future violation of assumptions also makes it impossible to ensure that a particular write happens last in an RTL execution. Thus, when translating this predicate, assertion generation always (conservatively) evaluates this predicate to false. Doing so ensures that the generated properties check all possible orderings of writes (i.e., a superset of the executions that the Check suite would examine), including the write ordering the litmus test focuses on. 4.3 Mapping µhb Edges to SVA When mapping an edge between two µhb nodes to SVA, RTLCheck requires some notion of what the nodes represent at RTL. This functionality is provided by the node mapping function written by the user and provided to RTLCheck as input. Figure 9 shows pseudocode for a node mapping function for Figure 1 s Multi-Vscale processor. The input parameters for a node mapping function are (i) the node to be mapped, which is a specific microarchitectural event for a specific instruction, (ii) context information, such as the starting program counter (PC) for each core, and (iii) any load value constraints that must be obeyed by the mapping of that node (as described in Section 4.2). The output of the node mapping function is a Verilog expression that corresponds to the occurrence of the node to be mapped in RTL. As shown in Section 3.3, mapping µspec edges using standard SVA unbounded ranges misses bugs. Instead, the conditions on the initial and intermediate delays in an edge mapping must be stricter to stipulate that they are in fact repetitions of clock cycles where no events of interest occur. In this context, an event is of MICRO-50, October 14 18, 2017, Cambridge, MA, USA interest if it matches the node in question (i.e., if it matches the microarchitectural instruction and event), but regardless of the data values themselves. As such, for initial delays and intermediate cycles, we use this sequence 7 : (~(mapnode(src, None) mapnode(dest, None))) [*0:$] No load constraints are passed to the calls to the mapping function to generate the delay sequence. This prevents delay cycles from matching cases where events of interest occur with incorrect values, as in the case of the load of x from Section 3.3. The overall mapping RTLCheck uses for happens-before edges is: (~(mapnode(src, None) mapnode(dest, None))) [*0:$] ##1 mapnode(src, lc) ##1 (~(mapnode(src, None) mapnode(dest, None))) [*0:$] ##1 mapnode(dest, lc) This edge mapping is capable of handling variable delays while still correctly checking the edge s ordering. Finally, some properties require simply checking for the existence of a µhb node, rather than an edge between two nodes. In these cases, we use a similar but simpler strategy to that used for mapping edges. The existence of a node is equivalent to an execution consisting of zero or more cycles where the node does not occur followed by a cycle where it does occur; in other words: (~map(node, None)) [*0:$] ##1 map(node, lc) 4.4 Filtering Match Attempts Section 3.4 showed how naive assertions generate multiple match attempts in contradiction to microarchitectural intent. Only the first match attempt is capable of matching all events from a microarchitectural axiom, so it is the only one that should be checked to match microarchitectural intent. To filter out match attempts other than the first attempt, RTLCheck guards each of its assertions with implications triggered by the first signal, which is auto-generated by the Assumption Generator (Section 4.1). The original assertion in Section 3.4 would attempt a match every cycle: assert property (@(posedge clk) ##2 <St x@wb>); Instead, we guard the assertion with an implication on first: assert property (@(posedge clk) first -> ##2 <St x@wb>); Now, any match attempt that begins at a cycle after the first one will trivially evaluate to true, as the first signal will be 0 and the implication consequent will never be evaluated. Meanwhile, the match attempt beginning at the first cycle will trigger the implication and cause evaluation of the consequent property as required. Putting it all together, Figure 10 shows an example assertion for mp which checks for the existence of an edge Ld x=0 @WB hb St x @WB on Multi-V-scale. Note that the load s WB stage checks that the data value returned is 0, as required by the load value constraint. In general, assertions check multiple edges, and are larger than the one in Figure 10. 5 MULTI-V-SCALE: A SIMPLE MULTICORE This section describes the relevant details of the Multi-V-scale processor, a multicore version of the RISC-V V-scale processor [42]. The 7 <x> [*0:$] represents possibly infinite repetitions of x