# Automated Full-Stack Memory Model Verification with the Check Suite

Yatin Manerkar

Princeton University

ARM Cambridge, July 20th, 2018
http://check.cs.princeton.edu/

## What are Memory (Consistency) Models?

## Memory Consistency Models (MCMs)

Specify rules and guarantees about the ordering and visibility of accesses to shared memory [Sorin et al., 2011].



## Sequential Consistency (SC) - Interleaving Model

- Defined by [Lamport 1979]: the result of any execution is the same as if:
  - (R1) Memory ops of each processor appear in program order
  - (R2) Memory ops of all processors were executed in some total order (a load reads the value of the last store to its address in the total order)

Program (mp litmus test)
(all addrs initially 0)

| Core 0 | Core 1 |
| :--- | :--- |
| $x=1$ | $r1=y$ |
| $y=1$ | $r2=x$ |

Legal Executions

| Step | Exec 1 | Exec 2 | Exec 3 | Exec 4 | Exec 5 | Exec 6 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | $x=1$ | $r1=y$ | $x=1$ | $x=1$ | $r1=y$ | $r1=y$ |
| 2 | $y=1$ | $r2=x$ | $r1=y$ | $r1=y$ | $x=1$ | $x=1$ |
| 3 | $r1=y$ | $x=1$ | $r2=x$ | $y=1$ | $r2=x$ | $y=1$ |
| 4 | $r2=x$ | $y=1$ | $y=1$ | $r2=x$ | $y=1$ | $r2=x$ |
| Outcome | $r1=1, r2=1$ | $r1=0, r2=0$ | $r1=0, r2=1$ | $r1=0, r2=1$ | $r1=0, r2=1$ | $r1=0, r2=1$ |

Illegal Outcome: $r1=1, r2=0$
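The interleaving rules (R1) and (R2) can be checked mechanically for a test as small as mp. The following is a minimal Python sketch, not part of the original talk: the tuple encoding of operations and the helper names `interleavings` and `run` are assumptions made for illustration.

```python
from itertools import combinations

# Each op is (kind, address, value-or-register); list order is program order.
CORE0 = [("st", "x", 1), ("st", "y", 1)]
CORE1 = [("ld", "y", "r1"), ("ld", "x", "r2")]

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order (R1)."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]

def run(total_order):
    """Execute one total order (R2): a load reads the last store to its address."""
    mem = {"x": 0, "y": 0}  # all addrs initially 0
    regs = {}
    for kind, addr, arg in total_order:
        if kind == "st":
            mem[addr] = arg
        else:
            regs[arg] = mem[addr]
    return regs["r1"], regs["r2"]

# Collect the (r1, r2) outcome of each of the six interleavings.
outcomes = {run(order) for order in interleavings(CORE0, CORE1)}
print(sorted(outcomes))
```

The outcome `(1, 0)` never appears in `outcomes`, which matches the illegal outcome listed for mp under SC.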


## Hardware Implements Weak Memory Models

- Most processors don’t implement SC
  - x86: Total Store Order (TSO): relaxes Write->Read ordering
  - ARMv8 and Power relax more orderings
- Compilation to weak memory ISAs must maintain ordering guarantees
  - [Owens et al. TPHOLS 2009], [Batty et al. POPL 2011, POPL 2012], [Wickerson et al. OOPSLA 2015], ...


## C11 Source Code

Initially: `atomic<int> x = 0; atomic<int> y = 0;`

| Thread 0 | Thread 1 |
| :--- | :--- |
| `x = 1;` | `r1 = y;` |
| `y = 1;` | `r2 = x;` |

C11 Forbids: $r1=1, r2=0$

## ARMv8 Assembly Language

Initially, $[x]=[y]=0$

| Core 0 | Core 1 |
| :--- | :--- |
| stl \#1, [x] | lda r1, [y] |
| stl \#1, [y] | lda r2, [x] |

ARMv8 forbids: $r1=1, r2=0$

## Is the ARMv8 hardware correctly implementing the ARMv8 MCM?

## MCM Verification is a Full-Stack Problem!

[Batty et al. POPL 2011, POPL 2012]
[Alglave et al. TOPLAS 2014]
[Wickerson et al. OOPSLA 2015]
...

| Layer | Question |
| :--- | :--- |
| High-Level Languages (HLL) |  |
| Compiler | Is compiler maintaining HLL guarantees? |
| OS | Are virtual memory mappings correct? |
| Architecture (ISA) | Is the ISA-level MCM formally defined? |
| Microarchitecture | Is hardware incorrectly reordering instructions? |
| Processor RTL | Is RTL correctly implementing microarchitecture? |

- Each layer has responsibilities for ensuring correct MCM operation
- Need MCM checking tools at all layers of the computing stack!


## Check Suite: Full-Stack Automated MCM Analysis

Stack layers: High-Level Languages (HLL), Compiler, Architecture (ISA), Microarchitecture, Processor RTL

| Tool | Reference |
| :--- | :--- |
| TriCheck | [Trippel et al. ASPLOS 2017] |
| COATCheck | [Lustig et al. ASPLOS 2016] |
| PipeCheck | [Lustig et al. MICRO 2014] |
| CCICheck | [Manerkar et al. MICRO 2015] |
| RTLCheck | [Manerkar et al. MICRO 2017] |

- Suite of tools at various levels of computing stack
- Automated Full-Stack MCM checking across litmus test suites

So far, tools have found bugs in:

- Widely-used gem5 research simulator
- Cache coherence paper (TSO-CC)
- IBM XL C++ compiler (fixed in v13.1.5)
- In-design commercial processors
- RISC-V draft ISA specification
- Compiler mapping proofs
- C11 memory model
- Open-source processor RTL


## Modelling Microarchitecture: Going below the ISA

- Hardware enforces consistency model using smaller localized orderings
  - In-order fetch/decode/execute...
  - Orderings enforced by memory hierarchy
  - ...and many more
- Pipeline stages may be FIFO to ensure in-order execution

## Do individual orderings correctly work together to satisfy consistency model?

## Microarchitectural Consistency Checking

Microarchitecture in $\mu \mathrm{spec}$ DSL

```
Axiom "Decode_is_FIFO":
... EdgeExists ((i1, Decode), (i2, Decode))
    => AddEdge ((i1, Execute), (i2, Execute)).

Axiom "PO_Fetch":
... SameCore i1 i2 /\ ProgramOrder i1 i2 =>
    AddEdge ((i1, Fetch), (i2, Fetch)).
```

Each axiom specifies an ordering that $\mu$arch should respect

Litmus Test

| Core 0 | Core 1 |
| :---: | :---: |
| (i1) $[\mathrm{x}] \leftarrow 1$ | (i3) $\mathrm{r} 1 \leftarrow[\mathrm{y}]$ |
| (i2) $[\mathrm{y}] \leftarrow 1$ | (i4) $\mathrm{r} 2 \leftarrow[\mathrm{x}]$ |
| Under SC: Forbid $\mathrm{r} 1=1, \mathrm{r} 2=0$ |  |

Microarchitectural happens-before ($\mu \mathrm{hb}$) graphs

Microarch. verification checks that combination of axioms satisfies MCM
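To make the idea of axioms adding edges concrete, here is a rough Python sketch of the two axioms shown on the slide. It is not the $\mu$spec implementation: the dictionary `CORE_OF`, the list `ORDER`, and every function name are hypothetical, and the elided `...` prefix of each axiom is assumed to quantify over pairs of instructions.

```python
from itertools import permutations

# Hypothetical encoding of the litmus test: instruction -> core.
CORE_OF = {"i1": 0, "i2": 0, "i3": 1, "i4": 1}
ORDER = list(CORE_OF)  # list order stands in for program order within a core

def same_core(a, b):
    return CORE_OF[a] == CORE_OF[b]

def program_order(a, b):
    return same_core(a, b) and ORDER.index(a) < ORDER.index(b)

def po_fetch(edges):
    """PO_Fetch: same-core instructions are fetched in program order."""
    new = {((a, "Fetch"), (b, "Fetch"))
           for a, b in permutations(ORDER, 2) if program_order(a, b)}
    return edges | new

def decode_is_fifo(edges):
    """Decode_is_FIFO: a Decode->Decode edge implies an Execute->Execute edge."""
    new = {((a, "Execute"), (b, "Execute"))
           for (a, s), (b, t) in edges if s == t == "Decode"}
    return edges | new

# A node is an (instruction, stage) pair; an edge is a (node, node) pair.
edges = po_fetch(set())
edges = decode_is_fifo(edges | {(("i1", "Decode"), ("i2", "Decode"))})
print(sorted(edges))
```

Each function only ever adds edges, so the axioms can be applied in any order and repeated until no new edges appear.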


## PipeCheck: Executions as $\mu \mathrm{hb}$ Graphs [Lustig et al. MICRO 2014]

Litmus Test mp

| Core 0 | Core 1 |
| :---: | :--- |
| (i1) St $[\mathrm{x}] \leftarrow 1$ | (i3) Ld r1 $\leftarrow[\mathrm{y}]$ |
| (i2) St $[\mathrm{y}] \leftarrow 1$ | (i4) Ld r2 $\leftarrow[\mathrm{x}]$ |
| Under TSO: Forbid r1=1, r2=0 |  |

(Figure: $\mu \mathrm{hb}$ graph with one column per instruction: (i1) and (i2) on Core 0, (i3) and (i4) on Core 1.)

## PipeCheck: Microarchitectural Correctness

Litmus Test mp

| Core 0 | Core 1 |
| :---: | :---: |
| (i1) $[\mathrm{x}] \leftarrow 1$ | (i3) $\mathrm{r} 1 \leftarrow[\mathrm{y}]$ |
| (i2) $[\mathrm{y}] \leftarrow 1$ | (i4) $\mathrm{r} 2 \leftarrow[\mathrm{x}]$ |

Under SC: Forbid r1=1, r2=0

(Figure: $\mu \mathrm{hb}$ graph for mp with columns (i1) to (i4) and rows D, X, M, W, SB, MemHier, Compl.)

- Cycle in $\mu \mathrm{hb}$ graph => event has to happen before itself (impossible)
- Cyclic graph $\rightarrow$ unobservable on $\mu$arch
- Acyclic graph $\rightarrow$ observable on $\mu$arch
- Exhaustively enumerate and check all possible execs of litmus test on $\mu$arch
- Implemented using fast SMT solvers
- Compare against ISA-level outcome from herd [Alglave et al. TOPLAS 2014]

| ISA-Level <br> Outcome | Observable <br> ($\geq 1$ Graph Acyclic) | Not Observable <br> (All Graphs Cyclic) |
| :---: | :---: | :---: |
| Allowed | OK | OK (stricter <br> than necessary) |
| Forbidden | Consistency violation! | OK |
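The cyclic/acyclic rule and the verdict table can be sketched in a few lines of Python. This is only an illustration: the talk states that PipeCheck uses SMT solvers, whereas this sketch swaps in an explicit depth-first search over already-enumerated candidate graphs, and the function names `has_cycle`, `observable`, and `verdict` are assumptions.

```python
def has_cycle(nodes, edges):
    """Depth-first search for a cycle in a directed graph."""
    succ = {n: [] for n in nodes}
    for src, dst in edges:
        succ[src].append(dst)
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    colour = {n: WHITE for n in nodes}

    def visit(n):
        colour[n] = GREY
        for m in succ[n]:
            # A GREY successor means we looped back onto the current path.
            if colour[m] == GREY or (colour[m] == WHITE and visit(m)):
                return True
        colour[n] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in nodes)

def observable(candidate_graphs):
    """An outcome is observable iff at least one candidate graph is acyclic."""
    return any(not has_cycle(nodes, edges) for nodes, edges in candidate_graphs)

def verdict(isa_allows, is_observable):
    """Reproduce the four cells of the comparison table."""
    if is_observable:
        return "OK" if isa_allows else "Consistency violation!"
    return "OK (stricter than necessary)" if isa_allows else "OK"

# Toy input: a forbidden outcome whose only candidate graph is cyclic.
cyclic = (["a", "b"], [("a", "b"), ("b", "a")])
print(verdict(isa_allows=False, is_observable=observable([cyclic])))
```

The recursive search is adequate for litmus-test-sized graphs; a larger graph would call for an iterative traversal.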

## Abstracted memory hierarchy prevents verification of complex coherence issues!

## CCICheck: Coherence vs Consistency

- Memory hierarchy is a collection of caches
- Coherence protocols ensure that all caches agree on the value of any variable
- CCICheck [Manerkar et al. MICRO 2015] shows that consistency verification often cannot simply treat memory hierarchy abstractly
  - Nominated for Best Paper at MICRO 2015




## Coherence Protocol Example

- If P1 updates the value of $x$ to 200, the stale value of $x$ in other processors must be invalidated
- If P3 wants to subsequently read/write $x$, it must request the new value
- SWMR = Single-Writer Multiple Readers, DVI = Data Value Invariant

(Figure: processors and their caches; the animation steps are labelled Invalidations, Request Data, and Data Response.)
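The three steps of the example (invalidate, request, respond) can be mimicked with a toy invalidate-on-write model in Python. This is a sketch under loud assumptions: the class `ToyCoherence`, the write-through memory, and the initial value 100 for $x$ are all invented for illustration and do not describe any protocol from the talk.

```python
class ToyCoherence:
    """Invalidate-on-write protocol over per-processor caches (illustrative only)."""

    def __init__(self, processors, memory):
        self.memory = dict(memory)
        self.caches = {p: {} for p in processors}  # processor -> {addr: value}

    def write(self, p, addr, value):
        # Invalidations: drop every other processor's copy, so there is
        # a single writer and no stale readers (SWMR).
        for q, cache in self.caches.items():
            if q != p:
                cache.pop(addr, None)
        self.caches[p][addr] = value
        self.memory[addr] = value  # write-through, to keep the sketch short

    def read(self, p, addr):
        if addr not in self.caches[p]:
            # Request Data / Data Response: fetch the current value.
            self.caches[p][addr] = self.memory[addr]
        return self.caches[p][addr]

system = ToyCoherence(["P1", "P2", "P3"], {"x": 100})  # 100 is an assumed initial value
system.read("P3", "x")          # P3 caches the old value
system.write("P1", "x", 200)    # P1's update invalidates P3's copy
print(system.read("P3", "x"))   # P3 must request the new value
```

After the write, P3 no longer holds a copy of `x`, so its next read goes back to memory and returns the value P1 wrote.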

## Motivating Example - "Peekaboo" [Sorin et al. Primer 2011]

- Three optimizations: correct individually, but not in combination

1. Prefetching
2. Invalidation before use
   - Invalidation can arrive before data
   - Acknowledge Inv early rather than wait for data to arrive
   - But repeated inv before use $\rightarrow$ livelock [Kubiatowicz et al. ASPLOS 1992]
3. Livelock avoidance: allow destination core to perform one operation on data when it arrives, even if already invalidated [Sorin et al. Primer 2011]
   - Does not break coherence
   - Sometimes intentionally returns stale data

## Motivating Example - "Peekaboo"

- Consider mp with the livelock-avoidance mechanism:

| Core 0 | Core 1 |
| :---: | :---: |
| (i1) $[\mathrm{x}] \leftarrow 1$ | (i3) $\mathrm{r} 1 \leftarrow[\mathrm{y}]$ |
| (i2) $[\mathrm{y}] \leftarrow 1$ | (i4) $\mathrm{r} 2 \leftarrow[\mathrm{x}]$ |
| Under SC: Forbid $\mathrm{r} 1=1, \mathrm{r} 2=0$ |  |

(Figure: cache states during the animation. Core 0 is shown with x: Shared and y: Modified; Core 1 with x: Invalid and y: Invalid; a Data (x=0) message label also appears.)

Optimizations:

1. Prefetching
2. Invalidation-before-use
3. Livelock avoidance
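One interleaving in which the three optimizations combine to produce the forbidden mp outcome can be written as a straight-line Python trace. The ordering of events below is an assumption reconstructed from the descriptions of the optimizations, not a transcription of the slide animation, and the function name `peekaboo_trace` is hypothetical.

```python
def peekaboo_trace():
    """Replay one assumed event ordering and return (r1, r2)."""
    memory = {"x": 0, "y": 0}
    acked_before_data = set()  # addresses invalidated at Core 1 before data arrived

    # 1. Prefetching: Core 1 requests x early; the response carries x=0
    #    and is still in flight.
    in_flight = ("x", memory["x"])

    # (i1) Core 0 stores x=1. The invalidation reaches Core 1 before the data.
    # 2. Invalidation before use: Core 1 acknowledges it immediately.
    acked_before_data.add("x")
    memory["x"] = 1

    # (i2) Core 0 stores y=1.
    memory["y"] = 1

    # (i3) Core 1 loads y and sees the new value.
    r1 = memory["y"]

    # (i4) Core 1 loads x. The prefetched data arrives now, already invalidated.
    addr, stale_value = in_flight
    assert addr in acked_before_data
    # 3. Livelock avoidance: one operation may still use the arriving data.
    r2 = stale_value

    return r1, r2

print(peekaboo_trace())
```

Each step is legal for the optimization it models, yet the trace ends with r1=1 and r2=0, the outcome that SC forbids for mp.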


## The Coherence-Consistency Interface (CCI)

- CCI = coherence protocol guarantees to microarch. + orderings microarch. expects from coherence protocol





## ViCL: Value in Cache Lifetime

- Need a way to model cache occupancy and coherence events for:
  - Coherence protocol optimizations (e.g., Peekaboo)
  - Partial incoherence and lazy coherence (GPUs, etc.)
- A ViCL is a 4-tuple: (cache_id, address, data_value, generation_id)
  - cache_id and generation_id uniquely identify each cache line
- A ViCL 4-tuple maps on to the period of time over which the cache line serves the data value for the address
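As a data structure, the 4-tuple can be sketched as an immutable Python record. The field names come from the slide; the field types, the class name `ViCL`, and the helper `line` are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ViCL:
    cache_id: int        # which cache holds the line (type assumed)
    address: str         # address being served (type assumed)
    data_value: int      # value served during this lifetime
    generation_id: int   # distinguishes successive lifetimes in one cache

    def line(self):
        """(cache_id, generation_id) identifies the cache line, per the slide."""
        return (self.cache_id, self.generation_id)

# Two lifetimes of x in the same cache, holding different values.
old = ViCL(cache_id=1, address="x", data_value=0, generation_id=0)
new = ViCL(cache_id=1, address="x", data_value=1, generation_id=1)
print(old.line(), new.line())
```

Freezing the dataclass makes each ViCL hashable, so it can be used directly as part of a graph node.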


## ViCLs in $\mu \mathrm{hb}$ Graphs

- ViCLs start at a ViCL Create event and end at a ViCL Expire event
  - Correspond to nodes in $\mu \mathrm{hb}$ graphs
- Axioms over these nodes and edges enforce coherence and data movement orderings
- Use pipeline model from PipeCheck, but add ViCL nodes and edges

Litmus Test co-mp

| Core 0 | Core 1 |
| :---: | :--- |
| (i1) St $[\mathrm{x}] \leftarrow 1$ | (i3) Ld r1 $\leftarrow[\mathrm{x}]$ |
| (i2) St $[\mathrm{x}] \leftarrow 2$ | (i4) Ld r2 $\leftarrow[\mathrm{x}]$ |
| In TSO: r1=2, r2=2 Allowed |  |

(Figure: $\mu \mathrm{hb}$ graph for co-mp with columns (i1) to (i4); one edge label is SourcedFrom.)

## $\mu \mathrm{hb}$ Graph for the Peekaboo Problem

[Figure: $\mu \mathrm{hb}$ graph for the Peekaboo problem; visible row label: FetchStage]

## CCICheck Takeaways

- Coherence \& consistency often closely coupled in implementations
- In such cases, coherence \& consistency cannot be verified separately
- CCICheck: CCI-aware microarchitectural MCM checking
- Uses ViCL (Value in Cache Lifetime) abstraction
- Discovered bug in TSO-CC lazy coherence protocol


## ISA-level MCMs in the Hardware-Software Stack

High-Level Languages (HLLs)

Which orderings does the compiler need to enforce?

New ISA-level MCM

Which orderings must be guaranteed by hardware?

Hardware

TriCheck checks that HLL, compiler, ISA, and hardware align on MCM requirements

## TriCheck: Layers of the Stack are Intertwined

High-Level Languages (HLL)

Compiler

- ISA-level MCMs should allow microarchitectural optimizations but also be compatible with HLLs
- TriCheck [Trippel et al. ASPLOS 2017] enables holistic analysis of HLL memory model, ISA-level MCM, compiler mappings, and microarchitectures
- Mapping: translation of HLL synchronization primitives to one or more assembly language instructions
- Also useful for checking HLL compiler mappings to ISA-level MCMs
- Selected as one of 12 "Top Picks of Comp. Arch. Conferences" for 2017


## TriCheck: Comparing HLL to Microarchitecture

[Figure: TriCheck flow. Inputs: HLL model (e.g. C11), HLL-to-ISA compiler mapping, $\mu$spec microarch. model]

Translate HLL litmus tests to ISA-level litmus tests



## Using TriCheck for ISA MCM Design: RISC-V

- Ran TriCheck on draft RISC-V ISA MCM with
- C11 HLL MCM [Batty et al. POPL 2011] [Batty et al. POPL 2016]
- Compiler mappings based on RISC-V manual
- Variety of microarchitectures that relaxed various memory orderings
- All legal according to draft RISC-V spec
- Ranging from SC microarchitecture to one with reorderings allowed by ARM/Power
- Draft RISC-V MCM for Base ISA incapable of correctly compiling C11:
- C11 outcome forbidden, but impossible to forbid on hardware
- RISC-V fences too weak to restore orderings that implementations could relax
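
The comparison behind these findings can be sketched as a small decision function: an outcome the HLL model forbids but a microarchitecture can produce signals a problem somewhere in the stack. The function name `classify` and the three labels are assumptions chosen for this sketch, not TriCheck's actual interface.

```python
def classify(hll_allows: bool, uarch_observable: bool) -> str:
    """Compare an HLL verdict with microarchitectural observability for one outcome."""
    if not hll_allows and uarch_observable:
        # C11 forbids the outcome, yet the hardware can produce it.
        return "bug"
    if hll_allows and not uarch_observable:
        # Hardware is stricter than the HLL requires (hypothetical label).
        return "stricter than required"
    return "ok"


# The case described above: outcome forbidden by C11 but observable on hardware.
verdict = classify(hll_allows=False, uarch_observable=True)
print(verdict)
```

Running such a comparison over a family of litmus tests, compiler mappings, and microarchitecture models is what makes the analysis holistic rather than layer by layer.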

## Current RISC-V Status

- In response to our findings, RISC-V Memory Model Working Group was formed (we are members)
- Mandate to create an MCM for RISC-V that satisfies community needs
- Working Group has developed an MCM proposal that fixes the aforementioned bugs (and other issues)
- MCM proposal recently passed the 45-day public feedback period!
- Well on its way to being included in the next version of the RISC-V ISA spec


## TriCheck: Analysing Compiler Mappings




## Checking C11 Mappings to ARMv7/Power

- Ran TriCheck on microarch. with reordering similar to ARMv7/Power
- Utilised "trailing-sync" compiler mapping [Batty et al. POPL 2012]
- Discovered 2 cases where C11 outcome forbidden, but allowed by hardware!
- Deduced that the mapping must be flawed
- Mapping was supposedly proven correct [Batty et al. POPL 2012]
- Traced the loophole in the proof [Manerkar et al. CoRR'16]
- Problem: C11 model slightly too strong for mappings
- C11 has happens-before ( $h b$ ) ordering and total order on all SC accesses (sc)
- hb and sc orders must agree with each other
- Trailing-sync mapping does not guarantee this for our counterexamples


## Current state of C11

- "Leading-sync" mapping [McKenney and Silvera 2011]
- Counterexample discovered concurrently to us [Lahav et al. PLDI 2017]
- Both mappings currently broken
- Possible solutions under discussion by C11 memory model committee:
- RC11 [Lahav et al. PLDI 2017]: remove req. that $s c$ and $h b$ orders agree
- Current mappings work, but reduces intuition in an already complicated C11 model
- Adding extra fences to mappings
- low performance, requires recompilation, counterexample pattern not common


## TriCheck Takeaways

- Both HLL memory models and microarchitectural optimizations influence the design of ISA-level MCMs
- TriCheck enables holistic analysis of HLL memory model, ISA-level MCM, compiler mappings, and microarchitectural implementations
- TriCheck discovered numerous issues with draft RISC-V MCM
- Influenced the design of the new RISC-V MCM
- Discovered two counterexamples to C11 -> ARMv7/Power compiler mappings
- Mappings were previously "proven" correct; isolated flaw in proof


## Memory Consistency Checking for RTL

Microarchitecture Checking

How to ensure RTL maintains orderings?

## RTLCheck: Checking RTL Implementations

- RTLCheck [Manerkar et al. MICRO 2017] enables checking microarchitectural axioms against an implementation's Verilog RTL for litmus test suites
- This helps ensure that the RTL maintains orderings required for consistency
- Selected as an Honorable Mention from the "Top Picks of Comp. Arch. Conferences" for 2017

## RTL Verification is Maturing...

- ...but usually ignores memory consistency!
- Often use SystemVerilog Assertions (SVA)

ISA-Formal [Reid et al. CAV 2016]

- Instr. Operational Semantics
- No MCM verification

DOGReL [Stewart et al. DIFTS 2014]

- Memory subsystem transactions
- No multicore MCM verification (?)

Kami [Vijayaraghavan et al. CAV 2015] [Choi et al. ICFP 2017]

- MCM correctness for all programs, but...
- Needs Bluespec design and manual proofs!

Lack of automated memory consistency verification at RTL!



## RTLCheck: Checking RTL Consistency Orderings

[Figure: RTLCheck flow; visible labels: RTL Design, Temporal SystemVerilog Assertions (SVA), Cadence JasperGold (RTL Verifier)]

RTLCheck automatically translates $\mu$arch. ordering axioms to temporal properties


## Meaning can be Lost in Translation!

小心地滑 (Caution: Slippery Floor), mistranslated as "Slip Carefully"


## RTLCheck: Checking Consistency at RTL

- Axiomatic: microarch. analysis, with abstract nodes and happens-before edges
- Temporal: RTL verification (SVA, etc.)

Axiomatic/Temporal Mismatch!


## Outcome Filtering in Axiomatic Analysis

- Outcome Filtering: Restrict test outcome to one particular outcome
- Allows for more efficient verification
- Axiomatic models make outcome filtering easy

mp (Message Passing)

| Core 0 | Core 1 |
| :---: | :---: |
| (i1) $x=1$; | (i3) $r1=y$; |
| (i2) $y=1$; | (i4) $r2=x$; |
| Outcome: $r1=1, r2=1$ |  |

Execution examined as a whole, so outcome can be enforced!
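
To make "execution examined as a whole" concrete, the sketch below enumerates every SC interleaving of the mp test and then filters complete executions by outcome. It is a toy model written for this note (the `PROGRAM` encoding and the function names are assumptions), not the $\mu$spec analysis itself.

```python
from itertools import permutations

# mp litmus test: each core is a list of (kind, address, value-or-register).
PROGRAM = {
    0: [("st", "x", 1), ("st", "y", 1)],
    1: [("ld", "y", "r1"), ("ld", "x", "r2")],
}


def sc_executions(program):
    """Yield (order, outcome) for every interleaving that respects program order."""
    ops = [(core, i) for core, instrs in program.items() for i in range(len(instrs))]
    for order in permutations(ops):
        pos = {op: n for n, op in enumerate(order)}
        in_program_order = all(
            pos[(core, i)] < pos[(core, i + 1)]
            for core, instrs in program.items()
            for i in range(len(instrs) - 1)
        )
        if not in_program_order:
            continue
        mem = {"x": 0, "y": 0}  # all addresses initially 0
        regs = {}
        for core, i in order:
            kind, addr, arg = program[core][i]
            if kind == "st":
                mem[addr] = arg
            else:
                regs[arg] = mem[addr]
        yield order, (regs["r1"], regs["r2"])


executions = list(sc_executions(PROGRAM))
outcomes = {outcome for _, outcome in executions}

# Outcome filtering: keep only complete executions with r1 = 1, r2 = 0.
filtered = [order for order, outcome in executions if outcome == (1, 0)]

print(sorted(outcomes))  # [(0, 0), (0, 1), (1, 1)]
print(len(filtered))     # 0, so SC forbids r1 = 1, r2 = 0
```

The filter is applied only after a whole execution is known, which is exactly the step a cycle-by-cycle temporal check cannot take.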

## Outcome Filtering in Temporal Verification

- Filtering executions by outcome requires expensive global analysis
- Not done by many SVA verifiers, including JasperGold!

mp

| Core 0 | Core 1 |
| :---: | :---: |
| (i1) $x=1$; | (i3) $r1=y$; |
| (i2) $y=1$; | (i4) $r2=x$; |
| Is $r1=1, r2=0$ possible? |  |

Step 1: (i1) $x=1$ $\rightarrow$ Step 2: (i2) $y=1$ $\rightarrow$ Step 3: (i3) $r1=y=1$ $\rightarrow$ Step 4: (i4) $r2=x=1$

Need to examine all possible paths from current step to end of execution: too expensive!

SVA Verifier Approximation: Only check if constraints hold up to current step. Makes Outcome Filtering impossible!

## $\mu$spec Analysis Uses Outcome Filtering

| Core 0 | Core 1 |
| :---: | :---: |
| (i1) $x=1$; | (i3) $r1=y$; |
| (i2) $y=1$; | (i4) $r2=x$; |
| SC Forbids: $r1=1, r2=0$ |  |

```
Axiom "Read_Values":
Every load either reads BeforeAllWrites OR reads FromLatestWrite
```

No write for load to read from!

Outcome Filtering leads to simpler axioms!

## Temporal Outcome Filtering Fails!

Filtered Read_Values: Unless Load returns non-zero value, Load happens before all stores to its address

mp

| Core 0 | Core 1 |
| :---: | :---: |
| (i1) $x=1$; | (i3) $r1=y$; |
| (i2) $y=1$; | (i4) $r2=x$; |
| SC Forbids: $r1=1, r2=0$ |  |

[Figure: waveform over time (cycles); signals: clk, Core[0].Commit, Core[0].SData, Core[1].Commit_X, Core[1].LData]

- After 3 cycles: Store happens before load! Property violated?
- After 6 cycles: Load does not read 0. No violation!
- But SVA verifiers don't check future cycles!

## Solution: Load Value Constraints

- Don't simplify axioms; translate all cases
- Tag each case with appropriate load value constraints
- reflect the data constraints required for edge(s)

mp

| Core 0 | Core 1 |
| :---: | :---: |
| (i1) $x=1$; | (i3) $r1=y$; |
| (i2) $y=1$; | (i4) $r2=x$; |
| SC Forbids: $r1=1, r2=0$ |  |

```
Axiom "Read_Values":
Every load either reads BeforeAllWrites OR reads FromLatestWrite
```

Property to check:
mapNode(Ld $x \rightarrow$ St $x$, Ld $x == 0$) or mapNode(St $x \rightarrow$ Ld $x$, Ld $x == 1$);
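
The difference between the filtered property and the load-value-constrained property can be sketched on a toy trace. Everything here is an assumption made for illustration: the `Trace` record, the cycle numbers (store at cycle 3, load at cycle 6), and both checker functions. It does not model mapNode or real SVA semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    store_cycle: int  # cycle at which St x happens
    load_cycle: int   # cycle at which Ld x happens
    load_value: int   # value returned by Ld x


def filtered_looks_violated(trace: Trace, now: int) -> bool:
    """Filtered Read_Values judged using only cycles up to `now`."""
    store_done = trace.store_cycle <= now
    load_done = trace.load_cycle <= now
    if load_done and trace.load_value != 0:
        return False  # the "unless load returns non-zero" clause applies
    if load_done:
        return trace.store_cycle < trace.load_cycle
    # Load not seen yet: a store that already happened looks like a violation.
    return store_done


def load_value_property(trace: Trace) -> bool:
    """Both cases kept, each tagged with its load value constraint."""
    ld_before_st = trace.load_cycle < trace.store_cycle
    st_before_ld = trace.store_cycle < trace.load_cycle
    return (ld_before_st and trace.load_value == 0) or (
        st_before_ld and trace.load_value == 1
    )


good = Trace(store_cycle=3, load_cycle=6, load_value=1)
flag_at_3 = filtered_looks_violated(good, now=3)  # spurious violation
flag_at_6 = filtered_looks_violated(good, now=6)  # resolved only in hindsight
holds = load_value_property(good)

bad = Trace(store_cycle=3, load_cycle=6, load_value=0)
bad_holds = load_value_property(bad)

print(flag_at_3, flag_at_6, holds, bad_holds)
```

In this sketch the constrained property never needs knowledge of future cycles: each disjunct pairs an ordering with the load value that justifies it, so a legal trace satisfies one case and a stale read satisfies neither.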

## Multi-V-scale: a Multicore Case Study

[Figure: Multi-V-scale block diagram; visible label: Memory]



## Bug Discovered in V-scale

- V-scale memory internally writes stores to wdata register
- wdata pushed to memory when subsequent store occurs
- Akin to single-entry store buffer
- When two stores are sent to memory in successive cycles, first of two stores is dropped by memory!
- Fixed bug by eliminating wdata
- V-scale has since been deprecated by RISC-V Foundation

[Figure labels: Core 0, Core 1, Core 2, Core 3, Arbiter]


## RTLCheck Takeaways

- Microarchitectural models must be validated against RTL
- RTLCheck: Automated translation of microarch. axioms into equivalent temporal SVA properties for litmus test suites
- Translation is complicated by the axiomatic-temporal mismatch
- JasperGold was able to prove 90% of properties/test in 11 hours runtime
- Last piece of the Check suite; now have tools at all levels of the stack!


## Conclusion

High-Level Languages (HLL)

Compiler

Architecture (ISA)

Microarchitecture

Processor RTL

- The Check suite provides automated full-stack MCM checking of implementations
- Litmus-test based verification to concentrate on error-prone cases
- Can check:
- Implementation of HLL requirements
- Virtual memory implementation
- HLL Compiler mappings
- Microarchitectural Orderings (including coherence)
- and even RTL (Verilog)!
- All tools are open-source and publicly available!


## With Thanks to...

- Collaborators:
- Margaret Martonosi
- Daniel Lustig
- Caroline Trippel
- Michael Pellauer
- Aarti Gupta
- Funding:
- Princeton Wallace Memorial Honorific Fellowship
- STARnet C-FAR (Center for Future Architectures Research)
- JUMP ADA Center (Applications Driving Architectures)
- National Science Foundation


## Questions?

## http://www.cs.princeton.edu/~manerkar

- Yatin A. Manerkar, Daniel Lustig, Margaret Martonosi, and Michael Pellauer. RTLCheck: Verifying the Memory Consistency of RTL Designs. The 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), October 2017.
- Yatin A. Manerkar, Caroline Trippel, Daniel Lustig, Michael Pellauer, and Margaret Martonosi. Counterexamples and Proof Loophole for the C/C++ to POWER and ARMv7 Trailing-Sync Compiler Mappings. CoRR abs/1611.01507, November 2016.
- Caroline Trippel, Yatin A. Manerkar, Daniel Lustig, Michael Pellauer, and Margaret Martonosi. TriCheck: Memory Model Verification at the Trisection of Software, Hardware, and ISA. The 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2017.
- Yatin A. Manerkar, Daniel Lustig, Michael Pellauer, and Margaret Martonosi. CCICheck: Using $\mu \mathrm{hb}$ Graphs to Verify the Coherence-Consistency Interface. The 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2015.


## http://check.cs.princeton.edu/

## Coherence and Consistency

- Most coherence protocols are not that simple!
- Partial incoherence (e.g. GPUs) [Wickerson et al. OOPSLA 2016]
- Lazy coherence (e.g. TSO-CC) [Elver and Nagarajan HPCA 2014]
- CCI: Coherence-Consistency Interface



Coherence and consistency often interwoven



## Issue with Draft RISC-V MCM: Cumulativity

- Consider this litmus test variant (WRC):
- C11 atomics can specify memory orderings: REL = release, ACQ = acquire

| Thread 0 | Thread 1 | Thread 2 |
| :---: | :---: | :---: |
| St (x, 1, REL) | r0 = Ld (x, ACQ) | r1 = Ld (y, ACQ) |
|  | St (y, 1, REL) | r2 = Ld (x, ACQ) |
| Forbidden by C11: r0 = 1, r1 = 1, r2 = 0 |  |  |

- RISC-V lacked cumulative fences to enforce this ordering:
- (x5 and x6 contain addresses of x and y)

| Core 0 | Core 1 | Core 2 |
| :--- | :--- | :--- |
| sw x1, (x5) | lw x2, (x5) | lw x3, (x6) |
|  | fence r, rw | fence r, rw |
|  | fence rw, w | lw x4, (x5) |
|  | sw x2, (x6) |  |

## ARMv7/Power Trailing-Sync Counterexample

- Consider this litmus test variant (IRIW):
- Total order over all SC atomic accesses is required

| Thread 0 | Thread 1 | Thread 2 | Thread 3 |
| :---: | :---: | :---: | :---: |
| St ( $\mathrm{x}, 1, \mathrm{SC}$ ) | St ( $\mathrm{y}, 1, \mathrm{SC}$ ) | $\mathrm{r} 0=\mathrm{Ld}(\mathrm{x}, \mathrm{ACQ})$ | $\mathrm{r} 2=\mathrm{Ld}(\mathrm{y}, \mathrm{ACQ})$ |
|  |  | r 1 = Ld ( $\mathrm{y}, \mathrm{SC})$ | r3 = Ld ( $\mathrm{x}, \mathrm{SC}$ ) |
| Forbidden by C11: $\mathrm{r} 0=1, \mathrm{r} 1=0, \mathrm{r} 2=1, r 3=0$ |  |  |  |

- With the trailing-sync mapping, this compiles to the following:
- Allowed on Power [Sarkar et al. PLDI 2011] and ARMv7 [Alglave et al. TOPLAS 2014]

| Core 0 | Core 1 | Core 2 | Core 3 |
| :---: | :---: | :---: | :---: |
| str 1, [x] | str 1, [y] | ldr r1, [x] | ldr r3, [y] |
|  |  | ctrlisb/ctrlisync | ctrlisb/ctrlisync |
|  |  | ldr r2, [y] | ldr r4, [x] |
| Allowed by Power/ARMv7: $r 1=1, r 2=0, r 3=1, r 4=0$ |  |  |  |

## ARMv7/Power Trailing-Sync Counterexample

- Consider this litmus test variant (IRIW):
- Total order over all SC atomic accesses is required

| Thread 0 | Thread 1 | Thread 2 | Thread 3 |
| :---: | :---: | :---: | :---: |
| St ( $\mathrm{x}, 1, \mathrm{SC}$ ) | St ( $\mathrm{y}, 1, \mathrm{SC}$ ) | r0 = Ld ( $\mathrm{x}, \mathrm{ACQ}$ ) | $\mathrm{r} 2=\operatorname{Ld}(\mathrm{y}, \mathrm{ACQ})$ |
|  |  | $\mathrm{r} 1=\mathrm{Ld}(\mathrm{y}, \mathrm{SC})$ | r3 = Ld ( $\mathrm{x}, \mathrm{SC}$ ) |
| Forbidden by C11: $\mathrm{r} 0=1, \mathrm{r} 1=0, \mathrm{r} 2=1, \mathrm{r} 3=0$ |  |  |  |

- SC total order must respect happens-before i.e. (sb U sw)+ a:Wna $x=0$




## ARMv7/Power Trailing-Sync Counterexample

- Consider this litmus test variant (IRIW):
- Total order over all SC atomic accesses is required

| Thread 0 | Thread 1 | Thread 2 | Thread 3 |
| :---: | :---: | :---: | :---: |
| St ( $\mathrm{x}, 1, \mathrm{SC}$ ) | St ( $\mathrm{y}, 1, \mathrm{SC}$ ) | $\mathrm{r} 0=\mathrm{Ld}(\mathrm{x}, \mathrm{ACQ})$ | $\mathrm{r} 2=\mathrm{Ld}(\mathrm{y}, \mathrm{ACQ})$ |
|  |  | $\mathrm{r} 1=\mathrm{Ld}(\mathrm{y}, \mathrm{SC})$ | r3 = Ld ( $\mathrm{x}, \mathrm{SC}$ ) |
| Forbidden by C11: $\mathrm{r} 0=1, \mathrm{r} 1=0, \mathrm{r} 2=1, \mathrm{r} 3=0$ |  |  |  |

- SC total order must respect happens-before i.e. (sb U sw)+ a:Wna $x=0$



## ARMv7/Power Trailing-Sync Counterexample

- Consider this litmus test variant (IRIW):
- Total order over all SC atomic accesses is required

| Thread 0 | Thread 1 | Thread 2 | Thread 3 |
| :---: | :---: | :---: | :---: |
| St ( $\mathrm{x}, 1, \mathrm{SC}$ ) | St ( $\mathrm{y}, 1, \mathrm{SC}$ ) | $\mathrm{r} 0=\mathrm{Ld}(\mathrm{x}, \mathrm{ACQ})$ | $\mathrm{r} 2=\mathrm{Ld}(\mathrm{y}, \mathrm{ACQ})$ |
|  |  | $\mathrm{r} 1=\mathrm{Ld}(\mathrm{y}, \mathrm{SC})$ | r3 = Ld ( $\mathrm{x}, \mathrm{SC}$ ) |
| Forbidden by C11: $\mathrm{r} 0=1, \mathrm{r} 1=0, \mathrm{r} 2=1, \mathrm{r} 3=0$ |  |  |  |

- SC total order must respect happens-before i.e. (sb U sw)+ a:Wna $x=0$



## ARMv7/Power Trailing-Sync Counterexample

- Consider this litmus test variant (IRIW):
- Total order over all SC atomic accesses is required

| Thread 0 | Thread 1 | Thread 2 | Thread 3 |
| :---: | :---: | :---: | :---: |
| St ( $\mathrm{x}, 1, \mathrm{SC}$ ) | St (y, 1, SC) | $\mathrm{r} 0=\mathrm{Ld}(\mathrm{x}, \mathrm{ACQ})$ | $\mathrm{r} 2=\operatorname{Ld}(\mathrm{y}, \mathrm{ACQ})$ |
|  |  | r1 = Ld ( $\mathrm{y}, \mathrm{SC}$ ) | r3 = Ld ( $\mathrm{x}, \mathrm{SC}$ ) |
| Forbidden by C11: $\mathrm{r} 0=1, \mathrm{r} 1=0, \mathrm{r} 2=1, \mathrm{r} 3=0$ |  |  |  |

- SC total order must respect happens-before i.e. (sb U sw)+

[Generated with CPPMEM from Cambridge]

$c: W s c x=1$

$d: W s c y=1$

e:Raca $x=1$



## ARMv7/Power Trailing-Sync Counterexample

- Consider this litmus test variant (IRIW):
- Total order over all SC atomic accesses is required

| Thread 0 | Thread 1 | Thread 2 | Thread 3 |
| :---: | :---: | :---: | :---: |
| St (x, 1, SC) | St (y, 1, SC) | $\mathrm{r} 0=\mathrm{Ld}(\mathrm{x}, \mathrm{ACQ})$ | r2 = Ld ( $y$, ACQ) |
|  |  | $\mathrm{r} 1=\mathrm{Ld}$ ( $\mathrm{y}, \mathrm{SC})$ | r3 = Ld ( $\mathrm{x}, \mathrm{SC}$ ) |
| Forbidden by C11: $\mathrm{r} 0=1, \mathrm{r} 1=0, \mathrm{r} 2=1, \mathrm{r} 3=0$ |  |  |  |

- SC reads must be before later SC writes
$a: W n a x=0$


[Generated with CPPMEM from Cambridge]

- Cycle in the SC order implies outcome is forbidden
- But compiled code allows the behaviour!
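
The cycle argument can be sketched as a small graph check. This is only an illustration under stated assumptions: the event labels c, d and e follow the CPPMEM graph, while f, g and h are assumed labels for the remaining loads, and the four ordering constraints are written out by hand rather than derived from the C11 model.

```python
# Events of the forbidden IRIW outcome (r0=1, r1=0, r2=1, r3=0).
# c: Wsc x=1, d: Wsc y=1, e: Racq x=1 (labels from the CPPMEM graph).
# f: Rsc y=0, g: Racq y=1, h: Rsc x=0 (assumed labels, for illustration).
SC_EVENTS = {"c", "d", "f", "h"}

# Hand-written constraints on the SC total order: (earlier, later) -> reason.
CONSTRAINTS = {
    ("c", "f"): "happens-before: c -sw-> e -sb-> f",
    ("f", "d"): "f reads y=0, so it precedes the later SC write d",
    ("d", "h"): "happens-before: d -sw-> g -sb-> h",
    ("h", "c"): "h reads x=0, so it precedes the later SC write c",
}


def find_cycle(nodes, edges):
    """Return one cycle as a list of nodes, or None if the graph is acyclic."""
    succ = {n: [] for n in nodes}
    for src, dst in edges:
        succ[src].append(dst)
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = on DFS stack, 2 = done
    stack = []

    def dfs(n):
        state[n] = 1
        stack.append(n)
        for m in succ[n]:
            if state[m] == 1:  # back edge closes a cycle
                return stack[stack.index(m):] + [m]
            if state[m] == 0:
                found = dfs(m)
                if found:
                    return found
        stack.pop()
        state[n] = 2
        return None

    for n in sorted(nodes):
        if state[n] == 0:
            found = dfs(n)
            if found:
                return found
    return None


cycle = find_cycle(SC_EVENTS, CONSTRAINTS)
print("forbidden by C11" if cycle else "no cycle found", cycle)
```

No total order can satisfy a cyclic set of constraints, which is why C11 forbids the outcome; the check says nothing about what the compiled ARMv7/Power code allows.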



## What went wrong?

- It was thought that program order and coherence edges directly between SC accesses were all that needed enforcing [Batty et al. POPL 2012]
- But hb edges can arise between SC accesses through the transitive composition of edges to and from a non-SC intermediate access
- Occurs in IRIW counterexample:
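
A minimal sketch of how such edges arise, assuming the same IRIW events (labels f, g and h are assumed; only c, d and e come from the CPPMEM graph). It computes hb as the transitive closure of sb and sw, then separates hb edges that directly connect two SC accesses from those that only appear through a non-SC intermediate access.

```python
from itertools import product

# sb (sequenced-before) and sw (synchronizes-with) edges for the IRIW variant.
# e and g are the acquire loads, i.e. the non-SC intermediate accesses.
sb = {("e", "f"), ("g", "h")}
sw = {("c", "e"), ("d", "g")}
SC = {"c", "d", "f", "h"}  # SC accesses: both stores and the second load of each reader


def transitive_closure(edges):
    """Naive fixed-point computation of the transitive closure."""
    closure = set(edges)
    while True:
        new = {(a, d) for (a, b), (c, d) in product(closure, closure) if b == c}
        if new <= closure:
            return closure
        closure |= new


hb = transitive_closure(sb | sw)
direct = {(a, b) for (a, b) in (sb | sw) if a in SC and b in SC}
composed = {(a, b) for (a, b) in hb if a in SC and b in SC} - direct

print(sorted(direct))    # []
print(sorted(composed))  # [('c', 'f'), ('d', 'h')]
```

Both hb edges between SC accesses are composed ones, so a mapping that only enforces the direct edges has nothing to enforce here.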






## Assumption Generation

- Need to restrict executions to those of the litmus test
- Three classes of assumptions:
  - Memory initialization
    - Instruction memory and data memory
  - Register initialization
  - Value assumptions
    - Load value assumptions: loads return the correct value (when they occur)
    - Final value assumptions: required final values of memory are respected
- RTLCheck generates SystemVerilog Assumptions to constrain executions
  - Utilises a user-provided program mapping function
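
The three assumption classes can be sketched as a small generator. This is a hedged illustration, not RTLCheck's actual output: the test description format, the `mapping` function, and every signal name (`first_cycle`, `test_done`, `dmem_x`, `core1_ld0_commit`, ...) are hypothetical, the final memory values are illustrative, and instruction memory initialization is omitted for brevity. Only the `assume property (@(posedge clk) a |-> b);` form is standard SystemVerilog.

```python
# Hypothetical description of the mp litmus test; field names are illustrative.
MP_TEST = {
    "init_mem": {"x": 0, "y": 0},
    "init_regs": {(1, "r1"): 0, (1, "r2"): 0},
    "load_values": {(1, 0): 1, (1, 1): 0},  # (core, load index): Ld y=1, Ld x=0
    "final_mem": {"x": 1, "y": 1},          # illustrative final values
}


def mapping(kind, key):
    """Hypothetical user-provided mapping from test items to RTL signal names."""
    if kind == "mem":
        return f"dmem_{key}"
    if kind == "reg":
        core, reg = key
        return f"core{core}_{reg}"
    core, idx = key
    return f"core{core}_ld{idx}_{kind}"  # kind is "commit" or "data"


def assume(antecedent, consequent):
    """Format one SystemVerilog assumption as an implication."""
    return f"assume property (@(posedge clk) {antecedent} |-> {consequent});"


def generate_assumptions(test, sig):
    lines = []
    # Memory initialization (data memory only in this sketch).
    for addr, val in sorted(test["init_mem"].items()):
        lines.append(assume("first_cycle", f"{sig('mem', addr)} == {val}"))
    # Register initialization.
    for key, val in sorted(test["init_regs"].items()):
        lines.append(assume("first_cycle", f"{sig('reg', key)} == {val}"))
    # Load value assumptions: constrain the value only when the load occurs.
    for key, val in sorted(test["load_values"].items()):
        lines.append(assume(sig("commit", key), f"{sig('data', key)} == {val}"))
    # Final value assumptions.
    for addr, val in sorted(test["final_mem"].items()):
        lines.append(assume("test_done", f"{sig('mem', addr)} == {val}"))
    return lines


assumptions = generate_assumptions(MP_TEST, mapping)
print("\n".join(assumptions))
```

Writing each load value assumption as an implication reflects the "when they occur" qualifier: the value is constrained only in cycles where the hypothetical commit signal is high.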


## Assumption Generation

- Covering trace: an execution in which the assumption's condition occurs and is enforced
  - E.g. an execution where the load of $x$ returns 0
  - Must obey all assumptions
- Covering a final value assumption == finding a forbidden execution!
  - No covering trace => equivalent to verifying the overall test!
- Quicker verification for some tests
  - Expect benefit to be largest for small designs
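
The idea can be illustrated with a toy search, assuming a sequentially consistent interleaving machine in place of an RTL design (a property verifier explores RTL states, not interleavings). The sketch enumerates every complete execution of mp and keeps those that obey the load value and final value assumptions; an empty result plays the role of "no covering trace". The final memory values used are illustrative.

```python
# mp litmus test: core 0 stores x then y, core 1 loads y then x.
CORE0 = [("st", "x", 1), ("st", "y", 1)]
CORE1 = [("ld", "y", "r1"), ("ld", "x", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that preserves each core's program order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(trace):
    """Execute one interleaving on a single shared memory (all addresses start at 0)."""
    mem, regs = {"x": 0, "y": 0}, {}
    for op, addr, arg in trace:
        if op == "st":
            mem[addr] = arg
        else:
            regs[arg] = mem[addr]
    return mem, regs


def covering_traces(load_vals, final_vals):
    """Complete executions that obey both the load and final value assumptions."""
    found = []
    for trace in interleavings(CORE0, CORE1):
        mem, regs = run(trace)
        loads_ok = all(regs[r] == v for r, v in load_vals.items())
        final_ok = all(mem[a] == v for a, v in final_vals.items())
        if loads_ok and final_ok:
            found.append(trace)
    return found


FINAL = {"x": 1, "y": 1}
forbidden = covering_traces({"r1": 1, "r2": 0}, FINAL)
allowed = covering_traces({"r1": 1, "r2": 1}, FINAL)
print(len(forbidden), len(allowed))  # 0 1
```

For the forbidden outcome (Ld y=1, Ld x=0) the search finds no covering trace, which on this toy machine is the same as showing the outcome cannot occur.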


## The Benefits of Final Value Assumptions

- Why generate final value assumptions if test has no final conditions?
- Answer: Covering traces can lead to faster verification
- These are traces where assumption condition occurs and can be enforced




A covering trace for a final value assumption is a complete execution of the litmus test



A covering trace must also obey the other assumptions, including load value assumptions (for mp, Ld $y=1$ and Ld $x=0$)





## Results: Time to Prove Properties

- Two configurations (Hybrid and Full_Proof), avg. runtime 6.2 hrs
- See paper for configuration details





Complete quickly due to covering traces




## Results: Proven Properties

- Full_Proof generally better (90\%/test) than Hybrid (81\%/test)
- On average, Full_Proof can prove more properties in same time




Hybrid better for only a few tests


