Succinct proofs of risc-v transactions using a black-box riscv zkvm

Here are some thoughts about how to prove transactions using risc0/sp1 as a black box.

Motivation

First, I think a good way to conceptualize what we’re trying to do is to think of the zkvm as an actual remote hardware device, that’s not under our control, but can be completely trusted to run according to its specification. We can’t modify the hardware and software on the vm device, but if the specification says “return the output of the running program”, then we can trust that this is what is returned.

The reason we want to design a proof system in this way is the same reason we’d rather write code for a general-purpose CPU as opposed to designing an FPGA — it’s much easier to do and more maintainable, it’s less dependent on a specific zk project, and the set of developers who have the required expertise is much larger. The cost is some efficiency, which will hopefully not be too large.

The reason I feel we should think about this now, even though we don’t intend to support ZK proofs in our initial version, is twofold:

  1. Some proof methods might only be feasible if our contract risc-v code obeys certain restrictions. This means that in order to make proving easy in the future, we might want to restrict the set of programs we allow in the initial version of our (non-ZK) VM. (TL;DR: it might make sense to support only rv32em — a subset of rv32im that has 16 registers rather than 32.)
  2. We might want to collaborate with one (or more) of the risc-v zkvm projects to add features that would help us.

Black-box model

This is how I understand the bare-bones interface provided by risc0/sp1. In the description, Alice is the user and VM is the remote system:

  • Alice sends the VM an executable, consisting of risc-v machine code and an initial input. The machine code consists of rv32im instructions, and allows the following system calls:

    • Read from host
    • Write to host
    • Output
    • Crypto accelerators (e.g., run sha256 or poseidon hash)
  • The VM loads the machine code into memory and executes it.

    • Whenever it encounters a “read from host” syscall, it sends a message to Alice, and waits for her to send data. Once she sends the data, it writes it into the VM memory (in some well-defined location) and continues executing the code.

    • Whenever it encounters a “write to host” syscall, it sends the data to Alice.

    • Whenever it encounters an “output” syscall, it outputs the data to the “proof journal”.

    • The program terminates when it encounters an error (e.g., some CPU exception) or a halt command.

    • When the program terminates, the VM “signs” a statement and sends it (with its signature) to Alice (this corresponds to the statement that is proven by the ZK proof). Let X be the hash of the program Alice sent, o_1,\dotsc,o_n the sequence of output values it produced using the “output” system call, and r the termination reason (e.g. halt, div-by-zero, memory exception, etc.). The statement signed by the VM is:

      • "I executed a program with hash X, for which I know there exist inputs such that the resulting execution outputs were o_1,\dotsc,o_n, and the program ended with reason r.

Note that the actual inputs don’t appear as part of the signed statement. This is important so that recursive proofs don’t grow with the level of recursion. If we do want the proof statement to include the inputs (or, for example, a hash of the inputs), we can always add code to the program that copies the inputs to the output (or hashes them and outputs the hash).
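
To pin down what the signed statement contains, here is a minimal sketch of it as a data type. This is illustrative only — the names (ProvenStatement, ExitReason) and the exact fields are my assumptions, not the actual risc0/sp1 receipt/journal format:

```rust
/// Illustrative sketch of the statement the VM "signs"; the real risc0/sp1
/// receipt format differs, this just captures the informational content.
pub struct ProvenStatement {
    /// X: the hash ("image id") of the program Alice sent.
    pub program_hash: [u8; 32],
    /// o_1, ..., o_n: the values produced by the "output" syscall, in order.
    pub outputs: Vec<Vec<u8>>,
    /// r: why execution terminated.
    pub exit_reason: ExitReason,
    // Note: the inputs Alice streamed in via "read from host" are
    // deliberately NOT part of the statement.
}

pub enum ExitReason {
    Halt,
    MemoryFault,
    OtherException,
    // With gas metering (see the open question below), we would also want
    // something like: RanOutOfCycles { limit: u64 },
}
```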

Open Questions

There are some critical questions about how exactly the black-box model works that will greatly affect proving strategies and efficiency. Also, ideally, we would want the VM to provide a few extra services, which I don’t think the current risc0/sp1 provide (but I’m putting this in “open questions” anyway).

  • Access to program code: the proof strategy I outline below assumes the program code cannot be modified by the program itself during execution. Moreover, things become much nicer if the program code can’t even be read by the program itself (this allows us to transform programs easily before sending them to the VM, while guaranteeing that their input/output behavior is unchanged)

    I think the risc-v spec leaves it to the system to define which areas of memory can be accessed by a load/store, and I’m not sure what risc0/sp1 do here.

  • Gas metering: our transaction model (and many others, of course) requires more than just the input/output behavior of the program – it needs to know how many “virtual cycles” the program took to run (virtual cycles should be a measure of how hard the program is to prove, rather than simply to execute). Moreover, if the program takes too many cycles, we need the VM statement to say something like “the program failed because it ran more than z cycles”. risc-v actually defines a special register that counts the number of cycles executed so far, and opcodes to read this register (the “Zicntr” extension), but I think these aren’t implemented by risc0/sp1 (we would probably need an implementation that uses an external “gas table”, since we want to control the actual instruction costs).
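
    For reference, reading the Zicntr cycle counter from guest code would look something like the sketch below (Rust inline asm on a riscv32 target); whether risc0/sp1 expose this CSR at all — and whether raw cycles are the right notion of “virtual cycles” — is exactly the open question here.

```rust
/// Read the low 32 bits of the cycle counter via the Zicntr `rdcycle`
/// pseudo-instruction. Only meaningful if the zkvm actually implements
/// this CSR; as noted above, risc0/sp1 apparently don't.
#[cfg(target_arch = "riscv32")]
fn read_cycle_counter() -> u32 {
    let cycles: u32;
    unsafe {
        core::arch::asm!("rdcycle {0}", out(reg) cycles);
    }
    cycles
}
```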

Proposed Proof Technique

Overview

What we actually want to prove are statements that roughly have the following forms:

  • First-level (Tx): This proof is for a tuple (X_0,T,X_1), where X_0 and X_1 are state hashes and T is a transaction hash. The statement is:

    Starting with state-hash X_0, if we run a transaction with hash T then the state-hash after the execution is X_1.

    (There will actually be additional information in the tuple, such as method return values and gas usage, but we’ll ignore that for now, since it’s not relevant for the high-level description)

  • Second-level (Block): This proof is for a tuple (X_0,B,X_n) where X_0 and X_n are state hashes and B is a block hash. The statement is:

    We know a chain of first-level proofs for some set of tuples of the form (X_0,T_1,X_1),\dotsc,(X_{n-1},T_n,X_n), such that H(T_1||\dotsb||T_n)=B and all the proofs in the chain are valid.

    Note that the values of the chain and the proofs are not part of the statement.

  • Third-level (Blockmesh): This proof is for a tuple (L,X), where L is a cumulative layer hash and X a state hash. The statement is:

    • Either L=\bot and X=X_{\text{genesis}} or
    • We know a cumulative layer hash L', a block-hash B, a third-level proof for a tuple (L',X') and a second-level proof for tuple (X', B, X), such that H(L'||B)=L and both proofs are valid.

    Note that the values of L', B and X', and the proofs are not part of the statement.

The third-level proofs of the sort above are inherently sequential. We can make the recursion parallelizable by supporting a generic “segment” proof:

  • Segment: This proof is for a tuple (L_0,X_0, L_n, X_n) where L_0,L_n are cumulative layer hashes and X_0,X_n are state hashes. The statement is that we know at least one of the following:
    • Either a sequence of hashes L_1,\dotsc,L_{n-1} and a chain of second-level proofs for tuples (X_0,B_1,X_1),\dotsc,(X_{n-1},B_n,X_n) such that for i\in\{1,\dotsc,n\} it holds that L_i = H(L_{i-1}||B_i) and all the proofs are valid OR
    • a sequence of segment proofs for tuples (L_0,X_0,L_1,X_1),\dotsc,(L_{n-1},X_{n-1},L_n,X_n) and all the proofs are valid.

The segment proof for a tuple (\bot,X_{\text{genesis}},L,X) is equivalent (in terms of what it’s claiming) to the third-level proof for the tuple (L,X), but each segment in a chain can be proved in parallel.
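
For concreteness, here is how the public tuples for the various proof levels could be written down as types. This is a sketch only — Hash stands in for whatever digest the proof system commits to, and the field names are made up:

```rust
/// Illustrative only: the public tuples the different proof levels commit to.
type Hash = [u8; 32];

/// First-level (Tx): starting from state X_0, transaction T leads to state X_1.
struct TxStatement { x0: Hash, t: Hash, x1: Hash }

/// Second-level (Block): a chain of Tx proofs whose transaction hashes
/// fold into the block hash B.
struct BlockStatement { x0: Hash, b: Hash, xn: Hash }

/// Segment: covers a range of blocks between two cumulative layer hashes.
/// The third-level (Blockmesh) statement for (L, X) is the special case
/// where l0 is the "bottom" value and x0 is the genesis state hash.
struct SegmentStatement { l0: Hash, x0: Hash, ln: Hash, xn: Hash }
```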

Implementing recursive proofs

The recursive proofs are actually the easiest, conceptually. We just implement a fixed riscv program that accepts the “we know” witnesses as inputs, runs the zkvm verifier code to check the proofs and outputs the proof tuple (ideally, the zkvm will have precompiles that make verifying its own proofs reasonably efficient).
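A sketch of what that fixed program could look like for the segment case, combining two adjacent child segments into one (the branch that folds in second-level block proofs is analogous). The zkvm interface is abstracted behind hypothetical stand-ins — zkvm_read_witness, zkvm_verify_proof, zkvm_output — because the real risc0/sp1 guest APIs for proof composition look different:

```rust
// Hypothetical stand-ins for the zkvm guest interface (NOT a real API):
// read private witness data from the host, verify that a program with a
// given image id produced a given journal, and append to the public journal.
type Hash = [u8; 32];

struct SegmentStatement { l0: Hash, x0: Hash, ln: Hash, xn: Hash }

fn zkvm_read_witness<T>() -> T { unimplemented!() }
fn zkvm_verify_proof(_image_id: Hash, _journal: &SegmentStatement, _proof: &[u8]) { unimplemented!() }
fn zkvm_output(_journal: &SegmentStatement) { unimplemented!() }

/// Guest program proving a segment from two child segment proofs.
/// (In practice the program must know its own image id, e.g. baked in as a
/// constant, rather than trusting a value supplied by the host.)
fn prove_segment_from_children(segment_image_id: Hash) {
    // Witness: two adjacent child segments plus their proofs (kept private).
    let (left, left_proof): (SegmentStatement, Vec<u8>) = zkvm_read_witness();
    let (right, right_proof): (SegmentStatement, Vec<u8>) = zkvm_read_witness();

    // The children must chain: left must end where right begins.
    assert_eq!(left.ln, right.l0);
    assert_eq!(left.xn, right.x0);

    // Both children must be valid proofs of this very segment program.
    zkvm_verify_proof(segment_image_id, &left, &left_proof);
    zkvm_verify_proof(segment_image_id, &right, &right_proof);

    // Only the combined tuple is committed; the witnesses and proofs stay private.
    zkvm_output(&SegmentStatement { l0: left.l0, x0: left.x0, ln: right.ln, xn: right.xn });
}
```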

Implementing transaction proofs

We can always take our full (non-zk) validator client, which emulates riscv internally to run transaction code, compile that client to riscv, and then run the entire validator client in the VM with the contract code and the inputs as its input. This is essentially what the EVM-over-riscv solutions are doing. However, this has a huge overhead – every opcode in the contract method translates to the many thousands of opcodes required to interpret it.

Here’s a proposal for how to efficiently prove transaction execution using a black-box model of the type above (efficient here means low overhead compared to running the raw riscv contract code in the VM).

I’m assuming that text/code segments (i.e., program code) can’t be written or read by the program itself – otherwise things become a lot hairier. (My intuition is that adding this type of memory protection to a zkvm that doesn’t support it shouldn’t be too hard — essentially it needs a separate memory area for fetches as opposed to loads.)

High-level design

Execution model outline (non-proving)

As far as the account system is concerned, execution happens in something like the following model (the account system doesn’t know about proofs internally):

  • Contract code (as committed to the chain) is “raw” riscv code, but with system calls defined by our account model API (not the prover VM syscalls). System calls use the risc-v ECALL opcode.

  • Each method is executed in its own “pristine” environment, with all the RAM initialized to zero (except for some predefined address that contains the initial input) and all registers set to a known initial configuration.

  • We haven’t fully designed our account API, but for the sake of this example let’s say each account has a key/value store, contract methods can only directly read/write their own account data, and accessing data from other accounts is via cross-contract calls.

    Here are some example system calls for this API (some of these might only be available as a separate transaction, not within a contract method); a rough Rust sketch of this interface appears after the list:

    • Deploy a new contract template
    • Create a new account
    • Read state object (given a key, write the contents of the corresponding value into RAM)
    • Write state object (given a key and a pointer to RAM, write the value from RAM into the key/value store for the account)
    • Call account method (given a target account id, target method id and input values, execute that method — in its own context – and return the result)
  • For the purposes of this discussion, I’m ignoring how transactions specify which contract methods to run and how authorization works – e.g., think of authorization as part of the trusted entry code for the method.
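
Here is the rough Rust sketch of the example API promised above — purely to pin down its shape. Every name is a placeholder, and at the machine-code level each variant would correspond to an ECALL with some syscall number and register-passed arguments:

```rust
// Placeholder sketch of the account-model API surface described above.
type AccountId = [u8; 32];
type TemplateId = [u8; 32];

enum AccountSyscall {
    /// Deploy a new contract template (possibly only as its own transaction).
    DeployTemplate { code: Vec<u8> },
    /// Create a new account from a template (possibly only as its own transaction).
    CreateAccount { template: TemplateId },
    /// Read a value from this account's key/value store into RAM.
    ReadState { key: Vec<u8> },
    /// Write a value from RAM into this account's key/value store.
    WriteState { key: Vec<u8>, value: Vec<u8> },
    /// Call a method on another account, executed in its own context.
    CallMethod { account: AccountId, method: u32, input: Vec<u8> },
}
```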

Proving execution

The general structure for proving transaction execution is as follows:

  • Code transformation: The “raw” contract code is transformed (using a simple, deterministic algorithm) into “zkvm-able” code. This entails adding additional trusted “library code” to the executable, and replacing account API system calls with standard risc-v calls to the library functions (the details of how this happens are below).
  • VM Execution: The zkvm-able code is sent to the VM and executed there with the (transformed) transaction inputs. The translated account model API syscalls require interactive communication with the VM, which is handled by the host (our validator client).
  • Output and Proof: The output is parsed to get the gas usage, new state information, and the ZK proof statement.

I’ll give examples for transformations (these are optimized more for clarity than efficiency – it’s possible to do things more efficiently, but the point here is the technique).

  • Entry code / additional inputs for standard contract calls (deploying a template is a separate issue):

    • Every contract method is given as input not only the global state hash X, but also the account state hash Y and a merkle path from Y to the global hash. The entry code verifies the merkle path and stores Y in a location that is inaccessible to the raw contract code (one possible way to do this is to store it in registers that are guaranteed not to be used by the contract code – for example, if we restrict contracts to rv32em instructions, then there are 16 registers guaranteed not to be used). The entry code outputs the initial global state hash (the first value of the (X,T,X') proof tuple).
  • Read state object. This system call is translated to a library call that sends a request for object key to the host. The host will return the object data, along with a merkle path to the account state hash Y (or a path showing the key doesn’t exist). The library function will then verify the merkle path, and if it is valid, return the object data (a minimal sketch of this library routine appears after this list).

  • Write state object. This system call is translated to a library call that sends the host a request to write value val under object key. The host will return a new account state root Y', along with a merkle path from the original value stored under key to the current account state root Y. The library function will check that updating only that single value and recomputing the hashes along the merkle path results in the account state root Y'. If so, it updates the register containing the current account state hash.

  • Cross-contract call: this system call is replaced by a library function that does the following:

    • Compute the updated global state hash X' (using the updated local account state hash and the merkle path from Y to the global hash). Send this updated hash to the host, along with a request to call another contract (and the relevant parameters for the call).
    • Receive from the host the further-updated state hash X'' resulting from the cross-contract call, and a new merkle path from the local account hash Y' to X''. In addition, the host must send a zkvm proof for the tuple (X',T',X''), where T' is the transaction equivalent of the cross-contract call.

    This is extremely expensive since it requires a recursive verification. For a subset of methods, which we can syntactically verify do not make any state changes or writes to RAM (we just check that the machine code doesn’t contain store opcodes or write-object system calls; see the opcode-scan sketch after this list), I think we can just run the method call directly, without changing contexts or requiring recursive proofs.

  • Exit code: a “halt” opcode in the original contract program is also transformed into a library call, this time to trusted exit code. The exit code uses the merkle path from the original account state hash to the global state hash in order to compute the updated global state hash from Y', the current account state hash (stored in a register). It outputs the updated global state hash (it will also output the return value from the method, as well as the gas usage).

  • Deployment: deploying a contract is special; this would run our own predefined code that checks the machine code of the deployed contract (e.g., makes sure it only uses rv32e opcodes and valid system calls). It can also run the transformation algorithm and generate the image id for the transformed code. We can think of the id as part of the template state. This way, we don’t have to run the transformation for every transaction – the entry code can just check that its image id matches the id that its corresponding template has stored.
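
Here is the minimal sketch of the “read state object” library routine referenced above: ask the host for the value plus a merkle path, recompute the path up to the account state hash Y held in a reserved register, and only return the value if it checks out. The hash functions, the host I/O helpers, and the path encoding are all placeholders (a real implementation would also derive the leaf position from the key rather than trusting direction bits from the host):

```rust
// Sketch of the trusted library routine that replaces the "read state object"
// syscall. All helpers below are placeholders:
//   host_write / host_read : the zkvm's write-to-host / read-from-host syscalls
//   hash_leaf / hash_node  : the merkle hashing (ideally a crypto precompile)
//   current_account_root   : reads Y from the reserved register
type Hash = [u8; 32];

struct MerkleStep { sibling: Hash, node_is_right_child: bool }

fn host_write(_data: &[u8]) { unimplemented!() }
fn host_read<T>() -> T { unimplemented!() }
fn hash_leaf(_key: &[u8], _value: Option<&[u8]>) -> Hash { unimplemented!() }
fn hash_node(_left: &Hash, _right: &Hash) -> Hash { unimplemented!() }
fn current_account_root() -> Hash { unimplemented!() }

fn read_state_object(key: &[u8]) -> Option<Vec<u8>> {
    // Ask the host for the value stored under `key` in this account,
    // together with a merkle path up to the account state root Y.
    host_write(key);
    let (value, path): (Option<Vec<u8>>, Vec<MerkleStep>) = host_read();

    // Recompute the root from the (key, value) leaf and the supplied path.
    // A `None` value corresponds to a non-membership path.
    let mut node = hash_leaf(key, value.as_deref());
    for step in &path {
        node = if step.node_is_right_child {
            hash_node(&step.sibling, &node)
        } else {
            hash_node(&node, &step.sibling)
        };
    }

    // The recomputed root must equal the current account state hash Y,
    // which lives in a register the raw contract code cannot touch.
    assert_eq!(node, current_account_root());
    value
}
```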
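
And the opcode-scan sketch referenced in the cross-contract-call bullet above. This conservative version just rejects any code containing a store or a SYSTEM-opcode instruction; a real checker would decode the full rv32e(m) encoding and would need an ABI rule (e.g. a mandatory li a7, N right before each ECALL) in order to also allow read-only system calls:

```rust
// Very rough syntactic scan over contract machine code (32-bit words).
// Only the 7-bit major opcode field (bits 6:0) is inspected here.
const OPCODE_STORE: u32 = 0b010_0011;  // SB / SH / SW
const OPCODE_SYSTEM: u32 = 0b111_0011; // ECALL / EBREAK / CSR*

/// Returns true if the code contains no stores and no system instructions at
/// all -- such a method can neither touch RAM nor invoke the account API, so
/// a cross-contract call into it could run inline, without a recursive proof.
fn is_trivially_read_only(code_words: &[u32]) -> bool {
    code_words.iter().all(|&insn| {
        let major_opcode = insn & 0x7f;
        major_opcode != OPCODE_STORE && major_opcode != OPCODE_SYSTEM
    })
}
```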

Gas metering without hardware support

If the zkvm doesn’t support gas metering in “hardware”, I think we can still do fairly efficient metering using the code transformation technique. The idea is to use a reserved register to keep a gas counter (again, restricting contracts to rv32e will help here).

(The explanation requires the notion of a basic block: a basic block is a sequence of opcodes with no branches except possibly as its last instruction, and no jump targets except at its start – i.e., if the first opcode is executed then all of them are guaranteed to run.)

For every basic-block in the original contract code, we add a small snippet of code that adds the appropriate value to the gas counter (since the basic block is guaranteed to either all run or not run at all, and we know the gas costs of each opcode, the amount we add to the counter is fixed). In addition, there is a check that the gas counter register is less than the gas limit (held in another reserved register). We add the same snippet after every system call (with a special one for cross-contract calls, where the return value from the call contains the gas expended in the call itself).

If everything can be kept in registers, it seems that the overhead here can be quite low (a few opcodes per basic-block).
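
A sketch of the per-basic-block part of the transformation: total up the block’s static cost from the gas table and prepend a fixed snippet that bumps the reserved gas register and traps past the limit. The Instr representation, the gas table, and the trap mechanism are placeholders for whatever the real transformer uses:

```rust
// Sketch of the gas-metering instrumentation. `Instr` is an abstract decoded
// instruction and `gas_cost` an externally supplied gas table; both are
// placeholders.
enum Instr {
    // Emitted by the transformer, using registers reserved by restricting
    // contracts to rv32e (e.g. x30 = gas counter, x31 = gas limit):
    AddToGasCounter { amount: u32 }, // gas_reg += amount
    TrapIfOverLimit,                 // if gas_reg > limit_reg: jump to out-of-gas exit
    // Original contract instruction, passed through unchanged:
    Original(u32),
}

/// Static gas cost of a single original instruction (the gas table).
fn gas_cost(_insn: u32) -> u32 { unimplemented!() }

/// Instrument one basic block. Because a basic block either runs in full or
/// not at all, its total cost is a constant known at transformation time, so
/// the runtime overhead is just one add and one compare per block.
fn instrument_basic_block(block: &[u32]) -> Vec<Instr> {
    let total: u32 = block.iter().map(|&insn| gas_cost(insn)).sum();
    let mut out = vec![
        Instr::AddToGasCounter { amount: total },
        Instr::TrapIfOverLimit,
    ];
    out.extend(block.iter().map(|&insn| Instr::Original(insn)));
    out
}
```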

A disadvantage of this method, however, is that every time the gas table changes, the image id of the transformed code would change as well – for all contracts. Using a dynamic gas table (where the gas table is given as input) seems like it would be a lot less efficient, but maybe there’s a nice way to do it.


I’m trying to understand how R0 and SP1 control memory access. Unfortunately this isn’t documented in either project.

I don’t see any controls in SP1:

R0 appears to have some controls:

Still working on it.


Heard back from RiscZero:

we don’t implement any segmented memory with different attributes. We did experiment with it early on (we had sections of write-once memory), but this was difficult to make run with other features so we punted on it for the current design. The next version of the circuit is much more flexible and performant, so we may try again to implement something compatible with CHERI


That sounds promising… A fully-fledged memory protection model could allow us to do cross-contract calls much more efficiently, and without requiring a new VM instance.

There’s actually another way to do cross-contract calls without a new VM instance, but I’m not sure how it compares in terms of efficiency to the recursive-proof variant:

Whenever contract A calls contract B, instead of pausing the current VM, and starting execution in a new VM, we call a library function that writes the entire RAM contents to the host, stores a hash of the current RAM state in a reserved register, then erases the entire RAM and runs the contract B code in the same VM. When contract B code is done, the exit code will reload RAM from the host (keeping only the return value from contract B), verify that the RAM hash matches the stored value, and continue execution from where it left off.

Since we need to store the RAM hash from A’s execution somewhere that B’s code can’t read or modify it, in the absence of memory protection we probably need to reserve a register for this too (and limit recursion to a very small constant).
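
For concreteness, the call/return sequence of this RAM-swapping variant could look like the sketch below. All the helpers are placeholders, and in reality this logic would live in the trusted library at the machine-code level (dumping and restoring RAM obviously can’t go through a normal Rust heap); the Rust here is only meant to show the order of operations:

```rust
// Sketch of a RAM-swapping cross-contract call: no memory protection and no
// second VM instance. Every helper below is a placeholder.
type Hash = [u8; 32];

fn host_write(_data: &[u8]) { unimplemented!() }       // "write to host" syscall
fn host_read() -> Vec<u8> { unimplemented!() }         // "read from host" syscall
fn hash_ram(_ram: &[u8]) -> Hash { unimplemented!() }  // hash of the full RAM image
fn dump_ram() -> Vec<u8> { unimplemented!() }          // copy out the caller's RAM
fn restore_ram(_snapshot: &[u8]) { unimplemented!() }  // write it back
fn zero_ram() { unimplemented!() }                     // pristine RAM for the callee
fn run_contract_b(_input: &[u8]) -> Vec<u8> { unimplemented!() }

// The reserved register that B's untrusted code must not be able to touch.
fn set_reserved_ram_hash(_h: Hash) { unimplemented!() }
fn get_reserved_ram_hash() -> Hash { unimplemented!() }

fn cross_contract_call(input: &[u8]) -> Vec<u8> {
    // 1. Ship the caller's RAM to the host, remembering its hash in a
    //    reserved register so B cannot tamper with it.
    let snapshot = dump_ram();
    set_reserved_ram_hash(hash_ram(&snapshot));
    host_write(&snapshot);

    // 2. Run contract B from a pristine RAM image in the same VM instance.
    zero_ram();
    let return_value = run_contract_b(input);

    // 3. Reload the caller's RAM from the host, check it against the stored
    //    hash, and continue where we left off (keeping only B's return value).
    let restored = host_read();
    assert_eq!(hash_ram(&restored), get_reserved_ram_hash());
    restore_ram(&restored);
    return_value
}
```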


Here are some additional resources on CHERI. I agree that it sounds promising but it’s complex and feels like overkill for our purposes, and I imagine it’ll take a long time to be finalized and make its way into implementations (the RISC-V spec is currently being worked on).

I also found this proposal which has already been ratified by RISC-V: Smepmp.pdf - Google Drive

I think this is fine as a starting point. Solana for instance currently limits the cross-contract call frame stack depth to four.


I think the sort of process isolation we’re talking about just isn’t a main goal for the ZK projects. They run a single program, one at a time, in an interpreter. There’s no re-use of containers or modules or anything, no permissioning, no chance of a program getting access to something it’s not supposed to. It’s a fundamentally different runtime environment than a blockchain VM.

And I don’t think it matters from the perspective of a zkVM whether a program can read or write its own code. (Can you explain again why that makes things harder?)

PolkaVM’s approach is actually to run every program in a separate, sandboxed process, a bit like Docker. See the description here:


You’re right, but I think it should be a goal, exactly because it allows proving statements about the execution of untrusted code. This is very useful even in a non-blockchain context, whenever you want to prove the output of a program that has code as input. For example, suppose you have an auction, where each participant sends code that “bids”, and you want to prove the result of the auction. Or if you want to prove that the result of a protocol execution involving several parties (each of which has their own code) has a specific output…

Allowing the code to read itself makes things harder if you want to transform the code before proving it. For programs that don’t read themselves, you can write generic compilers/optimizers/etc. that take an existing binary and produce a new binary that has exactly the same input/output behavior. Again, this can be useful even in a non-blockchain context. For example, you might design an optimizer that takes a risc-v binary that is efficient for running in hardware, and produces a new binary that is more efficient for running in a zkvm. If the program can read itself, there’s no way we can generically guarantee that input/output behavior is preserved, so the output of the zkvm execution doesn’t give you confidence about the output of the non-zkvm execution (to see why it’s not possible, think of a program that hashes itself, and outputs the hash — there’s no way you can modify this program and still get the same output without breaking the collision-resistance of the hash).

Allowing code to write itself is an order of magnitude worse, if you have untrusted code. For example, in the scheme I described above, we have to replace each system call with a wrapper that does something else, and makes various verification checks. If the code can modify itself, it can also modify this wrapper, and skip the verification.


Here’s the actual, complete set of syscalls from a production embedded blockchain VM. We may want a slightly different list but we’ll need to make sure we account for most of these.


This list is what we want to assume the zkvm allows, not what we want our contract API to support. The point is that we need to translate a program that uses our account API (which will probably look a lot more like the list from Stylus that you posted above) into a program that can run on the zkvm.

Of course, many of the account API calls can be treated in the same way (e.g., “account balance” and “storage load” are both instances of “read object from state”, and even things like “get gas price” are essentially the same, albeit reading from a different external state object).
