Here are some thoughts about how to prove transactions using risc0/sp1 as a black box.
Motivation
First, I think a good way to conceptualize what we’re trying to do is to think of the zkvm as an actual remote hardware device, that’s not under our control, but can be completely trusted to run according to its specification. We can’t modify the hardware and software on the vm device, but if the specification says “return the output of the running program”, then we can trust that this is what is returned.
The reason we want to design a proof system in this way is the same reason we’d rather write code for a general-purpose CPU as opposed to designing an FPGA — it’s much easier to do and more maintainable, it’s less dependent on a specific zk project, and the set of developers who have the required expertise is much larger. The cost is some efficiency, which will hopefully not be too large.
The reason I feel we should think about this now, even though we don’t intend to support ZK proofs in our initial version, is twofold:
- Some proof methods might only be feasible if our contract risc-v code obeys certain restrictions. This means that in order to make proving easy in the future, we might want to restrict the set of programs we allow in the initial version of our (non-ZK) VM. (TL;DR, it might make sense to support only rv32em — a subset of rv32im that has 16 registers rather than 32.)
- We might want to collaborate with one (or more) of the risc-v zkvm projects to add features that would help us.
Black-box model
This is how I understand the bare-bones interface provided by risc0/sp1. In the description, Alice is the user and VM is the remote system:
- Alice sends the VM an executable, consisting of risc-v machine code and an initial input. The machine code consists of rv32im instructions, and allows the following system calls:
- Read from host
- Write to host
- Output
- Crypto accelerators (e.g., run sha256 or poseidon hash)
- The VM loads the machine code into memory and executes it.
- Whenever it encounters a “read from host” syscall, it sends a message to Alice, and waits for her to send data. Once she sends the data, it writes it into the VM memory (in some well-defined location) and continues executing the code.
- Whenever it encounters a “write to host” syscall, it sends the data to Alice.
- Whenever it encounters an “output” syscall, it outputs the data to the “proof journal”.
- The program terminates when it encounters an error (e.g., some CPU exception) or a halt command.
- When the program terminates, the VM “signs” a statement and sends it (with its signature) to Alice. (This corresponds to the statement that is proven by the ZK proof.) Let X be the hash of the program Alice sent, o_1,\dotsc,o_n the sequence of output values it produced using the “output” system call, and r the termination reason (e.g. halt, div-by-zero, memory exception, etc.). The statement signed by the VM is:
- “I executed a program with hash X, for which I know there exist inputs such that the resulting execution outputs were o_1,\dotsc,o_n, and the program ended with reason r.”
- Note that the actual inputs don’t appear as part of the signed statement. This is important so that recursive proofs don’t grow with the level of recursion. If we do want the proof statement to include the inputs (or, for example, a hash of the inputs), we can always add code to the program that copies the inputs to the output (or hashes them and outputs the hash).
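As a toy illustration, the signed statement can be modeled as a small record committing to (X, o_1,\dotsc,o_n, r) but not to the inputs; the names and hashing here are hypothetical, not the actual risc0/sp1 receipt format:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class VmStatement:
    program_hash: bytes  # X: hash of the executable Alice sent
    outputs: tuple       # o_1, ..., o_n: values from the "output" syscall
    exit_reason: str     # r: "halt", "div-by-zero", "memory-exception", ...

    def digest(self) -> bytes:
        # The inputs are deliberately absent: only the program hash,
        # the outputs, and the termination reason are committed to.
        h = hashlib.sha256()
        h.update(self.program_hash)
        for o in self.outputs:
            h.update(o)
        h.update(self.exit_reason.encode())
        return h.digest()

stmt = VmStatement(
    program_hash=hashlib.sha256(b"machine code").digest(),
    outputs=(b"o1", b"o2"),
    exit_reason="halt",
)
assert len(stmt.digest()) == 32
```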
Open Questions
There are some critical questions about how exactly the black-box model works, that will greatly affect proving strategies and efficiency. Also, ideally, we would want the VM to provide a few extra services, which I don’t think the current risc0/sp1 provides (but I’m putting this in “open questions” anyway).
- Access to program code: the proof strategy I outline below assumes the program code cannot be modified by the program itself during execution. Moreover, things become much nicer if the program code can’t even be read by the program itself (this allows us to transform programs easily before sending them to the VM, while guaranteeing that their input/output behavior is unchanged). I think the risc-v spec leaves it to the system to define which areas of memory can be accessed by a load/store, and I’m not sure what risc0/sp1 do here.
- Gas metering: our transaction model (and many others, of course) requires more than just the input/output behavior of the program – it needs to know how many “virtual cycles” the program took to run (virtual cycles should be a measure of how hard the program is to prove, rather than to simply execute). Moreover, if the program takes too many cycles, it needs the VM statement to say something like “the program failed because it ran more than z cycles”. risc-v actually defines a special register that counts the number of cycles executed so far, and opcodes to read this register (the “Zicntr” extension), but I think they aren’t implemented by risc0/sp1 (we would probably need an implementation that uses an external “gas table”, since we want to control the actual instruction costs).
Proposed Proof Technique
Overview
What we actually want to prove are statements that roughly have the following forms:
- First-level (Tx): This proof is for a tuple (X_0,T,X_1), where X_0 and X_1 are state hashes and T is a transaction hash. The statement is:
Starting with state-hash X_0, if we run a transaction with hash T then the state-hash after the execution is X_1.
(There will actually be additional information in the tuple, such as method return values and gas usage, but we’ll ignore that for now, since it’s not relevant for the high-level description)
- Second-level (Block): This proof is for a tuple (X_0,B,X_n) where X_0 and X_n are state hashes and B is a block hash. The statement is:
We know a chain of first-level proofs for some set of tuples of the form (X_0,T_1,X_1),\dotsc,(X_{n-1},T_n,X_n), such that H(T_1||\dotsb||T_n)=B and all the proofs in the chain are valid.
Note that the values of the chain and the proofs are not part of the statement.
- Third-level (Blockmesh): This proof is for a tuple (L,X), where L is a cumulative layer hash and X a state hash. The statement is:
- Either L=\bot and X=X_{\text{genesis}} or
- We know a cumulative layer hash L', a block-hash B, a third-level proof for a tuple (L',X') and a second-level proof for tuple (X', B, X), such that H(L'||B)=L and both proofs are valid.
Note that the values of L', B and X', and the proofs are not part of the statement.
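The chaining logic shared by the second- and third-level statements can be sketched concretely. Here is a toy aggregator for the second-level case, with sha256 standing in for H and the per-tuple proof verification omitted:

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Toy check of the second-level statement: given first-level tuples
# (X_0,T_1,X_1),...,(X_{n-1},T_n,X_n), verify that they chain and derive
# the block tuple (X_0, B, X_n) with B = H(T_1 || ... || T_n).
# Real code would also verify a zk proof per tuple (omitted here).
def aggregate_block(tuples):
    x_prev = tuples[0][0]
    ts = b""
    for (x_in, t, x_out) in tuples:
        assert x_in == x_prev, "state hashes must chain"
        ts += t
        x_prev = x_out
    return (tuples[0][0], H(ts), x_prev)

x0, x1, x2 = H(b"s0"), H(b"s1"), H(b"s2")
t1, t2 = H(b"tx1"), H(b"tx2")
block_tuple = aggregate_block([(x0, t1, x1), (x1, t2, x2)])
assert block_tuple == (x0, H(t1 + t2), x2)
```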
The third-level proofs of the sort above are inherently sequential. We can make the recursion parallelizable by supporting a generic “segment” proof:
- Segment: This proof is for a tuple (L_0,X_0, L_n, X_n) where L_0,L_n are cumulative layer hashes and X_0,X_n are state hashes. The statement is that we know at least one of the following:
- Either a sequence of hashes L_1,\dotsc,L_{n-1} and a chain of second-level proofs for tuples (X_0,B_1,X_1),\dotsc,(X_{n-1},B_n,X_n) such that for i\in\{1,\dotsc,n\} it holds that L_i = H(L_{i-1}||B_i) and all the proofs are valid OR
- a sequence of segment proofs for tuples (L_0,X_0,L_1,X_1),\dotsc,(L_{n-1},X_{n-1},L_n,X_n) and all the proofs are valid.
The segment proof for a tuple (\bot,X_{\text{genesis}},L,X) is equivalent (in terms of what it’s claiming) to the third-level proof for tuple (L,X), but each segment in a chain can be proved in parallel.
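A toy illustration of why segments parallelize: adjacent segment tuples compose whenever their endpoints match, so a long chain folds pairwise in logarithmically many sequential rounds (the values below are placeholders):

```python
# Two adjacent segment tuples compose into one covering the whole range,
# so a chain of n segments can be folded level by level, with all
# compositions in a level proved in parallel.
def compose(seg_a, seg_b):
    (l0, x0, l1, x1) = seg_a
    (l1b, x1b, l2, x2) = seg_b
    assert (l1, x1) == (l1b, x1b), "segments must share an endpoint"
    return (l0, x0, l2, x2)

segs = [(0, "g", 1, "a"), (1, "a", 2, "b"), (2, "b", 3, "c"), (3, "c", 4, "d")]
# Fold pairwise, as a parallel prover could do one level at a time.
while len(segs) > 1:
    segs = [compose(segs[i], segs[i + 1]) for i in range(0, len(segs), 2)]
assert segs == [(0, "g", 4, "d")]
```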
Implementing recursive proofs
The recursive proofs are actually the easiest, conceptually. We just implement a fixed riscv program that accepts the “we know” witnesses as inputs, runs the zkvm verifier code to check the proofs and outputs the proof tuple (ideally, the zkvm will have precompiles that make verifying its own proofs reasonably efficient).
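A sketch of such a guest program, with the zkvm’s proof verifier abstracted as a callback (`verify` here is a hypothetical stand-in for the verifier precompile, and the host I/O syscalls are elided):

```python
# Sketch of the recursive guest: read the "we know" witnesses (inner
# proofs and their tuples), check each proof, check that the tuples
# chain, then emit the outer tuple via the output syscall.
def recursive_guest(verify, proofs, tuples):
    for proof, t in zip(proofs, tuples):
        assert verify(proof, t), "invalid inner proof"
    for (a, b) in zip(tuples, tuples[1:]):
        assert a[2] == b[0], "tuples must chain"
    return (tuples[0][0], tuples[-1][2])  # would go to the output syscall

# Stand-in verifier, for demonstration only.
fake_verify = lambda proof, t: proof == ("ok", t)
tuples = [("x0", "t1", "x1"), ("x1", "t2", "x2")]
proofs = [("ok", t) for t in tuples]
assert recursive_guest(fake_verify, proofs, tuples) == ("x0", "x2")
```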
Implementing transaction proofs
We can always take our full (non-zk) validator client, which emulates riscv internally to run transaction code, compile that client to riscv, and then run the entire validator client in the VM with the contract code and the inputs as its input. This is essentially what the EVM-over-riscv solutions are doing. However, this has a huge overhead – every opcode in the contract method translates to the many thousands of opcodes required to interpret it.
Here’s a proposal for how to efficiently prove transaction execution using a black-box model of the type above (efficient here means low overhead compared to running the raw riscv contract code in the VM).
I’m assuming that text/code segments (i.e., program code) can’t be written or read by the program itself – otherwise things become a lot hairier. (My intuition is that adding this type of memory protection to a zkvm that doesn’t support it shouldn’t be too hard — essentially it needs a separate memory area for a fetch as opposed to a load.)
High-level design
Execution model outline (non-proving)
As far as the account system is concerned, execution happens in something like the following model (the account system doesn’t know about proofs internally):
- Contract code (as committed to the chain) is “raw” riscv code, but with system calls defined by our account model API (not the prover VM syscalls). System calls use the risc-v ECALL opcode.
- Each method is executed in its own “pristine” environment, with all the RAM initialized to zero (except for some predefined address that contains the initial input) and all registers set to a known initial configuration.
- We haven’t fully designed our account API, but for the sake of this example let’s say each account has a key/value store, contract methods can only directly read/write their own account data, and accessing data from other accounts is via cross-contract calls.
Here are some example system calls for this API (some of these might only be available as a separate transaction, not within a contract method):
- Deploy a new contract template
- Create a new account
- Read state object (given a key, write the contents of the corresponding value into RAM)
- Write state object (given a key and a pointer to RAM, write the value from RAM into the key/value store for the account)
- Call account method (given a target account id, target method id and input values, execute that method — in its own context – and return the result)
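A minimal sketch of this account model, with only the key/value calls modeled and all names illustrative (not a spec):

```python
# Toy model of the per-account key/value API from the list above:
# a method may only touch its own account's store directly; other
# accounts are reached via cross-contract calls (not modeled here).
class Account:
    def __init__(self):
        self.store = {}

class Chain:
    def __init__(self):
        self.accounts = {}

    def create_account(self, account_id):
        self.accounts[account_id] = Account()

    def read_object(self, account_id, key):
        # "Read state object": returns the value, or None if absent.
        return self.accounts[account_id].store.get(key)

    def write_object(self, account_id, key, value):
        # "Write state object": updates the account's key/value store.
        self.accounts[account_id].store[key] = value

chain = Chain()
chain.create_account("alice")
chain.write_object("alice", b"k", b"v")
assert chain.read_object("alice", b"k") == b"v"
assert chain.read_object("alice", b"missing") is None
```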
- For the purposes of this discussion, I’m ignoring how transactions specify which contract methods to run and how authorization works – e.g., think of authorization as part of the trusted entry code for the method.
Proving execution
The general structure for proving transaction execution is as follows:
- Code transformation: The “raw” contract code is transformed (using a simple, deterministic algorithm) into “zkvm-able” code. This entails adding additional trusted “library code” to the executable, and replacing account API system calls with standard risc-v calls to the library functions (the details of how this happens are below).
- VM Execution: The zkvm-able code is sent to the VM and executed there, with the (transformed) transaction inputs. The translated account model API syscalls will require interactive communication with the VM, which is handled by the host (our validator client).
- Output and Proof: The output is parsed to get the gas usage, new state information, and the ZK proof statement.
I’ll give examples for transformations (these are optimized more for clarity than efficiency – it’s possible to do things more efficiently, but the point here is the technique).
- Entry code / additional inputs for standard contract calls (deploying a template is a separate issue):
- Every contract method is given as input not only the global state hash X, but the account state hash Y and a merkle path from Y to the global hash. The entry code verifies the merkle path and stores Y in a location that is inaccessible to the raw contract code. (One possible way to do this is to store it in registers that are guaranteed not to be used by the contract code – for example, if we restrict contracts to use rv32em instructions, then there are 16 registers guaranteed not to be used.) The entry code outputs the initial global state hash (the first value of the (X,T,X') proof tuple).
- Read state object: This system call is translated to a library call that sends a request for object key to the host. The host will return the object data, along with a merkle path to the account state hash Y (or a path showing the key doesn’t exist). The library function will then verify the merkle path, and if it is valid will return the object data.
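The merkle check performed by the read-object library function can be sketched as follows (a toy binary merkle path with sha256 as the hash; the real tree layout and hash are undecided):

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Recompute the root from the returned value and the sibling path, and
# accept the data only if the result matches the account state hash Y.
# Each path element is (sibling_hash, value_subtree_is_left).
def verify_path(value: bytes, path, root: bytes) -> bool:
    node = H(value)
    for sibling, value_is_left in path:
        node = H(node + sibling) if value_is_left else H(sibling + node)
    return node == root

leaf_l, leaf_r = b"left-value", b"right-value"
root = H(H(leaf_l) + H(leaf_r))  # two-leaf toy tree
assert verify_path(leaf_l, [(H(leaf_r), True)], root)
assert not verify_path(b"forged", [(H(leaf_r), True)], root)
```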
- Write state object: This system call is translated to a library call that sends the host a request to write value val under object key. The host will return a new account state root Y', along with a merkle path from the original value stored under key to the current account state hash. The library function will check that updating only that single value and recomputing the hashes along the merkle path results in account state root Y'. If so, it updates the register containing the current account state hash.
- Cross-contract call: this system call is replaced by a library function that does the following:
- Compute the updated global state hash X' (using the updated local account state hash, and the merkle path from Y to the global hash). Send this updated hash to the host, along with a request to call another contract (and the relevant parameters for the call).
- Receive from the host the further updated state hash X'' resulting from the cross-contract call, and a new merkle path from the local account hash Y' to X''. In addition, the host must send a zkvm proof for the tuple (X',T',X''), where T' is the transaction equivalent to the cross-contract call.
This is extremely expensive since it requires a recursive verification. For a subset of methods, which we can syntactically verify do not make any state changes or writes to RAM (we just check that the machine code doesn’t contain store opcodes or write-object system calls), I think we can just run the method call directly, without changing contexts or requiring recursive proofs.
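The root-recomputation step used by both the write-object call and the cross-contract call can be sketched in the same toy merkle model (sha256, binary two-leaf tree, illustrative only): the sibling path that authenticated the old value is reused to compute the root implied by the new value, which is then compared against the Y' (or X'') claimed by the host.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Recompute a root along a fixed sibling path; changing exactly one
# leaf changes the root deterministically, so the library function can
# check the host's claimed new root without trusting the host.
def recompute_root(leaf_value: bytes, path) -> bytes:
    node = H(leaf_value)
    for sibling, value_is_left in path:
        node = H(node + sibling) if value_is_left else H(sibling + node)
    return node

old_l, leaf_r = b"old", b"other"
path = [(H(leaf_r), True)]
old_root = recompute_root(old_l, path)
new_root = recompute_root(b"new", path)
assert old_root == H(H(old_l) + H(leaf_r))
assert new_root == H(H(b"new") + H(leaf_r))
assert new_root != old_root
```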
- Exit code: a “halt” opcode in the original contract program is also transformed into a library call, this time to trusted exit code. The exit code uses the merkle path from the original account state hash to the global state hash in order to compute the updated global state hash from Y', the current account state hash (stored in a register). It outputs the updated global state hash (it will also output the return value from the method, as well as gas usage).
- Deployment: deploying a contract is special; this would run our own predefined code that checks the machine code of the deployed contract (e.g., makes sure it only uses rv32e opcodes and valid system calls). It can also run the transformation algorithm, and generate the image id for the transformed code. We can think of the id as part of the template state. This way, we don’t have to run the transformation for every transaction – the entry code can just check that its image id matches the id its corresponding template has stored.
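A toy sketch of this deploy-time caching, with `transform` as a placeholder for the real rewriting pass and sha256 standing in for the zkvm image id:

```python
import hashlib

# Run the code transformation once at deployment, store the resulting
# image id with the template, and let entry code later just compare ids
# instead of re-running the transformation per transaction.
def transform(raw_code: bytes) -> bytes:
    return b"LIB||" + raw_code  # placeholder for the real rewriting pass

def image_id(code: bytes) -> bytes:
    return hashlib.sha256(code).digest()

templates = {}

def deploy(template_id: str, raw_code: bytes):
    # (Real deployment would also validate opcodes/syscalls here.)
    templates[template_id] = image_id(transform(raw_code))

def entry_check(template_id: str, running_image: bytes) -> bool:
    return templates[template_id] == running_image

deploy("counter", b"raw riscv bytes")
assert entry_check("counter", image_id(transform(b"raw riscv bytes")))
assert not entry_check("counter", image_id(b"tampered"))
```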
Gas metering without hardware support
If the zkvm doesn’t support gas metering in “hardware”, I think we can still do fairly efficient metering using the code transformation technique. The idea is to use a reserved register to keep a gas counter (again, restricting contracts to rv32e will help here).
(The explanation requires the notion of a basic block: a basic block is a sequence of opcodes with no branches into or out of its middle – i.e., if the first opcode is executed then all of them are guaranteed to run sequentially.)
For every basic-block in the original contract code, we add a small snippet of code that adds the appropriate value to the gas counter (since the basic block is guaranteed to either all run or not run at all, and we know the gas costs of each opcode, the amount we add to the counter is fixed). In addition, there is a check that the gas counter register is less than the gas limit (held in another reserved register). We add the same snippet after every system call (with a special one for cross-contract calls, where the return value from the call contains the gas expended in the call itself).
If everything can be kept in registers, it seems that the overhead here can be quite low (a few opcodes per basic-block).
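A toy instrumentation pass illustrating the idea: split code into basic blocks at branch points and compute one fixed gas increment per block (opcode names and gas costs are made up, and this toy ignores branch targets, which also start new blocks):

```python
# Illustrative gas table and branch set -- not real costs or a real ISA.
GAS = {"addi": 1, "lw": 3, "sw": 3, "beq": 2, "jal": 2, "ecall": 10}
BRANCHES = {"beq", "jal", "ecall"}

def basic_blocks(opcodes):
    blocks, cur = [], []
    for op in opcodes:
        cur.append(op)
        if op in BRANCHES:  # a branch/call/syscall ends the current block
            blocks.append(cur)
            cur = []
    if cur:
        blocks.append(cur)
    return blocks

def gas_increments(opcodes):
    # One fixed increment per block: it would be added to the reserved
    # gas register at the block head, followed by a limit check
    # against the reserved gas-limit register (not modeled here).
    return [sum(GAS[op] for op in block) for block in basic_blocks(opcodes)]

prog = ["addi", "lw", "beq", "sw", "addi", "jal", "addi"]
assert basic_blocks(prog) == [["addi", "lw", "beq"], ["sw", "addi", "jal"], ["addi"]]
assert gas_increments(prog) == [6, 6, 1]
```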
A disadvantage of this method, however, is that every time the gas table changes, the image id of the transformed code would change as well – for all contracts. Using a dynamic gas table (where the gas table is given as input) seems like it would be a lot less efficient, but maybe there’s a nice way to do it.