Redundant PoET Registration

Motivation

PoET servers are a potential single point of failure: there are many scenarios in which they can be attacked or fail. In addition, a single widely used PoET service gives its operators centralized control over the network. If a widely used PoET service is malicious or attacked, it can have detrimental effects on the network.

Mitigation

Allow smeshers to register for multiple PoET services and pick one proof to use in their ATX.

Assumption

This mitigation is only relevant if multiple entities actually run PoET services. If in practice only the Spacemesh company runs a PoET service, then it gains the power to censor smeshers and this mitigation won't be effective.
Having said that, allowing smeshers to register with multiple PoETs will encourage running additional PoETs. E.g. users will be able to run their own PoET as a backup even if it's slower than the one Spacemesh provides, and it will only be used as a fallback.

Implementation

Configuration

The current config includes:

PoETServer string `mapstructure:"poet-server"`

Instead of a single string, it should include a list of strings.
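A minimal sketch of the change, assuming the field is renamed to a plural form (the new field and mapstructure tag names are illustrative, not the final names):

```go
package main

import "fmt"

// Config sketch: the single PoETServer field becomes a slice.
type Config struct {
	// Before: PoETServer string `mapstructure:"poet-server"`
	PoETServers []string `mapstructure:"poet-servers"`
}

func main() {
	cfg := Config{PoETServers: []string{
		"https://poet.example.org",
		"http://my-backup-poet.local:8080",
	}}
	fmt.Println(len(cfg.PoETServers), "PoET servers configured")
}
```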

Submitting and Querying

We currently instantiate an HTTPPoetClient in node.go->App.Start() based on the configured PoET address. We then pass it along to the NIPostBuilder in nipost.go and call Submit() from BuildNIPost().
We don't currently query the PoET server, since the existing implementation awaits a proof via gossip.
I think the best way to update this flow is to create a list of HTTPPoetClients and iterate over it every time we need to submit a challenge, calling the same method with the same arguments on each one.
It's also possible to create a smarter client that internally holds a list of servers and does the iterating on its own. I feel that this idea breaks down when querying, since it means either returning a list of responses or putting the logic for picking a PoET proof inside the client - both options sound bad to me.
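The first option (iterating over a list of clients) could look roughly like this sketch. The `poetClient` interface, `fakeClient`, and the simplified `Submit` signature are stand-ins for illustration; the real HTTPPoetClient API differs. The sketch treats registration as successful if at least one PoET accepted the challenge:

```go
package main

import (
	"errors"
	"fmt"
)

// poetClient is a stand-in for HTTPPoetClient with a simplified Submit.
type poetClient interface {
	Address() string
	Submit(challenge []byte) error
}

type fakeClient struct {
	addr string
	fail bool
}

func (c fakeClient) Address() string { return c.addr }
func (c fakeClient) Submit(challenge []byte) error {
	if c.fail {
		return errors.New("unreachable")
	}
	return nil
}

// submitToAll registers the challenge with every configured PoET and
// succeeds as long as at least one registration went through.
func submitToAll(clients []poetClient, challenge []byte) error {
	ok := 0
	for _, c := range clients {
		if err := c.Submit(challenge); err != nil {
			fmt.Printf("submit to %s failed: %v\n", c.Address(), err)
			continue
		}
		ok++
	}
	if ok == 0 {
		return errors.New("all PoET registrations failed")
	}
	return nil
}

func main() {
	clients := []poetClient{
		fakeClient{addr: "poet-a", fail: true},
		fakeClient{addr: "poet-b"},
	}
	err := submitToAll(clients, []byte("challenge"))
	fmt.Println("err:", err)
}
```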

Selecting a PoET Proof

When the target time arrives (see below), in nipost.go->BuildNIPost(), under // Phase 1: receive proofs from PoET service, all PoETs should be queried. Then the node must select a single proof to include in its ATX: among the valid PoET proofs in which the smesher is a member, the one with the highest tick count should be selected.
If no valid PoET proof in which the smesher is a member can be found, and some PoETs didn't provide a proof at all (e.g. they are late), the node will keep querying at regular intervals until the timeout arrives.
To reduce the risk of missing another epoch due to a late or failed PoET, the smesher should give up and submit a challenge for the next epoch towards the end of the PoET registration time, which should also be configurable.
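The selection rule can be sketched as follows. The `poetProof` struct is a simplified stand-in (the real proof type and membership check live in the activation package):

```go
package main

import "fmt"

// poetProof is a simplified stand-in for a PoET proof plus membership info.
type poetProof struct {
	poet      string
	ticks     uint64
	valid     bool
	hasMember bool // the smesher's challenge is in the membership tree
}

// selectProof returns the valid proof containing the smesher with the
// highest tick count, or nil if none qualifies yet.
func selectProof(proofs []poetProof) *poetProof {
	var best *poetProof
	for i := range proofs {
		p := &proofs[i]
		if !p.valid || !p.hasMember {
			continue
		}
		if best == nil || p.ticks > best.ticks {
			best = p
		}
	}
	return best
}

func main() {
	proofs := []poetProof{
		{poet: "poet-a", ticks: 900, valid: true, hasMember: true},
		{poet: "poet-b", ticks: 1200, valid: true, hasMember: true},
		{poet: "poet-c", ticks: 2000, valid: true, hasMember: false},
	}
	if best := selectProof(proofs); best != nil {
		fmt.Printf("selected %s with %d ticks\n", best.poet, best.ticks)
	}
}
```

Note that poet-c is skipped despite its higher tick count, because the smesher is not a member of that proof.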

Determining the target time

There are generally 3 relevant deadlines to publish an ATX:

  1. Just before the next PoET round begins.
  2. Just before the target epoch begins.
  3. Just before the last round of the target epoch begins.

Ideally, deadline 1 should be used, so that there's no slippage. If a smesher published their previous ATX in layer 1000 of the epoch, they should keep doing so; otherwise smeshers will all slowly (or quickly) slip closer and closer to the second deadline. In the future we may implement incentives or rules to prevent this from happening.
Deadline 2 is the last opportunity to publish the ATX before the target epoch without risking a missed eligibility (e.g. in case the smesher is eligible for a ballot in the first layer of the epoch). If we wait for that deadline then, unless there's a PoET that starts right at that time (which we're not going to do), we miss the next epoch - so we probably shouldn't.
The 3rd deadline is the last opportunity to get any eligibility out of the ATX. It should only be used as a timeout. E.g. if the node comes back online after being offline and sees that it has an unpublished ATX and this deadline hasn't passed yet, then it may be worth publishing; but if the deadline has passed, the node can just discard it.
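The timeout rule for deadline 3 is a one-line check; this sketch uses illustrative layer numbers (the real deadline would be derived from the target epoch's last round):

```go
package main

import "fmt"

// shouldPublish implements the timeout rule above: a recovered, still
// unpublished ATX is worth publishing only before the last round of the
// target epoch begins.
func shouldPublish(currentLayer, lastRoundStart uint32) bool {
	return currentLayer < lastRoundStart
}

func main() {
	const lastRoundStart = 5000 // illustrative
	fmt.Println(shouldPublish(4200, lastRoundStart)) // still worth publishing
	fmt.Println(shouldPublish(5100, lastRoundStart)) // discard the ATX
}
```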

So in order to publish the ATX before the next PoET round begins, the node needs to know:

  1. When the next PoET round begins (for genesis we can hardcode a layer relative to the epoch and just assume that all PoETs start at that time - we’re not doing phased PoETs).
  2. How long it takes the node to query the PoET, run a PoST proof, construct an ATX and submit a challenge to the PoET for the following epoch. Let’s call this length of time “the PoET delta”. See below for how to determine it.
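Given those two inputs, the time at which the node should start building the ATX is simply the next round's start minus the delta. A minimal sketch, with assumed inputs (a hardcoded round start at genesis, a measured delta later):

```go
package main

import (
	"fmt"
	"time"
)

// buildStartTime returns when the node should begin querying the PoETs,
// generating the PoST proof, and constructing the ATX, so that it finishes
// before the next PoET round opens.
func buildStartTime(nextRoundStart time.Time, poetDelta time.Duration) time.Time {
	return nextRoundStart.Add(-poetDelta)
}

func main() {
	nextRound := time.Date(2023, 1, 10, 12, 0, 0, 0, time.UTC) // illustrative
	delta := 6 * time.Hour                                     // illustrative
	fmt.Println(buildStartTime(nextRound, delta).Format(time.RFC3339))
}
```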

Determining the PoET delta

The delta depends partly on the hardware used, but mostly on the number of space units, since reading all this data takes time.
We should define some basic time it takes per space unit. We can determine this by benchmarking on the lower end of the supported hardware spectrum. The node will multiply this time by the number of space units allocated.
When we generate the PoST proof, we should measure how long it takes and, next time, use this duration plus some buffer (e.g. 10%) as the delta.
At genesis, the PoET will use some big delta that should allow home miners enough time, but in the future we may do different things, like posting a proof early and then updating it with more ticks as they accumulate, or having different PoETs with different deltas.
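The two estimation strategies described above can be sketched as follows; the per-space-unit base time and the 10% buffer are illustrative values, not benchmarked ones:

```go
package main

import (
	"fmt"
	"time"
)

// initialDelta scales a benchmarked per-space-unit read time by the number
// of space units the smesher allocated (used before any proof was measured).
func initialDelta(perUnit time.Duration, numUnits uint32) time.Duration {
	return perUnit * time.Duration(numUnits)
}

// measuredDelta adds a safety buffer on top of the last observed PoST
// proof generation time.
func measuredDelta(lastProof time.Duration) time.Duration {
	return lastProof + lastProof/10 // +10% buffer
}

func main() {
	fmt.Println(initialDelta(30*time.Minute, 4)) // 4 space units
	fmt.Println(measuredDelta(100 * time.Minute))
}
```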

We discussed in the past that we will wait until a reasonable last minute to build the ATX, even though we have already received multiple proofs. I think we should switch to that model at this point in time too, in case a different PoET service is set up with a different configuration.

If you agree, can you specify what timing is safe?

Thanks for reminding me! Added it to the spec.

@noam, how do I check the tick count of a PoET proof? I can’t find such a field in go-spacemesh/activation.go at cd604a5eb9d21212f19efb392b7c9615a754c8c4 · spacemeshos/go-spacemesh · GitHub

tickCount = types.PoetProof.LeafCount/app.Config.TickSize

it’s the same as the 2nd arg to ATX.Verify() in the code go-spacemesh/activation.go at df86951a5b59f66b03fb445933365f29baf22638 · spacemeshos/go-spacemesh · GitHub
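As a sketch, the formula above reduces to a single integer division (the field names here mirror that expression, but the types are simplified for illustration):

```go
package main

import "fmt"

// tickCount computes the tick count of a PoET proof:
// tickCount = types.PoetProof.LeafCount / app.Config.TickSize
func tickCount(leafCount, tickSize uint64) uint64 {
	return leafCount / tickSize
}

func main() {
	fmt.Println(tickCount(1_000_000, 500))
}
```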


What about the NodeService GRPC API that allows updating a PoET service (api/node.proto at 1907c2fe076e1f5f754bd130adce396a4dff7493 · spacemeshos/api · GitHub)?

So far it allowed changing the current PoET service to a new one. Does the GRPC NodeService need to be extended to allow adding and removing PoET services?

Great question. I think that the simplest solution for now would be to allow replacing the list of PoET services. So the endpoint would just accept a list instead of an entry and replace the existing one, like today.

This makes me wonder how good a job we did when we implemented the PoET replacement endpoint. Does the node know how to properly handle the replacement? Specifically, does the activation module persistently store which PoET server it submitted a challenge to, so that it would know to query that in case the config changed? To be clear, the config should only affect the next time the node registers for the PoET.

It might be a bit different today because we listen for the proof on gossip, so we only need to store the PoET ID and round that we listen for. Now that we need to query for the PoET proof, we need to store the actual PoET address (IP+port / URL).

A PoET that was updated over the API is not written to the config or anywhere else. If someone updates it using the API and that change needs to be persistent across restarts, then updating the config file is also required.

Whether it should be permanent or not is another story. @dmitry could you please open a separate topic to discuss it?