Redundant PoET Registration

Motivation

PoET servers are a potential single point of failure. There are many scenarios in which a PoET can be attacked or fail. In addition, a single widely used PoET service gives its operators centralized control over the network. If a widely used PoET service is malicious or attacked, it can have detrimental effects on the network.

Mitigation

Allow smeshers to register for multiple PoET services and pick one proof to use in their ATX.

Assumption

This is only relevant if there are actually multiple entities running PoET services. If, in practice, only the Spacemesh company runs a PoET service, then it gains the power to censor smeshers and this mitigation won’t be effective.
Having said that, allowing smeshers to register with multiple PoETs will encourage running additional PoETs. E.g. a user will be able to run their own PoET as a backup: even if it’s slower than what Spacemesh provides, it will only be used as a fallback.

Implementation

Configuration

The current config includes:

PoETServer string `mapstructure:"poet-server"`

Instead of a single string, it should include a list of strings.
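
For example, keeping the existing mapstructure convention (the plural field and tag names here are assumptions, not the final config):

    // Hypothetical shape of the updated config:
    type PoETConfig struct {
        PoETServers []string `mapstructure:"poet-servers"`
    }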

Submitting and Querying

We currently instantiate an HTTPPoetClient in node.go->App.Start() based on the configured PoET address. We then pass it along to the NIPostBuilder in nipost.go and call Submit() from BuildNIPost().
We don’t currently query the PoET server, since the existing implementation awaits a proof via gossip.
I think the best way to update this flow is to create a list of HTTPPoetClients and iterate over it every time we need to submit a challenge, calling the same method with the same arguments on each one.
It’s also possible to create a smarter client that internally holds a list of servers and does the iterating on its own. I feel that this idea breaks down when querying, since it means either returning a list of responses, or putting the logic for picking a PoET proof inside the client - both options sound bad to me.
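
A minimal sketch of the first option, assuming a narrow interface over HTTPPoetClient (the interface and function names here are illustrative, not the real API):

    import (
        "context"
        "log"
    )

    // PoetClient is a hypothetical narrow view of HTTPPoetClient.
    type PoetClient interface {
        Submit(ctx context.Context, challenge []byte) error
    }

    // submitToAll fans the same challenge out to every configured PoET.
    // One unreachable or failing service must not block the others.
    func submitToAll(ctx context.Context, clients []PoetClient, challenge []byte) {
        for _, c := range clients {
            if err := c.Submit(ctx, challenge); err != nil {
                log.Printf("poet submit failed: %v", err)
            }
        }
    }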

Selecting a PoET Proof

When the target time arrives (see below), in nipost.go->BuildNIPost(), under // Phase 1: receive proofs from PoET service, all PoETs should be queried. Then the node must select a single proof to include in its ATX: among the valid PoET proofs in which the smesher is a member, the one with the highest tick count should be selected.
If no valid PoET proof in which the smesher is a member can be found, and some PoETs didn’t provide a proof at all (e.g. they are late), the node will keep querying at regular intervals until the timeout arrives.
In order to reduce the risk of missing another epoch due to a late or failed PoET, the smesher should give up and submit a challenge for the next epoch towards the end of the PoET registration time, which should also be configured.
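
In pseudo-Go, the selection rule could look like this (the proof type and helpers are placeholders, not the real API):

    // candidate is a placeholder for a queried PoET proof.
    type candidate struct {
        valid     bool   // passed PoET proof validation
        member    bool   // the smesher's challenge is included
        tickCount uint64 // see the tick count discussion below
    }

    // selectProof picks the valid proof that includes the smesher and
    // carries the highest tick count. nil means no usable proof yet:
    // keep querying at regular intervals until the timeout.
    func selectProof(proofs []candidate) *candidate {
        var best *candidate
        for i := range proofs {
            p := &proofs[i]
            if !p.valid || !p.member {
                continue
            }
            if best == nil || p.tickCount > best.tickCount {
                best = p
            }
        }
        return best
    }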

Determining the target time

There are generally 3 relevant deadlines to publish an ATX:

  1. Just before the next PoET round begins.
  2. Just before the target epoch begins.
  3. Just before the last round of the target epoch begins.

Ideally, deadline 1 should be used, so that there’s no slippage. If a smesher published their previous ATX in layer 1000 of the epoch, they should keep doing so. Otherwise smeshers will all slowly (or quickly) slip closer and closer to the second deadline. In the future we may implement incentives or rules that should prevent this from happening.
Deadline 2 is the last opportunity to publish the ATX before the target epoch without risking a missed eligibility (e.g. in case the smesher is eligible for a ballot in the first layer of the epoch). If we wait for that deadline then, unless there’s a PoET round that starts right at that time (which we’re not going to do), we miss the next epoch, so we probably shouldn’t.
The 3rd deadline is the last opportunity to get any eligibility out of the ATX. It should only be used as a timeout. E.g. if the node was offline and, when it comes back online, it sees that it has an unpublished ATX and this deadline hasn’t passed yet, then it may be worth publishing; but if the deadline has passed, the node can just discard the ATX.

So in order to publish the ATX before the next PoET round begins, the node needs to know:

  1. When the next PoET round begins (for genesis we can hardcode a layer relative to the epoch and just assume that all PoETs start at that time - we’re not doing phased PoETs).
  2. How long it takes the node to query the PoET, run a PoST proof, construct an ATX and submit a challenge to the PoET for the following epoch. Let’s call this length of time “the PoET delta”. See below for how to determine it.
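
Putting the two together, the target time is just the round start minus the PoET delta; a minimal sketch with assumed names:

    import "time"

    // targetTime returns when the node should start querying PoETs, so that
    // it can run the PoST proof, build and publish the ATX, and re-register
    // just before the next PoET round begins.
    func targetTime(nextRoundStart time.Time, poetDelta time.Duration) time.Time {
        return nextRoundStart.Add(-poetDelta)
    }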

Determining the PoET delta

The delta depends partly on the hardware used, but mostly on the number of space units, since reading all this data takes time.
We should define some basic time it takes per space unit. We can determine this by benchmarking on the lower end of the supported hardware spectrum. The node will multiply this time by the number of space units allocated.
When we calculate the PoST proof, we should measure the time it takes and next time use this time plus some buffer (e.g. 10%) as the delta.
At genesis, the PoET will use some big delta that should give home miners enough time, but in the future we may do different things, like posting a proof early and then updating it with more ticks as they accumulate, or having different PoETs with different deltas.
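
To make the two approaches concrete, a sketch assuming both the 10% buffer and all names (these are not the node’s real API):

    import "time"

    // initialDelta: before any measurement exists, use a benchmarked
    // per-space-unit baseline multiplied by the allocated space units.
    func initialDelta(perSpaceUnit time.Duration, numUnits uint32) time.Duration {
        return time.Duration(numUnits) * perSpaceUnit
    }

    // nextDelta: once a PoST proof has been generated, derive the next
    // delta from the measured proof time plus a 10% safety buffer.
    func nextDelta(measuredPost time.Duration) time.Duration {
        return measuredPost + measuredPost/10
    }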

We discussed in the past that we would wait until a reasonable last minute to build the ATX, even though we have already received multiple proofs. I think we should change to that model at this point in time too, in case a different PoET service is set up with a different configuration.

If you agree, can you specify what timing is safe?

Thanks for reminding me! Added it to the spec.

@noam, how do I check the tick count of a PoET proof? I can’t find such a field in https://github.com/spacemeshos/go-spacemesh/blob/cd604a5eb9d21212f19efb392b7c9615a754c8c4/common/types/activation.go#L302-L306

tickCount = types.PoetProof.LeafCount / app.Config.TickSize

It’s the same as the 2nd arg to ATX.Verify() in go-spacemesh/activation.go at df86951a5b59f66b03fb445933365f29baf22638.


What about the NodeService GRPC API that allows updating a PoET service (api/node.proto at 1907c2fe076e1f5f754bd130adce396a4dff7493)?

So far it allowed changing the current PoET service to a new one. Does the GRPC NodeService need to be extended to allow adding and removing PoET services?

Great question. I think that the simplest solution for now would be to allow replacing the list of PoET services. So the endpoint would just accept a list instead of an entry and replace the existing one, like today.

This makes me wonder how good a job we did when we implemented the PoET replacement endpoint. Does the node know how to properly handle the replacement? Specifically, does the activation module persistently store which PoET server it submitted a challenge to, so that it knows which server to query even if the config changes? To be clear, the config should only affect the next time the node registers with a PoET.

It might be a bit different today because we listen for the proof on gossip, so we only need to store the PoET ID and round that we listen for. Now that we need to query for the PoET proof, we need to store the actual PoET address (IP+port / URL).

A PoET that was updated over the API is not written to the config or anywhere else. If someone updates it using the API and that change needs to persist across restarts, then updating the config file is also required.

Whether it should be permanent or not is another story. @dmitry could you please open a separate topic to discuss it?

I don’t see it as the node’s responsibility to update the config. Let’s take SMApp as an example, but it should be the same for any client. If a user changes something in the UI, the SMApp should do two things (btw, not limited to PoET changes):

  1. Change the config, so this change is persisted next time the node is restarted.
  2. Call an API to make the change take immediate effect.

As I see it, the SMApp is the owner of the config. It does this today already for the smeshing config: it calls the StartSmeshing endpoint and at the same time updates the config for the next node start up.

There are other possible approaches, but I like what I described.

Some alternatives are:

  1. Do what @dmitry said and have the node manage the config, updating it after relevant API calls.
  2. Have the node monitor config changes, and remove these APIs altogether. I like this approach, but it breaks our current flow of passing command line args as config options.
  3. Hybrid of the other options - might end up being confusing and hard to maintain.

Most importantly, and regardless of what we choose here, the node has to persist, independently of the config, the list of servers it registered with. This is important in order to allow changing the list of servers for future registrations without losing track of previous registrations.

It’s simplest to explain this in the case of a single PoET:

  • You register and wait for the proof.
  • During that time you decide to change to a different PoET.
  • When the time comes to use the proof, you have the wrong PoET configured and it doesn’t have your proof.
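
A sketch of what such a persisted registration record could contain (names and storage layer are assumptions):

    // registration is persisted per submitted challenge, independently of
    // the config, so the node can still query the right servers even if
    // the configured list changes mid-round.
    type registration struct {
        RoundID   string   // PoET round the challenge was submitted for
        PoetAddrs []string // actual addresses (IP+port / URL) registered with
    }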

Determining the PoET delta

Maybe it should be called cycle-gap, so that things are consistent?

When we calculate the PoST proof, we should measure the time it takes and next time use this time plus some buffer (e.g. 10%) as the delta.

This delta (previously known as cycle-gap) needs to be known in advance by the PoET instance. If the node can’t prepare the PoST during that time, it can’t use that particular PoET, as it will always submit its challenge after execution has started. So it is clear that the max allowed wait time is bounded by the cycle-gap.

If the node is using a PoET, it needs to know when that PoET’s round starts and terminates. If it knows when the round terminates, then the proof should be available at the termination time plus some wait-time that depends on the method of delivery (e.g. broadcast or direct). This delta should be part of the configured cycle-gap.

Measuring PoST time allows dynamically adjusting the wait-time, but it doesn’t do anything except increase complexity.
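
The bound itself is a one-line check; a sketch with assumed names:

    import "time"

    // canUsePoet reports whether this node can use a given PoET: the PoST
    // proof must fit inside that PoET's cycle-gap, or the node will always
    // submit its challenge after the round's execution has started.
    func canUsePoet(expectedPostTime, cycleGap time.Duration) bool {
        return expectedPostTime < cycleGap
    }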

Yeah, “cycle-gap” is better than “delta”, I agree.

Yes, the PoET has a fixed cycle gap for all subscribers (at least for now). Perhaps in the future we’ll want PoETs to publish several proofs so that nodes that require a shorter cycle gap (faster disk / less allocated storage) can take advantage of the extra ticks. But even now this is not a “network param” - each PoET can have a different gap and if you run your own you can customize it to match whatever your node needs.

Anyway, I do agree that if it considerably simplifies the implementation then we can live with a hardcoded cycle gap that’s shared between the PoET servers we operate and the nodes. Perhaps we can at least make it a config param so that node operators who do run their own PoET can customize it on both PoET and nodes.

This is implemented.