Greg Meredith continues the discussion on liveness with Isaac DeFrain and Christian Williams.
Greg: Today I want to talk about some practical things that have happened with the network dynamics and then talk about how the error-correcting code stuff directly addresses that. If you remember, several months ago we talked about error-correcting codes and how that gives us a way to deal with liveness issues.
Isaac: Could we revisit this stuff about error-correcting codes and how it relates to safety? It’s been a little while since I’ve thought about that, and maybe it’d be helpful for everybody else.
Greg: I’m going to do that for sure, but I’m going to do it in the context of this particular problem. First, I want to check in and see if you understand the problem.
Christian: What are we talking about?
Greg: What we’re seeing right now is a problem in network dynamic equilibria. The issue is that if you have a bunch of validators, all of whom are identical in spec, one validator can pull way ahead of the other validators. All the other validators are just processing blocks proposed by the initial validator. Some validator receives a deploy from a client, and then runs the contract associated with the deploy, and then does a replay as a part of the validation of that computation, and then forwards the proposal, the block, onto other validators.
At that point it’s checking of the block is in its past and it can go on to do other things. All the other validators have that block to deal with. Under certain kinds of conditions, with respect to network latency and things, that validator, even though they’re all equivalent in terms of network bandwidth and storage and compute capacity, that initial validator can race ahead of all the others and be the only one that’s proposing blocks in the network. Does that make sense?
Greg: If that’s the case, then obviously the network doesn’t have very much security. You’ve only got one validator ever doing anything.
Greg: Good. Okay, we understand the problem. Do you agree that this is a liveness problem?
Greg: Okay, good. The whole point of the error-correcting codes is to address liveness. In particular, to give a language for addressing liveness. Rather than go all the way back to the full-blown glory of the framework before, I’m going to talk about a constrained version of it and then slowly widen the scope of the discussion to where we can see why this is a subset of the full-blown version.
Now I’m going to propose a solution. The solution to this comes from the block structure that we use in Casper. In Casper, we have this idea of justifications. Every block that is received by a validator has a pointer to the hashes of blocks—essentially, previous blocks that justify this block.
A little bit about the history: this idea of using justification pointers goes all the way back to the game semantics proposed by Highland and Ong for linear logic. Since those days I was aware of the use of justification pointers and, essentially, they use justification pointers as a way to reason about the threadedness of particular computations.
In the 2008 timeframe, I started adapting that idea for network security. I was building a version of what we now have as RSpace. I wanted the nodes in that network of RSpace nodes to have a means of doing protocol security. I introduced the idea of justification pointers. You can go and look at the Special K Github and see the Scala code that gives a fairly abstract generic style implementation of justification pointers as a part of security for a network protocol.
Vlad and I were discussing this and hit upon the idea that we could use justifications to go after equivocation because essentially equivocation is a form of non-single threadedness. Out of that grew safety. This is the beginning of correct-by-construction Casper.
Now the idea is to use the justification structure to limit what are good justifications. Imagine that we divide the network from the perspective of a particular validator into player and opponent. The validators is the player, and all the other validators in aggregation we think of as opponent. They represent the environment. Hopefully, that gives some hint as to how this is related to game semantics.
If you have good alternation in the justification structure—player goes, opponent goes, player goes, opponent goes, something like that—then the rabbit can’t race ahead. Does that make sense?
Christian: Yeah. By opponent going, you mean everyone in the blob of other validators?
Greg: Just that we’re on the same page, what I’m talking about is the spanning tree of the justification dag of a particular block. What I mean by good alternation is the root of that tree is the block being proposed. The next layer down is a fringe of justification blocks. Then they have corresponding justifications, which give you a new fringe. What we’re saying is that the first fringe has to be opponent. It can’t be player.
As long as you keep that structure going—player, opponent, player, opponent—recursively throughout the spanning tree of the dag, then everybody’s alternating with their environment. The player can’t race ahead. What you can see in the data that we have is that the justification of the rabbit blocks is another rabbit block. The justification of that block is another rabbit block. Rabbits are self-justifying.
Christian: What’s a rabbit block?
Greg: A block proposed by the validator that is racing ahead of the others, which I’m calling a rabbit. What you get are these long chains of the rabbit proposes this block and the justification for that is this previous block that the rabbit proposed and so on and so forth. It’s this lack of alternation with the environment that is allowing the rabbit to race ahead.
Isaac: Yeah. It’s like validators not even really paying attention to what is going on in the environment. They’re just doing their own thing.
Greg: Exactly. So this alternation property becomes a liveness problem.
Christian: When you enforce this alternation for every validator considered as a player, doesn’t it become equivalent to not a single validator can go until all of the others have gone?
Greg: Very good question. That leads to the next refinement—you’re rabbiting ahead of me. There are two things you have to worry about. The first one is the very initial condition. In the initial condition, if you require that the validator’s first block that it’s ever produced or proposes has to have this justification from the other validators, everybody’s stuck. That’s a deadlock condition. Liveness and deadlock are closely related.
What you can allow for is an initial prefix. The first thing you can do is you say that the genesis block, which is the thing that kicks off everything, is really signed by all the validators. Because it’s signed by all the validators, any validator can use that as its initial justification. You could get off the starting block that way and then you can allow for an initial prefix. You have some number of blocks that bottoms out in the genesis block that is self-justified. But then after that, you have to have good alternation. That gives you enough wiggle room in the network to allow for there to be some decent flow of blocks around the network before you have to start paying attention to your environment.
Isaac: You’re saying you would have some kind of upper bound for these non-alternating blocks, and then after that point, they have to have this good alternation property?
Greg: That’s exactly right. The other refinement is you don’t necessarily want it to be all the rest of the validators. You just want a percentage of the environment that’s related to your stake. We can look at a lot of the dynamics, but because we’re just having a conversation here, one simplified version of this is just to say you have to have a third of the stake. That corresponds to traditional consensus literature.
When we were talking about this with Kent the other day, he was saying, well doesn’t this reduce the protocol just to PBFT—practical Byzantine fault tolerance? It doesn’t, because all this is about as the proposal of a block. You don’t know that this block won’t be orphaned. It doesn’t collapse back down to a practical Byzantine fault tolerance. We’re just giving us a practical limit. There are other limits that you could set. But if you do that, then you get liveness. You get a form of liveness that at least overcomes this particular network equilibrium.
We did this by common sense reasoning. You can reduce it to a formal proof. The more important thing is that you see the common-sense argument.
Christian: What does it mean to have a one-third stake in this context?
Greg: When you look at the validators and the justification, each of those validators has a corresponding stake. That’s what it means to be a validator. You want that stake to add the sum of the stake of the validators that are in the justification that is not the proposing validator need to sum to one-third of the total stake. You know the total stake because that’s in the genesis block. Knowing the legal set of validators is critical to the protocol function.
Christian: Maintaining this condition is what you’re referring to as the equilibria?
Greg: No, the equilibrium is the validator communicating without there being livelock or deadlock. But the condition gives you good equilibria. You can express this in Von Neumann-Nash-style games. You can talk about this literally as Nash equilibria. There’s a close relationship here, but I’m not going to sketch that out in this conversation.
Isaac: How does validator rotation influence this?
Greg: Essentially, when validator rotation happens, then you have to update the validator set and that will then potentially change who is allowed to be listed in justifications. There are lots of interesting recursive things that happened in the protocol because you don’t want validators to censor validators being rotated and so on. That’s a separate discussion from the one we’re having.
We’ve talked about some relaxation on the good alternation property. We allowed for an initial prefix and we’re looking at a percentage of the stake. The full liveness error-correcting code can be transformed into a specification of weighted grammars on the justification dag. When we talked last time, we introduced the space calculus, which allowed us to do this programmable liveness condition. The liveness conditions essentially looked like a context into which you deployed your process. Then we made it programmable by making context first-class. You can ship them around and that sort of thing. That’s what gave us this space calculus.
The basic idea of encoding the context can be transformed in a step-by-step manner into constraints on the justification dag. They tell you what are allowable communications; the view of the allowable communications is represented in the justification dag. This particular block is justified by these communications that I have seen. Those communications are justified by these communications that other validators have seen. This is how you practically realize this error-correcting code idea in the Casper protocol.
Let’s talk about how all the pieces fit together. There are a lot of moving parts and I just want to make sure that everyone understands the moving parts. A few months ago I sketched out how you can turn the error-correcting code idea of liveness into a context-based encoding, and then make that programmable, which gives you a language for talking about liveness constraints. What I didn’t say was how to relate that directly to the implementation in Casper.
We didn’t relate it to observed liveness issues in the running protocol. We said, here’s a nice theory. It has some lovely properties. It’s amenable to bisimulation and therefore constrainable by types and yada yada yada—lots of nice properties, but we didn’t connect those good properties to a running implementation.
Now we have the running implementation and it is facing real liveness issues that are observed by our SREs in Testnet-2. You can see the proposed times of validators lengthening under certain network conditions. It turns out that if you run particular constrained network conditions, which are natural when you’re just beginning to test the thing, you don’t observe them. It’s only under certain kinds of load that they start to take hold. It’s a subtle issue in network equilibria.
Isaac: What’s causing the increase in the propose times?
Greg: It’s because the rabbiting validator is able to get an edge up on all the other validators.
Isaac: This is undesirable behavior.
Greg: It’s undesirable behavior. We want to eliminate it. We’re able to demonstrate that it’s clearly a liveness issue. We can now give a crisp answer on how to address the liveness issue in terms of just the code of Casper. Now we can connect the dots and show that the solution is part of a larger solution that is contained within our theory of programmable liveness.
Isaac: That’s awesome.
Greg: This is how this kind of software has to be built. In particular, the methodology that we used here was to go slow and make sure that we observed real issues and then pried just enough of the theory to resolve the issue and connect the dots in that way.
There are two approaches to this. You start from a place of correctness and you only do transformations that preserve correctness. That’s what we’re doing. But the other side of that is there are some aspects of correctness that are insensitive to emergent issues like network equilibria.
You have to make sure that you’re willing to take both sides. You’re only doing correct transformations of code, but you’re also looking out for these emergent phenomena and making sure that your theory is capable of handling those, and then applying the theory to resolve those kinds of things in a disciplined manner. That’s what we have done here.
The nice thing is that it results in a fix that takes just a couple of lines of code to ensure this property. It can be rolled out in a week. We know that it’s grounded in an argument for correctness. We know that it’s connected to a theory that we can reason about equationally. That’s the raison d’être for CBC Casper. That’s why we do this.
Issac: Is there is a working implementation of this currently?
Greg: We found the problem. This Monday we had a long talk to make sure that we were in agreement about the parameters, because we noticed that the solution is parametric, so we wanted an agreement on the parameters. Now it’s ticketed. The work has been assigned to one of the engineers and is being implemented as we speak.
Christian: It’s not quite clear in my head how this is working. Can we possibly walk through a simple example of it in practice?
Greg: Sure. Let’s start with the problem. The problem is that the dupe contract is proposed and the dupe has a lot of recursion in it. It does many layers of recursion. The client deploys this to the validator. The validator then runs the contract, and then replays the contract, and then proposes that the block with that to the rest of the network.
The validator Alice has done that much work. Now all the replay is behind Alice. Alice is now free to respond to other client requests. If you have a client requesting, say on the order of every five seconds, the other validators, if they have taken up Alice’s work, then they’re stuck on that work, unable to give much attention to any client requests.
Alice can get further and further ahead. But the only reason that she can get further and further ahead because the blocks that Alice is using as justifications for the blocks she is proposing are blocks that Alice has proposed. If we insist that the genesis block is, since it’s signed by all validators, constitutes the totality of the stake of all the validators, then that can be used as a justification for any initial block proposed by a validator.
The other thing that we do is we insist on a condition that any block that is proposed, the blocks that justify it must contain at least one-third of the stake from other validators. We don’t care which ones they are, it just can’t be Alice or whoever is proposing the block
Christian: Are these requirements being encoded contextually in the space-time calculus?
Greg: That becomes a context. What you would do is you would turn that into a context, which is enforcing that kind of alternation. The context would sit there and wait and collect blocks. That’s how you connect the two. Once you have this alternation property, then Alice is forced to wait. She can’t rabbit ahead of the other validators because she can’t be so self-justified. Does that help?
Greg: Cool. Any thoughts, questions, comments about anything we talked about so far?
Isaac: It’s pretty obvious from looking at the justifications which validators actually proposed the blocks that are being used in the justification, right?
Greg: Yes. You get this recursive structure.
Isaac: So this is protocol enforcement of a general rule of thumb: don’t believe somebody who is only using their own work to justify their future work. You want to look at someone who’s taken a little more holistic view of things. You want protocol enforcement of that and not just have a rule of thumb to follow.
Greg: That’s exactly right. If you think of the spanning tree of the justification dag as a term in some abstract syntax, then what we’re saying is that these liveness constraints correspond to certain kinds of context-free grammars that are weighted. It’s not just a plain grammar because you’re now taking into account the stake weights.
These context-free grammars can also be turned into the kind of Hamming distance-based balls that are discussed in the error-correcting code papers that I referred to a few months ago. What’s reachable by the grammar can be turned into what’s reachable by these Hamming distance gadgets. That’s the connection there.
Christian: Is each block a term in the grammar or is the whole dag a term in the grammar?
Greg: The grammar is an outside spec that is saying when you propose a block if it isn’t accepted by this grammar, it’s not valid. It’s a rule. It’s a constraint. One of the things that we discussed on Monday was what is the end protocol enforcement about this good shape? If you just have the validators implement this logic, but you don’t actually punish validators disobeying the logic, at least in some form, then they can still cause disequilibrium in the network.
Christian: Anytime you’re trying to impose some kind of conditions on a certain kind of dag in a chain, there’s a corresponding grammar, and there’s also going to be some corresponding contexts in the calculus?
Greg: That’s exactly right. Liveness conditions correspond to grammars; those grammars can be implemented by context. There’s a recursion property, which is that every justification dag is homomorphic to the total blockchain dag. They’re not the same. A justification dag is not going to be the same as the blockchain dag. At the end of the day, the block that you’ve got is not the entire dag. It’s a piece of the dag. That’s how all the pieces fit together.
Let me finish the point about punishing. The simplest form of punishment is social ostracism. When you send me a block that doesn’t match the grammar, I drop it on the floor. If your dog is misbehaving, you ignore it. Dogs typically don’t like to be ignored so they start behaving. It’s that kind of behavior modification.
You can go a little further because even then, it can still be a little bit annoying to have the dog keep jumping up on you. You’re ignoring the dog, it’s jumping, you’re ignoring the dog, it’s jumping, and then eventually the dog figures out that as long as it’s jumping up on you, you’re going to ignore it. So it modifies its behavior. During that time, you’ve used up some communication bandwidth.
You could consider additional behavior modification. In the case of our protocol that could correspond to slashing, if they’re sending an invalid block, you can slash. The trick is to disambiguate whether you want to do what we call slashing to the bone, where the entire stake is eradicated, or you do partial slashing. Death by a thousand cuts.
In this particular case, I have a hunch that we should use this kind of approach to deal with network partition and rejoining. You could modify the conditions so that if a validator could provably show—this part is a little mysterious and tricky for me—but if they could show, look, I’ve only been able to see messages from these validators in the last little while. I’m still respecting the rule, but my total stake is of a reduced pool of validators. Then you could do partial slashing in case they’re lying. When the network gets rejoined, they get the other messages and the partial slashing can be given back to them.
While the network is partitioned, they’re still making progress, they’re not knocked out of the game. If it turns out that everything was verifiably the case, the partial slashing is returned to them. This is a much more subtle protocol. There are lots of edge conditions that have to be checked. But the idea is to use it to deal with network partitioning and then rejoining and reconciliation.
Christian: Do we have a way to represent an agent’s history of observation?
Greg: It’s a little tricky because they can lie. There are some things that we would have to enforce. That’s why you want to do a partial slash. If they’re lying and someone can later prove that they’re lying, then the partial ash slash remains. Otherwise, if there’s no evidence of lying, then they could be made whole again.
Christian: There’s no way to make their past communications transparent to the network?
Greg: It’s not about their communications. It’s about what was sent to them. You can’t prove that they got it. Well, there might be ways to prove it, but that’s a trickier thing. One step at a time.
This is what makes me excited about RChain. These are lovely problems. They’re beautiful problems. They’re clearly related to not just computer communication but other kinds of agency. We have a really powerful theory that answers the questions and directly solves these network equilibria problems. They are common-sense arguments. It’s the kind of thing that you could walk into a high school and do some basic setup of the thing and expect bright high school kids to get it.