* [bitcoin-dev] A Better MMR Definition
From: Peter Todd @ 2017-02-23  1:15 UTC
To: bitcoin-dev

Reposting something that came up recently in a private discussion with some
academics:

Concretely, let's define a prunable MMR with the following grammar. This
definition is an improvement on what's in python-proofmarshal by committing
to the number of items in the tree implicitly; an obvious max-log2(n)-sized
proof-of-tree-size can be obtained by following the right-most nodes:

    Maybe(T) := UNPRUNED <T> | PRUNED <Commitment(T)>

    FullNode(0) := <Value>
    FullNode(n) := <Maybe(FullNode(n-1))> <Maybe(FullNode(n-1))>

    PartialNode(0) := SOME <FullNode(0)> | NONE
    PartialNode(n) := <Maybe(FullNode(n-1))> <Maybe(PartialNode(n-1))>

    MMR := FULL <n> <FullNode(n)> | PARTIAL <n> <PartialNode(n)>

Basically we define it in four parts. First we define Maybe(T) to represent
pruned and unpruned (hash-only) data. Second we define full nodes within
2^n-sized trees. Third we define partial nodes. And finally we define the
MMR itself as being either a full or partial node.

First of all, with pruning we can define a rule that if any operation (other
than checking commitment hashes) attempts to access pruned data, it should
immediately fail. In particular, no operation should be able to determine
whether data is or isn't pruned. Equally, note how an implementation can
keep track of what data was accessed during any given operation and prune
the rest, which means a proof is just the parts of the data structure
accessed during one or more operations.

With that, notice how proving the soundness of the proofs becomes trivial:
if validation is deterministic, it is obviously impossible to construct two
different proofs that prove contradictory statements, because a proof is
simply part of the data structure itself. Contradiction would imply that the
two proofs are different, but that's easily rejected by simply checking the
hash of the data.

--
https://petertodd.org 'peter'[:-1]@petertodd.org
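For concreteness, a minimal Python sketch of the grammar above (the names
and the toy commitment function are illustrative assumptions, not
python-proofmarshal's actual API); PartialNode is omitted but follows the
same pattern:

    import hashlib

    def H(*parts):
        # Toy commitment function standing in for the real scheme.
        return hashlib.sha256(b''.join(parts)).digest()

    class PrunedError(Exception):
        pass

    class Pruned:
        """PRUNED <Commitment(T)>: only the commitment survives pruning."""
        def __init__(self, commitment):
            self.commitment = commitment

    def commitment(node):
        # Checking a commitment hash is the one operation allowed on pruned data.
        return node.commitment if isinstance(node, Pruned) else node.commit()

    def unwrap(node):
        # Any other attempt to access pruned data must immediately fail.
        if isinstance(node, Pruned):
            raise PrunedError('operation touched pruned data')
        return node

    class FullNode:
        """FullNode(0) holds a value; FullNode(n) has two Maybe(FullNode(n-1)) children."""
        def __init__(self, n, value=None, left=None, right=None):
            self.n, self.value, self.left, self.right = n, value, left, right

        def commit(self):
            if self.n == 0:
                return H(b'leaf', self.value)
            return H(b'full', commitment(self.left), commitment(self.right))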
* Re: [bitcoin-dev] A Better MMR Definition
From: Bram Cohen @ 2017-02-23  3:07 UTC
To: Peter Todd, Bitcoin Protocol Discussion

On Wed, Feb 22, 2017 at 5:15 PM, Peter Todd via bitcoin-dev
<bitcoin-dev@lists.linuxfoundation.org> wrote:

> With that, notice how proving the soundness of the proofs becomes trivial:
> if validation is deterministic, it is obviously impossible to construct
> two different proofs that prove contradictory statements, because a proof
> is simply part of the data structure itself. Contradiction would imply
> that the two proofs are different, but that's easily rejected by simply
> checking the hash of the data.

My code works this way. Proofs are serializations of a subset of the tree,
and to validate a proof you ask a single function whether a particular value
is included in that tree subset, and it answers yes or no, so obviously it's
impossible for a single value to both validate and not validate. The proof
code was quite terrifying before I made this change (which I did on your
suggestion), and it's much cleaner and simpler now. It also in principle
supports compact proofs of multiple inclusions and exclusions in the same
serialization of a subset of the tree, because the upper branches won't have
to be repeated. I haven't written code for generating those, but the
validation code will happily accept them.

I'm not sure what you mean by MMRs though. Are you talking about MMRs where
each mountain is a set of diffs to the old things that are periodically
consolidated? Or do later mountains refer to internals of earlier ones? Or
do they have 'maybe' values which mean that the earlier mountain should be
referred to? Are these patricia tries or something flatter and more fixed
depth?

My code doesn't keep track of tree size, by the way. It would be trivial to
add that functionality to the library, but including it in the hashing
creates complexity and doesn't seem to have any benefit over sending that
data in a side channel.
* Re: [bitcoin-dev] A Better MMR Definition
From: Peter Todd @ 2017-02-23  7:41 UTC
To: Bram Cohen; +Cc: Bitcoin Protocol Discussion

On Wed, Feb 22, 2017 at 07:07:08PM -0800, Bram Cohen wrote:
> My code works this way. Proofs are serializations of a subset of the tree,
> and to validate a proof you ask a single function whether a particular
> value is included in that tree subset, and it answers yes or no, so
> obviously it's impossible for a single value to both validate and not
> validate.
> [snip]

That's an improvement, but I think we can do even better if we think of
missing pruned data as analogous to virtual memory: pruned data is the same
as a page that has been swapped to disk, with the magical property that
hashing allows us to swap it back in from an untrusted source.

Thus a proof should actually be whatever data we expect our counterparty to
have flushed, ranging from none at all, to 100% (modulo a root hash). An
implementation should then do operations as normal, using parts of the proof
on an as-needed basis where pruned data is encountered.

Thus if you have a key-value map and do a get() operation, you'd expect the
proof to *not* be what the get operates on, but rather to be a *context*
argument to the get() operation. The other way around is actually an example
of doing computations on untrusted data, and bad API design!

> I'm not sure what you mean by MMRs though. Are you talking about MMRs
> where each mountain is a set of diffs to the old things that are
> periodically consolidated? Or do later mountains refer to internals of
> earlier ones? Or do they have 'maybe' values which mean that the earlier
> mountain should be referred to? Are these patricia tries or something
> flatter and more fixed depth?

I'm talking about these MMRs:

https://github.com/proofchains/python-proofmarshal/blob/master/proofmarshal/mmr.py

Notably I'm talking about an insertion-ordered list, indexed by position,
that supports append and update operations, but *not* insertions; this is
different from what you've recently published re: UTXO commitments. That's a
full key-value map, something MMRs deliberately are not. Draw out an MMR
based on the formal definition you're replying to and you'll see the new
structure.

> My code doesn't keep track of tree size, by the way. It would be trivial
> to add that functionality to the library, but including it in the hashing
> creates complexity and doesn't seem to have any benefit over sending that
> data in a side channel.

Like I say above, you're solving a different problem than MMRs solve.

--
https://petertodd.org 'peter'[:-1]@petertodd.org
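A sketch of the API shape described above, reusing Pruned from the earlier
grammar sketch (deserialize and position_bits are hypothetical helpers): the
proof is an untrusted context that supplies pruned data on demand, rather
than the thing get() operates on:

    class ProofContext:
        """Untrusted side channel: maps commitment -> serialized node data."""
        def __init__(self, nodes_by_commitment):
            self.nodes = nodes_by_commitment

        def swap_in(self, pruned):
            # Like paging in from disk: hashing lets us accept untrusted data.
            node = deserialize(self.nodes[pruned.commitment])
            assert node.commit() == pruned.commitment
            return node

    def get(mmr, position, ctx):
        """Walk to `position`, swapping in pruned nodes from ctx as needed."""
        node = mmr.root
        for bit in position_bits(position, mmr.height):
            if isinstance(node, Pruned):
                node = ctx.swap_in(node)
            node = node.right if bit else node.left
        return ctx.swap_in(node) if isinstance(node, Pruned) else node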
* Re: [bitcoin-dev] A Better MMR Definition
From: Chris Priest @ 2017-02-23 17:53 UTC
To: Peter Todd, Bitcoin Protocol Discussion

On 2/22/17, Peter Todd via bitcoin-dev
<bitcoin-dev@lists.linuxfoundation.org> wrote:
> Reposting something that came up recently in a private discussion with
> some academics:
>
> Concretely, let's define a prunable MMR with the following grammar.
> [snip]

What problem does this try to solve, and what does it have to do with
bitcoin?
* Re: [bitcoin-dev] A Better MMR Definition
From: Peter Todd @ 2017-02-23 18:19 UTC
To: Chris Priest; +Cc: Bitcoin Protocol Discussion

On Thu, Feb 23, 2017 at 09:53:58AM -0800, Chris Priest wrote:
> What problem does this try to solve, and what does it have to do with
> bitcoin?

See the discussion on TXO commitments for how MMRs could be used; a better
MMR makes for a better TXO commitment.

--
https://petertodd.org 'peter'[:-1]@petertodd.org
* Re: [bitcoin-dev] A Better MMR Definition
From: G. Andrew Stone @ 2017-02-23 18:28 UTC
To: Bitcoin Protocol Discussion, Peter Todd

Can an insertion-ordered MMR allow an efficient nonexistence proof?

On Feb 23, 2017 1:20 PM, "Peter Todd via bitcoin-dev"
<bitcoin-dev@lists.linuxfoundation.org> wrote:
> See the discussion on TXO commitments for how MMRs could be used; a better
> MMR makes for a better TXO commitment.
* Re: [bitcoin-dev] A Better MMR Definition
From: Peter Todd @ 2017-02-23 18:31 UTC
To: G. Andrew Stone; +Cc: Bitcoin Protocol Discussion

On Thu, Feb 23, 2017 at 01:28:18PM -0500, G. Andrew Stone wrote:
> Can an insertion-ordered MMR allow an efficient nonexistence proof?

Why do you want a non-existence proof? It supports an efficient *spentness*
proof, which is sufficient for what we need in Bitcoin, and much more
scalable.

--
https://petertodd.org 'peter'[:-1]@petertodd.org
* Re: [bitcoin-dev] A Better MMR Definition
From: Bram Cohen @ 2017-02-23 23:13 UTC
To: Chris Priest, Bitcoin Protocol Discussion

On Thu, Feb 23, 2017 at 9:53 AM, Chris Priest via bitcoin-dev
<bitcoin-dev@lists.linuxfoundation.org> wrote:
> What problem does this try to solve, and what does it have to do with
> bitcoin?

I can't speak to MMRs (they look a bit redundant with the actual blockchain
history to my eye), but circling back to utxo commitments, the benefit is
that they enable actual proofs of non-fraud: you can prove the validity of a
block based on just the previous block (and maybe some previous headers
because of mining rewards), and you can prove to a light node that a utxo
hasn't been spent yet.

A major factor standing in the way of getting utxo commitments into blocks
is performance. The txo set is of course vastly larger and more unwieldy. If
you make the utxo commitments trail by a small fixed number of blocks
(between 2 and 5), their latency problems shouldn't be a big deal as long as
the overall performance is good enough. My thesis is that with appropriate
format and implementation tricks it's possible to get performance good
enough to no longer be a gating factor to deployment.

Disappointingly there hasn't been any feedback about my implementation, just
discussion about merkle sets generally.
* Re: [bitcoin-dev] A Better MMR Definition
From: Peter Todd @ 2017-02-23 23:51 UTC
To: Bram Cohen; +Cc: Bitcoin Protocol Discussion

On Thu, Feb 23, 2017 at 03:13:43PM -0800, Bram Cohen wrote:
> I can't speak to MMRs (they look a bit redundant with the actual
> blockchain history to my eye), but circling back to utxo commitments, the
> benefit is

In what way do you see MMRs as redundant? Remember that with UTXO
commitments, because access patterns are uniform, you'll over time have a
lot more "redundancy" in the form of lost coins evenly spread out across the
whole keyspace.

> that they enable actual proofs of non-fraud: you can prove the validity of
> a block based on just the previous block (and maybe some previous headers
> because of mining rewards), and you can prove to a light node that a utxo
> hasn't been spent yet.
>
> A major factor standing in the way of getting utxo commitments into blocks
> is performance. The txo set is of course vastly larger and more unwieldy.

That statement is incorrect with pruning: you can maintain a commitment to
the TXO set without actually storing the entire TXO set, because you don't
need to store anything for nodes that have already been spent.

Concretely, this can be done with nothing more than adding a FullySpent node
type to the MMR definition I published earlier, with the rule that only a
left or right child of an inner node may be a FullySpent node, not both; if
both sides are spent, the inner node itself becomes FullySpent. Equally, I
think you can re-use the Empty node for this, but I need to think a little
about the implications re: partial inner nodes.

Regardless, with a generalized commitment scheme, the
serialization/commitment to an Empty node is simply '0', and the encoding of
an unspent txout surrounded by spent txouts will be similar in size to a
position integer followed by the txout...

A subtlety of this construction is that you can only prove that a specific
txout # is unspent, but that's actually sufficient, as you can also prove
what # a txout txid corresponds to with a previous version of the MMR.

> If you make the utxo commitments trail by a small fixed number of blocks
> (between 2 and 5), their latency problems shouldn't be a big deal as long
> as the overall performance is good enough.
>
> Disappointingly there hasn't been any feedback about my implementation,
> just discussion about merkle sets generally.

Well, I think at this point there's still discussion over whether or not a
UTXO set commitment is the right approach to begin with; if it's not, your
implementation isn't relevant.

--
https://petertodd.org 'peter'[:-1]@petertodd.org
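A sketch of that FullySpent collapse rule, building on the earlier grammar
sketch (the all-zeros commitment and the helper names are assumptions):

    class FullySpent:
        """Entire subtree spent: only a marker is stored, nothing is hashed."""
        def commit(self):
            return b'\x00' * 32   # assumed '0' commitment for spent subtrees

    def mark_spent(node, bits):
        """Mark the leaf addressed by `bits` spent, collapsing where possible."""
        if not bits:              # reached the leaf itself
            return FullySpent()
        child = 'right' if bits[0] else 'left'
        setattr(node, child, mark_spent(unwrap(getattr(node, child)), bits[1:]))
        # Only one child may be FullySpent; if both are, the parent collapses.
        if isinstance(node.left, FullySpent) and isinstance(node.right, FullySpent):
            return FullySpent()
        return node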
* Re: [bitcoin-dev] A Better MMR Definition
From: Bram Cohen @ 2017-02-24  0:49 UTC
To: Peter Todd; +Cc: Bitcoin Protocol Discussion

On Thu, Feb 23, 2017 at 3:51 PM, Peter Todd <pete@petertodd.org> wrote:
> In what way do you see MMRs as redundant?

You can readily prove something is in the TXO or STXO set using the actual
blockchain, and the proofs will be nice and compact because even light nodes
are expected to already have all the historical headers.

What you can't do with MMRs or the blockchain is make a compact proof that
something is still in the utxo set, which is the whole point of utxo
commitments.

It's totally reasonable for full nodes to independently update and
recalculate the utxo set as part of their validation process. The same can't
be done for a balanced version of the txo set because it's too big. Relying
on proofs as a crutch for using the full txo set would badly exacerbate the
already extant problem of miners doing spv mining, and increase the
bandwidth a fully validating node had to use by a multiple.

This whole conversation is badly sidetracked. If people have comments on my
merkle set I'd like to engage further with them, but mmrs need to be argued
independently on their own merits before being used as a counterpoint to
utxo commitments.
* Re: [bitcoin-dev] A Better MMR Definition
From: Peter Todd @ 2017-02-24  1:09 UTC
To: Bram Cohen; +Cc: Bitcoin Protocol Discussion

On Thu, Feb 23, 2017 at 04:49:01PM -0800, Bram Cohen wrote:
> What you can't do with MMRs or the blockchain is make a compact proof that
> something is still in the utxo set, which is the whole point of utxo
> commitments.

I think you've misunderstood what TXO commitments are. From my article:

    "A merkle tree committing to the state of all transaction outputs, both
    spent and unspent, can provide a method of compactly proving the current
    state of an output."

    -https://petertodd.org/2016/delayed-txo-commitments#txo-commitments

I'm proposing that we commit to not just the set of transaction outputs, but
also the current *state* of those outputs, with the same commitment
structure. Concretely, each leaf node in the TXO commitment tree needs to
commit to - at minimum - the outpoint (txid:n) and spent/unspent status
(possibly structurally, as mentioned elsewhere in this thread). It's
probably also valuable to commit to the scriptPubKey and nValue as well,
though technically that's redundant, as the txid already commits to them
(there are some implementation options here).

> It's totally reasonable for full nodes to independently update and
> recalculate the utxo set as part of their validation process. The same
> can't be done for a balanced version of the txo set because it's too big.

Why would you commit to a balanced version of the TXO set? I'm proposing
committing to an insertion-ordered list, indexed by txout #.

> Relying on proofs as a crutch for using the full txo set would badly
> exacerbate the already extant problem of miners doing spv mining, and
> increase the bandwidth a fully validating node had to use by a multiple.
>
> This whole conversation is badly sidetracked. If people have comments on
> my merkle set I'd like to engage further with them, but mmrs need to be
> argued independently on their own merits before being used as a
> counterpoint to utxo commitments.

Hmm? That's exactly what I'm doing. Also, as per the above, I think you've
misunderstood what my TXO commitment proposal is.

--
https://petertodd.org 'peter'[:-1]@petertodd.org
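A sketch of what such a leaf commitment could look like (the exact field
layout here is an assumption for illustration, not a spec):

    import hashlib
    import struct

    def txo_leaf_commitment(txid, n, spent, script_pubkey=b'', n_value=0):
        """Commit to a TXO leaf: the outpoint (txid:n), its spent/unspent
        state, and (redundantly, since txid already commits to them) the
        scriptPubKey and nValue."""
        return hashlib.sha256(
            b'txo-leaf'
            + txid
            + struct.pack('<I', n)
            + (b'\x01' if spent else b'\x00')
            + struct.pack('<Q', n_value)
            + struct.pack('<I', len(script_pubkey)) + script_pubkey
        ).digest()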
* Re: [bitcoin-dev] A Better MMR Definition
From: Bram Cohen @ 2017-02-24  2:50 UTC
To: Peter Todd; +Cc: Bitcoin Protocol Discussion

On Thu, Feb 23, 2017 at 5:09 PM, Peter Todd <pete@petertodd.org> wrote:
> I think you've misunderstood what TXO commitments are. From my article:
>
> "A merkle tree committing to the state of all transaction outputs, both
> spent and unspent, can provide a method of compactly proving the current
> state of an output."
> -https://petertodd.org/2016/delayed-txo-commitments#txo-commitments

The proposal on that page is of a tree which does require random-access
updates, it just positions entries in the order they happened to be added
instead of sorting by their hash. Once you start updating it to indicate
spent status, all the exact same issues of TXO size and cache coherence on
updates show up again, but now you're using a more complex bespoke data
structure instead of a basic fundamental one.
* Re: [bitcoin-dev] A Better MMR Definition
From: Peter Todd @ 2017-02-24  2:58 UTC
To: Bram Cohen; +Cc: Bitcoin Protocol Discussion

On Thu, Feb 23, 2017 at 06:50:10PM -0800, Bram Cohen wrote:
> The proposal on that page is of a tree which does require random-access
> updates, it just positions entries in the order they happened to be added
> instead of sorting by their hash. Once you start updating it to indicate
> spent status, all the exact same issues of TXO size and cache coherence on
> updates show up again, but now you're using a more complex bespoke data
> structure instead of a basic fundamental one.

Sorry, but I was replying to your statement:

> What you can't do with MMRs or the blockchain is make a compact proof that
> something is still in the utxo set, which is the whole point of utxo
> commitments.

So to be clear, do you agree or disagree with me that you *can* extract a
compact proof from an MMR that a given output is unspent?

I just want to make sure we're on the same page here before we discuss
performance characteristics.

--
https://petertodd.org 'peter'[:-1]@petertodd.org
* Re: [bitcoin-dev] A Better MMR Definition
From: Bram Cohen @ 2017-02-24  3:02 UTC
To: Peter Todd; +Cc: Bitcoin Protocol Discussion

On Thu, Feb 23, 2017 at 6:58 PM, Peter Todd <pete@petertodd.org> wrote:
> So to be clear, do you agree or disagree with me that you *can* extract a
> compact proof from an MMR that a given output is unspent?

After wading through your logic on how updates are done, I agree that that
can be done. But apples-to-apples compact proofs can also be done in a utxo
commitment, and proofs of the validity of updates can be done in a utxo
commitment, so there isn't any performance advantage to all that extra
complexity.
* Re: [bitcoin-dev] A Better MMR Definition
From: Peter Todd @ 2017-02-24  3:15 UTC
To: Bram Cohen; +Cc: Bitcoin Protocol Discussion

On Thu, Feb 23, 2017 at 07:02:36PM -0800, Bram Cohen wrote:
> After wading through your logic on how updates are done, I agree that that
> can be done. But apples-to-apples compact proofs can also be done in a
> utxo commitment, and proofs of the validity of updates can be done in a
> utxo commitment, so there isn't any performance advantage to all that
> extra complexity.

Glad we're on the same page with regard to what's possible in TXO
commitments.

Secondly, am I correct in saying your UTXO commitments scheme requires
random access? While you describe it as a "merkle set", obviously to be
merkleized it'll have to have an ordering of some kind. What do you propose
that ordering to be?

Maybe more specifically, what exact values do you propose to be in the set?

--
https://petertodd.org 'peter'[:-1]@petertodd.org
* Re: [bitcoin-dev] A Better MMR Definition
From: Bram Cohen @ 2017-02-24  3:32 UTC
To: Peter Todd; +Cc: Bitcoin Protocol Discussion

On Thu, Feb 23, 2017 at 7:15 PM, Peter Todd <pete@petertodd.org> wrote:
> Glad we're on the same page with regard to what's possible in TXO
> commitments.
>
> Secondly, am I correct in saying your UTXO commitments scheme requires
> random access? While you describe it as a "merkle set", obviously to be
> merkleized it'll have to have an ordering of some kind. What do you
> propose that ordering to be?

The ordering is by the bits in the hash. Technically it's a patricia trie;
I'm using 'merkle tree' to refer to basically anything with a hash root.

> Maybe more specifically, what exact values do you propose to be in the
> set?

That is unspecified in the implementation; it just takes a 256-bit value
which is presumably a hash of something. The intention is to nail down a
simple format, demonstrate good performance, and leave those semantics to a
higher layer. The simplest thing would be to hash together the txid and
output number.
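A sketch of that simplest keying (serialization details are assumptions):
hash the outpoint to a 256-bit value, and let the bits of the hash pick the
path through the trie:

    import hashlib
    import struct

    def merkle_set_key(txid: bytes, n: int) -> bytes:
        """256-bit trie key: hash of the outpoint (txid:n)."""
        return hashlib.sha256(txid + struct.pack('<I', n)).digest()

    def path_bits(key: bytes):
        """Yield the key's bits MSB-first; bit i picks the branch at depth i."""
        for byte in key:
            for shift in range(7, -1, -1):
                yield (byte >> shift) & 1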
* Re: [bitcoin-dev] A Better MMR Definition
From: Peter Todd @ 2017-02-24  4:36 UTC
To: Bram Cohen; +Cc: Bitcoin Protocol Discussion

On Thu, Feb 23, 2017 at 07:32:43PM -0800, Bram Cohen wrote:
> The ordering is by the bits in the hash. Technically it's a patricia trie;
> I'm using 'merkle tree' to refer to basically anything with a hash root.

The hash of what? The values in the set?

> That is unspecified in the implementation; it just takes a 256-bit value
> which is presumably a hash of something. The intention is to nail down a
> simple format, demonstrate good performance, and leave those semantics to
> a higher layer. The simplest thing would be to hash together the txid and
> output number.

Ok, so let's assume the values in the set are the unspent outpoints.

Since we're ordering by the hash of the values in the set, outpoints will be
distributed uniformly in the set, and thus the access pattern of data in the
set is uniform.

Now let's fast-forward 10 years. For the sake of argument, assume that for
every 1 UTXO in the set that corresponds to funds in someone's wallet that
are likely to be spent, there are 2^12 = 4096 UTXOs that have been
permanently lost (and/or created in spam attacks) and thus will never be
spent.

Since lost UTXOs are *also* uniformly distributed, if I'm processing a new
block that spends 2^12 = 4096 UTXOs, on average for each UTXO spent I'll
have to update log2(4096) = 12 more digests than I would have had those
"dead" UTXOs not existed.

Concretely, imagine our UTXO set had just 8 values in it, and we were
updating two of them:

                    #
                   / \
                  /   \
                 /     \
                /       \
               /         \
              #           #
             / \         / \
            /   \       /   \
           #     .     .     #
          / \   / \   / \   / \
         .   X .   . .   . X   .

To mark two coins as spent, we've had to update 5 inner nodes.

Now let's look at what happens in an insertion-ordered TXO commitment
scheme. For the sake of argument, let's assume the best possible case, where
every UTXO spent in that same block was recently created. Since the UTXOs
are recently created, chances are almost every single one of those "dead"
UTXOs will have been created in the past. Thus, since this is an
insertion-ordered data structure, those UTXOs exist in an older part of the
data structure that our new block doesn't need to modify at all.

Concretely, again let's imagine a TXO commitment with 8 values in it, and
two of them being spent:

                    #
                   / \
                  /   \
                 /     \
                /       \
               /         \
              .           #
             / \         / \
            /   \       /   \
           .     .     .     #
          / \   / \   / \   / \
         .   . .   . .   . X   X

To mark two coins as spent, we've only had to update 3 inner nodes; while
our tree is higher with those lost coins, those extra inner nodes are
amortised across all the coins we have to update.

The situation gets even better when we look at the *new* UTXOs that our
block creates. Suppose our UTXO set has size n and our block adds m new
outputs. To add a single new coin, we have to update log2(n) inner nodes. We
do get to amortise this a bit at the top levels in the tree, but even if we
assume the amortisation is totally free, we're updating at least log2(n) -
log2(m) inner nodes "under" the amortised nodes at the top of the tree for
*each* new node.

Meanwhile with an insertion-ordered TXO commitment, each new UTXO added to
the data set goes in the same place: the end. So almost none of the existing
data needs to be touched to add the new UTXOs. Equally, the hashing required
for the new UTXOs can be done in an incremental fashion that's very L1/L2
cache friendly.

tl;dr: Precisely because access patterns in TXO commitments are *not*
uniform, I think we'll find that from a L1/L2/etc cache perspective alone,
TXO commitments will result in better performance than UTXO commitments.

Now it is true that Bitcoin's current design means we'll need a map of
confirmed outpoints to TXO insertion order indexes. But it's not
particularly hard to add that "metadata" to transactions on the P2P layer in
the same way that segwit added witnesses to transactions without modifying
how txids were calculated; if you only connect to peers who provide you with
TXO index information in blocks and transactions, you don't need to keep
that map yourself.

Finally, note how this makes transactions *smaller* in many circumstances:
it's just an 8-byte max index rather than a 40-byte outpoint.

--
https://petertodd.org 'peter'[:-1]@petertodd.org
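A toy calculation of the effect described above (a sketch: it counts the
distinct inner nodes touched when marking k uniformly random leaves spent,
versus k adjacent recently-appended leaves, in a tree of 2^h leaves):

    import random

    def touched_inner_nodes(height, leaf_positions):
        """Count distinct inner nodes on the paths from the leaves to the root."""
        touched = set()
        for pos in leaf_positions:
            node = (1 << height) + pos        # heap-style index of the leaf
            while node > 1:
                node >>= 1                    # step up to the parent
                touched.add(node)
        return len(touched)

    h, k = 20, 4096                           # ~1M leaves, 4096 spends
    uniform = touched_inner_nodes(h, random.sample(range(1 << h), k))
    clustered = touched_inner_nodes(h, range((1 << h) - k, 1 << h))
    print(uniform, clustered)  # uniform spends touch far more inner nodes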
* Re: [bitcoin-dev] A Better MMR Definition
From: Bram Cohen @ 2017-02-24 22:20 UTC
To: Peter Todd; +Cc: Bitcoin Protocol Discussion

So your idea is to cluster entries by entry time because newer things are
more likely to leave, and updating multiple things near each other is
cheaper?

That can be done with my tool. Instead of using hashes for the values being
stored, you use position entries. The first entry gets a value of all zeros,
the next one a one followed by all zeros, then the next two correspond to
the first two with the second bit flipped to one, then the next four to the
first four with the third bit flipped to one, etc. It probably performs a
little better to do it two bits at a time instead of one, so that the
entries are 00, 01, 10, 11, 0001, 0010, 0011, 0101, 0110, 0111, 1001, etc.
If you were to really use this you'd probably want to add some optimizations
to use the fact that the terminals fit in 64 bits instead of 256, but it
mostly works unchanged, and gets whatever benefits there are to this
clustering, plus the high-performance implementation tricks I've built,
which I keep complaining that nobody's giving feedback on.

I'm not sold on this being a win: the empirical access patterns are unknown;
it requires an extra cache miss per lookup to find the entry number; it may
be that everything is optimized well enough without it for there to be no
meaningful gains; and it's a bunch of extra complexity. What should be done
is that a plain vanilla UTXO set solution is optimized as well as it can be
first, and then the insertion-ordering trick is tried as an optimization to
see if it's an improvement. Without that baseline there's no meaningful
basis for comparison, and I'm quite confident that a naive implementation
which just allocates individual nodes will underperform the thing I've come
up with, even without adding optimizations related to fitting in 64 bits.

On Thu, Feb 23, 2017 at 8:36 PM, Peter Todd <pete@petertodd.org> wrote:

> tl;dr: Precisely because access patterns in TXO commitments are *not*
> uniform, I think we'll find that from a L1/L2/etc cache perspective alone,
> TXO commitments will result in better performance than UTXO commitments.
> [snip]
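One way to read Bram's one-bit-at-a-time numbering (all zeros, then 100...0,
then 010...0 and 110...0, then 001...0 and so on) is as the bit-reversed
insertion index padded to the key width; a sketch, assuming that reading:

    def position_key(index: int, width: int = 256) -> int:
        """Map insertion index -> trie key by reversing its bits within `width`.

        index 0 -> 000..0, 1 -> 100..0, 2 -> 010..0, 3 -> 110..0, 4 -> 001..0,
        matching the enumeration above: each new power-of-two batch repeats
        the previous keys with the next bit flipped to one."""
        key = 0
        for i in range(width):
            if index >> i & 1:
                key |= 1 << (width - 1 - i)
        return key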
* Re: [bitcoin-dev] A Better MMR Definition
From: Peter Todd @ 2017-02-25  4:12 UTC
To: Bram Cohen; +Cc: Bitcoin Protocol Discussion

On Fri, Feb 24, 2017 at 02:20:19PM -0800, Bram Cohen wrote:
> So your idea is to cluster entries by entry time because newer things are
> more likely to leave, and updating multiple things near each other is
> cheaper?

Yes, exactly.

> That can be done with my tool. Instead of using hashes for the values
> being stored, you use position entries. The first entry gets a value of
> all zeros, the next one a one followed by all zeros, then the next two
> correspond to the first two with the second bit flipped to one, then the
> next four to the first four with the third bit flipped to one, etc.
> [snip]

So to be clear, what you're proposing there is to use the insertion order as
the index - once you go that far you've almost entirely re-invented my
proposal!

In fact, when I was working on my proofchains/proofmarshal libraries I put
some thought into whether or not I could leave out the MMR merkleized list
implementation and use only the key-value map I also wrote. I decided to
include both, as they aren't quite the same data structure: using a list for
a list has advantages.

> benefits there are to this clustering, plus the high-performance
> implementation tricks I've built, which I keep complaining that nobody's
> giving feedback on.

Your merkle-set implementation is 1500 lines of densely written Python with
almost no comments, and fewer than 100 lines of (also uncommented) tests. By
comparison, my Python MMR implementation is 300 lines of very readable
Python with lots of comments, a 200-line explanation at the top, and 200
lines of (commented) tests. Yet no-one is taking the (still considerable)
effort to understand and comment on my implementation. :)

Fact is, what you've written is really daunting to review, and given it's
not in the final language anyway, it's unclear what basis to review it on
anyway. I suspect you'd get more feedback if the codebase was better
commented, in a production language, and you had actual real-world
benchmarks and performance figures.

In particular, while at the top of merkle_set.py you have a list of
advantages, and a bunch of TODO's, you don't explain *why* the code has any
of these advantages. To figure that out, I'd have to read and understand
1500 lines of densely written Python. Without a human-readable pitch, not
many people are going to do that, myself included.

> I'm not sold on this being a win: the empirical access patterns are
> unknown;

Lost coins alone guarantee that access patterns will be biased towards new
coins being more likely to be spent. That basis alone is sufficient to
justify an insertion-ordered data structure. Additionally, people have done
graphs of the average age of UTXOs when spent, and that data clearly shows
that newer coins are more likely to be spent than older coins.

> it requires an extra cache miss per lookup to find the entry number;

Like I mentioned in the email you're replying to, that extra lookup can be
easily avoided with a change to how transactions/blocks are serialized; if
all your peers support TXO commitments you can even discard the lookup
database entirely, as it's only a backwards-compatibility measure.

> it may be that everything is optimized well enough without it for there to
> be no meaningful gains; and it's a bunch of extra complexity.

Optimization is itself extra complexity. If your data structure has worse
inherent performance and you have to make up the difference with a highly
optimized implementation, that's likely to lead to more overall complexity
than using a data structure with better inherent performance. Your current
merkle-set implementation definitely _is_ very complex.

An apples-to-apples comparison is with my merkleized key:value tree(1), also
a patricia tree, which like the MMR is only about 300 lines of
well-commented and straightforward code.

1) https://github.com/proofchains/python-proofmarshal/blob/master/proofmarshal/merbinnertree.py

> What should be done is that a plain vanilla UTXO set solution is optimized
> as well as it can be first, and then the insertion-ordering trick is tried
> as an optimization to see if it's an improvement. Without that baseline
> there's no meaningful basis for comparison, and I'm quite confident that a
> naive implementation which just allocates individual nodes will
> underperform the thing I've come up with, even without adding
> optimizations related to fitting in 64 bits.

To be clear, "insertion ordering" isn't a simple trick; it's a fundamental
change to what the data structure is. Once you do that, you're talking about
my proposal.

--
https://petertodd.org 'peter'[:-1]@petertodd.org
* Re: [bitcoin-dev] A Better MMR Definition
From: Bram Cohen @ 2017-02-25  6:23 UTC
To: Peter Todd; +Cc: Bitcoin Protocol Discussion

On Fri, Feb 24, 2017 at 8:12 PM, Peter Todd <pete@petertodd.org> wrote:
> So to be clear, what you're proposing there is to use the insertion order
> as the index - once you go that far you've almost entirely re-invented my
> proposal!

I'm not 'proposing' this, I'm saying it could be done simply, but I'm
skeptical of the utility. Probably the most compelling argument for it is
that the insertion-indexed values are much smaller, so they can be compacted
down a lot, resulting in using less memory and more locality and fewer
hashes; but your implementation doesn't take advantage of that.

> Your merkle-set implementation is 1500 lines of densely written Python

The reference implementation which is included in those 1500 lines is less
than 300 lines and fairly straightforward. The non-reference implementation
always behaves semantically identically to the reference implementation; it
just does so faster and using less memory.

> with almost no comments,

The comments at the top explain both the proof format and the in-memory data
structures very precisely. The whole codebase was reviewed by a coworker of
mine, and comments were added explaining the subtleties which tripped him
up.

> and fewer than 100 lines of (also uncommented) tests.

Those tests get 98% code coverage and extensively hit not only the lines of
code but the semantic edge cases as well. The lines which aren't hit are
convenience functions and error conditions of the parsing code for when it's
passed bad data.

> By comparison, my Python MMR implementation is 300 lines of very readable
> Python with lots of comments, a 200-line explanation at the top, and 200
> lines of (commented) tests. Yet no-one is taking the (still considerable)
> effort to understand and comment on my implementation. :)

Given that maaku's Merkle prefix trees were shelved because of performance
problems despite being written in C and operating in basically the same way
as your code and my reference code, it's clear that non-optimized Python
won't be touching the bitcoin codebase any time soon.

> Fact is, what you've written is really daunting to review, and given it's
> not in the final language anyway, it's unclear what basis to review it on
> anyway.

It should be reviewed based on semantic correctness and performance.
Performance can only be accurately and convincingly determined by porting to
C and optimizing it, which mostly involves experimenting with different
values for the two passed-in magic numbers.

> I suspect you'd get more feedback if the codebase was better commented, in
> a production language, and you had actual real-world benchmarks and
> performance figures.

Porting to C should be straightforward. Several people have already
expressed interest in doing so, and it's written in intentionally C-ish
Python, resulting in some rather odd idioms, which is a big part of why you
think it looks 'dense'. A lot of that weird offset math should be much more
readable in C because it's all structs, and x.y notation can be used instead
of adding offsets.

> In particular, while at the top of merkle_set.py you have a list of
> advantages, and a bunch of TODO's, you don't explain *why* the code has
> any of these advantages. To figure that out, I'd have to read and
> understand 1500 lines of densely written Python. Without a human-readable
> pitch, not many people are going to do that, myself included.

It's all about cache coherence. When doing operations it pulls in a bunch of
things which are near each other in memory instead of jumping all over the
place. The improvements it gets should be much greater than the ones gained
from insertion ordering, although the two could be accretive.
* Re: [bitcoin-dev] A Better MMR Definition
From: G. Andrew Stone @ 2017-02-28 16:43 UTC
To: Bram Cohen, Bitcoin Protocol Discussion

I can understand how Bram's double-sha256-keyed UTXO set patricia trie
allows a client to quickly validate inputs, because the inputs of a
transaction are specified in the same manner: to verify that an input is
unspent, the client simply traverses the patricia trie.

It also makes sense that if transaction inputs were specified by a [block
height, tx index, output index] triple, we'd have a much more
size-efficient transaction format. This format would make lookup pretty
simple in Peter's pruned time-ordered TXO merkle mountain range, although
you'd have to translate the triple to an index, which means we'd at a
minimum have to keep track of the number of TXOs in each block, and then
probably do a linear search starting from the location where the block's
TXOs begin in the MMR (see the sketch after this message). (The ultimate
option, I guess, is to specify transaction inputs by a single number which
is essentially the index of the TXO in a never-actually-created
insertion-ordered TXO array...)

But since transactions' prevouts are not specified by [block height, tx
index, output index] or by TXO index, I don't understand how an
insertion-ordered TXO tree can result in efficient lookups. Can you help
me understand this?

On Sat, Feb 25, 2017 at 1:23 AM, Bram Cohen via bitcoin-dev <
bitcoin-dev@lists.linuxfoundation.org> wrote:

> On Fri, Feb 24, 2017 at 8:12 PM, Peter Todd <pete@petertodd.org> wrote:
>
>> So to be clear, what you're proposing there is to use the insertion
>> order as the index - once you go that far you've almost entirely
>> re-invented my proposal!
>
> I'm not 'proposing' this, I'm saying it could be done simply but I'm
> skeptical of the utility. Probably the most compelling argument for it
> is that the insertion-indexed values are much smaller, so they can be
> compacted down a lot, resulting in less memory use, more locality, and
> fewer hashes - but your implementation doesn't take advantage of that.
>
>> Your merkle-set implementation is 1500 lines of densely written Python
>
> The reference implementation which is included in those 1500 lines is
> less than 300 lines and fairly straightforward. The non-reference
> implementation always behaves semantically identically to the reference
> implementation; it just does so faster and using less memory.
>
>> with almost no comments,
>
> The comments at the top explain both the proof format and the in-memory
> data structures very precisely. The whole codebase was reviewed by a
> coworker of mine, and comments were added explaining the subtleties
> which tripped him up.
>
>> and less than a 100 lines of (also uncommented) tests.
>
> Those tests get 98% code coverage and extensively hit not only the
> lines of code but the semantic edge cases as well. The lines which
> aren't hit are convenience functions and error conditions of the
> parsing code for when it's passed bad data.
>
>> By comparison, my Python MMR implementation is 300 lines of very
>> readable Python with lots of comments, a 200 line explanation at the
>> top, and 200 lines of (commented) tests. Yet no-one is making the
>> (still considerable) effort to understand and comment on my
>> implementation. :)
>
> Given that maaku's Merkle prefix trees were shelved because of
> performance problems despite being written in C and operating in
> basically the same way as your code and my reference code, it's clear
> that non-optimized Python won't be touching the bitcoin codebase any
> time soon.
>
>> Fact is, what you've written is really daunting to review, and given
>> it's not in the final language anyway, it's unclear what basis to
>> review it on anyway.
>
> It should be reviewed based on semantic correctness and performance.
> Performance can only be accurately and convincingly determined by
> porting to C and optimizing it, which mostly involves experimenting
> with different values for the two passed-in magic numbers.
>
>> I suspect you'd get more feedback if the codebase was better
>> commented, in a production language, and you had actual real-world
>> benchmarks and performance figures.
>
> Porting to C should be straightforward. Several people have already
> expressed interest in doing so, and it's written in intentionally C-ish
> Python, resulting in some rather odd idioms, which is a big part of why
> you think it looks 'dense'. A lot of that weird offset math should be
> much more readable in C, because it's all structs and x.y notation can
> be used instead of adding offsets.
>
>> In particular, while at the top of merkle_set.py you have a list of
>> advantages, and a bunch of TODOs, you don't explain *why* the code has
>> any of these advantages. To figure that out, I'd have to read and
>> understand 1500 lines of densely written Python. Without a
>> human-readable pitch, not many people are going to do that, myself
>> included.
>
> It's all about cache coherence. When doing operations it pulls in a
> bunch of things which are near each other in memory instead of jumping
> all over the place. The improvements it gets should be much greater
> than the ones gained from insertion ordering, although the two could be
> additive.
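As a sketch of the triple-to-index translation described in the message
above: assume we keep a running total of TXOs created before each block,
plus each block's per-transaction output counts. All names here are
illustrative, not part of any actual proposal.

def txo_index(cumulative_txos, block_tx_output_counts, height, tx_ix, out_ix):
    """Translate [block height, tx index, output index] into a flat
    insertion-ordered TXO index."""
    base = cumulative_txos[height]                        # TXOs before block
    offset = sum(block_tx_output_counts[height][:tx_ix])  # earlier txs in block
    return base + offset + out_ix

# Example: block 2 starts at global TXO 7; its txs create 2, 3 and 1 outputs.
cumulative = {2: 7}
per_block = {2: [2, 3, 1]}
assert txo_index(cumulative, per_block, 2, 1, 2) == 11    # 7 + 2 + 2

The sum() over earlier transactions is the "linear search" mentioned
above; storing per-transaction cumulative counts instead would make it a
direct lookup.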
* Re: [bitcoin-dev] A Better MMR Definition
From: Bram Cohen @ 2017-02-28 23:10 UTC
To: G. Andrew Stone; +Cc: Bitcoin Protocol Discussion

On Tue, Feb 28, 2017 at 8:43 AM, G. Andrew Stone <g.andrew.stone@gmail.com>
wrote:

> But since transactions' prevouts are not specified by [block height, tx
> index, output index] or by TXO index, I don't understand how an
> insertion-ordered TXO tree can result in efficient lookups. Can you help
> me understand this?

You have to have a lookup table going from prevouts to txo index. Lookups
on that are relatively fast, because looking up something in a hashtable
is a single cache miss, while looking up something in a tree is
logarithmically many cache misses.

The purported benefit of insertion ordering is that, because recent things
are spent much more often than old things, there's a lot of clustering of
updates: if you update two things near each other, they share the top
branches of the update paths in the tree, resulting in less hashing and
fewer cache misses. But since everything is log scale, I suspect such
benefits are small. My guess is that insertion ordering has much larger
potential for compression, because you can cram information about lots of
things into a single leaf node when they have very small diffs from each
other. That said, those benefits are also smaller than, and additive to,
the simple implementation tricks I already implemented, which cause things
near each other in the tree to be near each other in memory.
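A minimal sketch of that lookup table, with a Python dict standing in for
what would really be an on-disk hashtable:

prevout_to_index = {}   # (txid, vout) -> insertion-ordered TXO index
next_index = 0

def add_output(txid, vout):
    global next_index
    prevout_to_index[(txid, vout)] = next_index
    next_index += 1

def lookup(txid, vout):
    # One expected probe (one cache miss in a real implementation), versus
    # logarithmically many misses to walk a tree keyed directly by prevout.
    return prevout_to_index[(txid, vout)]

After the lookup, all further work in the insertion-ordered tree is done
with the small integer index rather than the 36-byte prevout.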
* Re: [bitcoin-dev] A Better MMR Definition
From: Pieter Wuille @ 2017-02-28 23:24 UTC
To: Bitcoin Dev, Bram Cohen

On Feb 28, 2017 15:10, "Bram Cohen via bitcoin-dev" <
bitcoin-dev@lists.linuxfoundation.org> wrote:

> You have to have a lookup table going from prevouts to txo index.
> Lookups on that are relatively fast, because looking up something in a
> hashtable is a single cache miss, while looking up something in a tree
> is logarithmically many cache misses.

I'm wondering if there is some confusion here. Yes, someone needs to have
a lookup table from prevouts to TXO tree positions. But because an
insertion-ordered TXO tree does not rebalance, that table can be
maintained by wallets or service providers for just their own coins,
instead of by every full node and miner individually for everyone's coins.

In the simplest committed TXO model, full nodes simply maintain the TXO
root hash, and every transaction/block comes with a proof that its inputs
are in the TXO tree, plus the necessary information to update the root
after spending the inputs and adding the outputs.

--
Pieter
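A schematic of that stateless flow, heavily simplified; the helper names
and the spent-marker convention are invented for illustration and don't
come from any concrete proposal:

import hashlib

def h(left, right):
    return hashlib.sha256(left + right).digest()

def fold_path(leaf, path):
    """path: list of (sibling_hash, sibling_is_on_right) up to the root."""
    acc = leaf
    for sib, right in path:
        acc = h(acc, sib) if right else h(sib, acc)
    return acc

def process_spend(root, leaf, path, spent_marker=b"\x00" * 32):
    # The full node stores only `root`; the spender supplies leaf + path.
    if fold_path(leaf, path) != root:
        raise ValueError("bad TXO proof")
    # The same path suffices to recompute the root with the leaf replaced.
    return fold_path(spent_marker, path)

The key property is that verifying an input and updating the commitment
use the same merkle path, so the full node never needs the tree itself.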
* Re: [bitcoin-dev] A Better MMR Definition
From: Bram Cohen @ 2017-03-01 1:47 UTC
To: Pieter Wuille; +Cc: Bitcoin Dev

On Tue, Feb 28, 2017 at 3:24 PM, Pieter Wuille <pieter.wuille@gmail.com>
wrote:

> Yes, someone needs to have a lookup table from prevouts to TXO tree
> positions. But because an insertion-ordered TXO tree does not rebalance,
> that table can be maintained by wallets or service providers for just
> their own coins, instead of by every full node and miner individually
> for everyone's coins.

That falls apart if you want to support proofs of non-spend, which is the
point of the whole exercise.
* Re: [bitcoin-dev] A Better MMR Definition
From: Peter Todd @ 2017-03-01 1:56 UTC
To: Bram Cohen, Bitcoin Protocol Discussion

On Tue, Feb 28, 2017 at 05:47:30PM -0800, Bram Cohen via bitcoin-dev wrote:

> That falls apart if you want to support proofs of non-spend, which is
> the point of the whole exercise.

Can you explain in more detail what you mean there?

--
https://petertodd.org 'peter'[:-1]@petertodd.org
* Re: [bitcoin-dev] A Better MMR Definition
From: Peter Todd @ 2017-03-01 22:31 UTC
To: Bram Cohen; +Cc: Bitcoin Protocol Discussion

On Fri, Feb 24, 2017 at 10:23:20PM -0800, Bram Cohen wrote:

>> Your merkle-set implementation is 1500 lines of densely written Python
>
> The reference implementation which is included in those 1500 lines is
> less than 300 lines and fairly straightforward. The non-reference
> implementation always behaves semantically identically to the reference
> implementation; it just does so faster and using less memory.

Great! But do you see my point here? Even though I spent some time reading
through that code, I didn't realise you had a 300-line reference
implementation embedded within those 1500 lines. This makes it less likely
for you to get any review on either.

A better way to present your work would have been to at least explain that
at the top of the file, and perhaps even better, split the reference
implementation and optimized implementation into two separate files. If
you did this, you'd be more likely to get others to review your work.

>> with almost no comments,
>
> The comments at the top explain both the proof format and the in-memory
> data structures very precisely. The whole codebase was reviewed by a
> coworker of mine, and comments were added explaining the subtleties
> which tripped him up.

Yes, and it's good that you have those comments. But the codebase itself
could definitely use more, and adding those comments would help get more
people reviewing your work.

>> and less than a 100 lines of (also uncommented) tests.
>
> Those tests get 98% code coverage and extensively hit not only the
> lines of code but the semantic edge cases as well. The lines which
> aren't hit are convenience functions and error conditions of the
> parsing code for when it's passed bad data.

Great! But you see how, without comments, it'll take a tremendous amount
of work for an external reviewer like myself to determine what is being
tested, and what edge cases you're targeting. In fact, I'd suggest that
you test edge cases in separate unit tests that explain what edge cases
you're trying to catch (see the sketch after this message).

>> By comparison, my Python MMR implementation is 300 lines of very
>> readable Python with lots of comments, a 200 line explanation at the
>> top, and 200 lines of (commented) tests. Yet no-one is making the
>> (still considerable) effort to understand and comment on my
>> implementation. :)
>
> Given that maaku's Merkle prefix trees were shelved because of
> performance problems despite being written in C and operating in
> basically the same way as your code and my reference code, it's clear
> that non-optimized Python won't be touching the bitcoin codebase any
> time soon.

To be clear, I gave my implementation as an example of how hard it is to
get external review, not to suggest it's going to be a part of Bitcoin;
I've pointed a lot of people to it when they asked for an MMR
implementation, and I'm sure if some of those people had reviewed it
carefully they would have suggested changes. Yet they haven't, because
doing good review is a lot of work!

>> In particular, while at the top of merkle_set.py you have a list of
>> advantages, and a bunch of TODOs, you don't explain *why* the code has
>> any of these advantages. To figure that out, I'd have to read and
>> understand 1500 lines of densely written Python. Without a
>> human-readable pitch, not many people are going to do that, myself
>> included.
>
> It's all about cache coherence. When doing operations it pulls in a
> bunch of things which are near each other in memory instead of jumping
> all over the place. The improvements it gets should be much greater
> than the ones gained from insertion ordering, although the two could be
> additive.

That's good, but that paragraph should be part of your MerkleSet git repo,
preferably in the README, where reviewers will immediately find it and get
excited about reviewing your code.

--
https://petertodd.org 'peter'[:-1]@petertodd.org
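A sketch of what such self-describing edge-case tests could look like,
written against a small stand-in class (NOT the real MerkleSet interface)
so that it runs on its own:

import hashlib
import unittest

class ToySet:
    """Stand-in with a MerkleSet-like surface: add/remove/get_root."""
    def __init__(self):
        self.leaves = set()
    def add(self, leaf):
        self.leaves.add(leaf)
    def remove(self, leaf):
        self.leaves.discard(leaf)
    def get_root(self):
        acc = hashlib.sha256(b"empty").digest()
        for leaf in sorted(self.leaves):
            acc = hashlib.sha256(acc + leaf).digest()
        return acc

class TestEdgeCases(unittest.TestCase):
    def test_empty_root_is_stable(self):
        # Edge case: an empty set must hash to a fixed constant, or
        # proofs against an empty root become ambiguous.
        self.assertEqual(ToySet().get_root(), ToySet().get_root())

    def test_remove_then_readd_restores_root(self):
        # Edge case: no hidden state may survive a removal.
        s = ToySet()
        empty = s.get_root()
        s.add(b"\x01" * 32)
        full = s.get_root()
        s.remove(b"\x01" * 32)
        self.assertEqual(s.get_root(), empty)
        s.add(b"\x01" * 32)
        self.assertEqual(s.get_root(), full)

if __name__ == "__main__":
    unittest.main()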
* Re: [bitcoin-dev] A Better MMR Definition
From: Bram Cohen @ 2017-03-31 20:38 UTC
To: Peter Todd; +Cc: Bitcoin Protocol Discussion

On Wed, Mar 1, 2017 at 2:31 PM, Peter Todd <pete@petertodd.org> wrote:

> A better way to present your work would have been to at least explain
> that at the top of the file, and perhaps even better, split the
> reference implementation and optimized implementation into two separate
> files. If you did this, you'd be more likely to get others to review
> your work.

I've now added explanation to the README, reorganized the files, and added
some comments: https://github.com/bramcohen/MerkleSet

> In fact, I'd suggest that you test edge cases in separate unit tests
> that explain what edge cases you're trying to catch.

The tests work by doing a lot of exercising on pseudorandom data, an
approach which does a good job of hitting all the lines of code and edge
cases, and which requires very little updating as the implementation
changes, at the expense of the tests taking a while to run (a sketch of
this style of testing follows this message). The advantage of very custom
unit tests is that they run almost instantly, at the cost of requiring
painstaking maintenance and missing more stuff. I've come to favor the
randomized approach in my old age.

The proportion of code devoted to tests is larger than it looks at first
blush, because all the audit methods are just for testing.
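The shape of that pseudorandom exercising, as a sketch: drive a reference
and an optimized implementation through the same random add/remove
sequence and demand identical roots at every step. The two classes and
their add/remove/get_root surface are assumed here, not taken from the
actual codebase.

import random

def exercise(ReferenceSet, OptimizedSet, steps=10000, seed=0):
    rng = random.Random(seed)            # deterministic, so failures replay
    ref, opt, live = ReferenceSet(), OptimizedSet(), []
    for _ in range(steps):
        if live and rng.random() < 0.4:
            item = live.pop(rng.randrange(len(live)))
            ref.remove(item)
            opt.remove(item)
        else:
            item = rng.randbytes(32)     # Python 3.9+
            live.append(item)
            ref.add(item)
            opt.add(item)
        assert ref.get_root() == opt.get_root()

A fixed seed keeps each run reproducible, which recovers most of the
debuggability that hand-written unit tests are usually kept around for.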
* Re: [bitcoin-dev] A Better MMR Definition
From: praxeology_guy @ 2017-04-01 10:18 UTC
To: Bitcoin Protocol Discussion

Peter Todd,

This MMR structure looks good to me. I really like how wallets keep their
MMR proof and txo index instead of requiring the entire network to
maintain an index on txids with plain old utxo snapshots.

Re: "only left or right child of inner node be a fully spent node"...
that sounds fine to me, but the software should virtually consider that
the disappearing leaf nodes still exist; it would instead be, say, a
special case handled by the meta-hashing function. That would save a good
amount of time on unnecessary hashing. You might also adopt the rule: if
a parent node has a single fully spent child node, its hash is equal to
its other child's hash (see the sketch after this message).

Below are questions about txo/utxo MMR commitments after reading
https://petertodd.org/2016/delayed-txo-commitments. I'm mainly concerned
about the performance of recalculating all of the node hashes on old
spends, but with a long enough delay policy it probably isn't an issue.
Then there are the issues of people keeping their MMR proofs up to date
and discovering received txos before they get pruned. It sure would be
nice if a wallet didn't have to keep updating its MMR proof. Hopefully
spends would refer to old txos by their MMR index.

How are you ordering MMR additions? Are you only adding old utxos to the
MMR, or every old txo? I think you are doing all old txos (which would
mostly be spent nodes), but why not just old utxos? Are you doing it to
make the MMR index equal the blockchain-wide txo index, so such an index
can be used in all tx inputs except for non-confirmed transactions?
Potentially a tx could refer to a prevout by its MMR index, by an
all-blocks txo index (if we want to maintain that index), or by txid and
vout index, depending on how old the tx is.

What is the process for removing old utxos from the utxo set and placing
them into the MMR? Are you going to keep height in the utxo db, and then
iterate through the whole thing?

Are you still proposing a 1-year txo commitment delay? Do you have any
data or performance studies judging the cost of a larger utxo set with a
longer delay versus a smaller utxo set with a shorter delay? I would
figure a longer delay is better as long as the utxo set doesn't get too
big, since a longer delay means less MMR maintenance.

Have you considered not making a new commitment on every block, but
instead maybe every 6*24 blocks or so? This could reduce the number of
times the same nodes are re-hashed, while not having much impact on
security.

What about re-orgs? You'd have to restore the previous leaf and inner
nodes, so you'd need both the txos and MMR proofs, right? It looks like
you clear this info from the TXO journal when the delay duration
threshold is met. I guess this info might also be stored with the block
data, and could be recovered from there.

What are your current thoughts on block size weighting for MMR proofs?
Just count the extra byte length that full nodes need to relay as similar
to SegWit witness data?

Cheers,
Praxeology Guy
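One possible way to write the meta-hashing rule suggested above, with the
sentinel and node shape invented purely for illustration:

import hashlib

FULLY_SPENT = None   # sentinel for a subtree whose outputs are all spent

def hash_node(left, right):
    if left is FULLY_SPENT and right is FULLY_SPENT:
        return FULLY_SPENT                  # whole subtree disappears
    if left is FULLY_SPENT:
        return right                        # take the live child's hash
    if right is FULLY_SPENT:
        return left                         # no hashing needed either way
    return hashlib.sha256(left + right).digest()

Under this rule, a parent with one fully spent child takes its other
child's hash unchanged, so fully spent regions collapse upward without any
extra hashing.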
* Re: [bitcoin-dev] A Better MMR Definition
From: praxeology_guy @ 2017-04-01 19:46 UTC
Cc: Bitcoin Protocol Discussion

gmaxwell told me that most nodes would keep a full copy of the top of the
MMR tree. Here I am exploring how this could be turned into policy to
solve two problems:

- MMR proofs change over time.
- How to coordinate nodes so they keep different portions of the MMR, so
  that everyone can prune most of the structure while the entire network
  still retains multiple copies of the full MMR.

Define deltaLeafHeight as the number of tree layers between a node and the
leaves in the MMR data structure. We make it a policy that nodes are
expected to have all nodes with deltaLeafHeight >= DLH_REQUIRED, but that
nodes are free to prune any nodes with deltaLeafHeight < DLH_REQUIRED. Of
course a node could prune at DLH_REQUIRED or higher, but what I am
proposing is that messages and proofs by default would only include nodes
at deltaLeafHeight < DLH_REQUIRED.

Given the above, if a wallet didn't want to be continuously concerned
about updating the MMR proofs for its coins, then for each coin it would:

- Store the set of utxo digests that are children of the root-nearest node
  at deltaLeafHeight = DLH_REQUIRED. Call such a set of utxo digests the
  "pruned relatives".
- Pruned relative count = 2^DLH_REQUIRED - 1.
- Guessing the spentness status of the pruned relatives would in the worst
  case take 2^(pruned relative count) guesses.
- In the case where the MMR holds all txos (not just utxos at addition
  time)... the wallet should also keep a record of which of the pruned
  relatives were utxos.
- Any future information discovered about whether a pruned relative is
  spent would reduce the worst-case guess count by a factor of 2.

As an example, in the case where DLH_REQUIRED = 3:

- pruned relative count = 7
- worst-case spentness guess count = 128

Wallets storing the digests of pruned relatives could also help the entire
network discover otherwise lost portions of the MMR. If wallets stored not
just the pruned relatives' digests but also their corresponding utxos,
they could help other nodes find lost coins.

Cheers,
Praxeology Guy
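The arithmetic above, spelled out as a small sketch:

def pruned_relatives(dlh_required, known_spentness=0):
    """A kept node at deltaLeafHeight = dlh_required covers 2**dlh_required
    leaves; one is the wallet's own coin, the rest are pruned relatives.
    Each relative of unknown spentness doubles the worst-case guess count."""
    relatives = 2 ** dlh_required - 1
    unknown = relatives - known_spentness
    return relatives, 2 ** unknown

assert pruned_relatives(3) == (7, 128)                     # the example above
assert pruned_relatives(3, known_spentness=1) == (7, 64)   # one fact halves it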