* [bitcoin-dev] Merkle trees and mountain ranges @ 2016-06-15 0:14 Bram Cohen 2016-06-16 0:10 ` Peter Todd 0 siblings, 1 reply; 11+ messages in thread From: Bram Cohen @ 2016-06-15 0:14 UTC (permalink / raw) To: bitcoin-dev [-- Attachment #1: Type: text/plain, Size: 6336 bytes --]

This is in response to Peter Todd's proposal for Merkle Mountain Range commitments in blocks:

https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2016-May/012715.html

I'm in strong agreement that there's a compelling need to put UTXO commitments in blocks, and that the big barrier to getting it done is performance, particularly latency. But I have strong disagreements (or perhaps the right word is skepticism) about the details.

Peter proposes that there should be both UTXO and STXO commitments, and they should be based on Merkle Mountain Ranges based on Patricia Tries. My first big disagreement is about the need for STXO commitments. I think they're unnecessary and a performance problem. The STXO set is much larger than the utxo set and requires much more memory and horsepower to maintain. Most if not all of its functionality can in practice be done using the utxo set. Almost anything accepting proofs of inclusion and exclusion will have a complete history of block headers, so to prove inclusion in the stxo set you can use a utxo proof of inclusion in the past and a proof of exclusion for the most recent block. In the case of a txo which has never been included at all, it's generally possible to show that an ancestor of the txo in question was at one point included but that an incompatible descendant of it (or the ancestor itself) is part of the current utxo set. Generating these sorts of proofs efficiently can for some applications require a complete STXO set, but that can be done with a non-merkle set, getting the vastly better performance of an ordinary non-cryptographic hashtable.

The fundamental approach to handling the latency problem is to have the utxo commitments trail a bit. Computing utxo commitments takes a certain amount of time, too much to hold up block propagation but still hopefully vastly less than the average amount of time between blocks. Trailing by a single block is probably a bad idea because you sometimes get blocks back to back, but you never get blocks back to back to back to back. Having the utxo set be trailing by a fixed amount - five blocks is probably excessive - would do a perfectly good job of keeping latency from ever becoming an issue. Smaller commitments for the utxos added and removed in each block alone could be added without any significant performance penalty. That way all blocks would have sufficient commitments for completely up-to-date proofs of inclusion and exclusion. This is not a controversial approach.

Now I'm going to go out on a limb. My thesis is that usage of a mountain range is unnecessary, and that a merkle tree in the raw can be made serviceable by sprinkling magic pixie dust on the performance problem.

There are two causes of performance problems for merkle trees: hashing operations and memory cache misses. For hashing functions, the difference between a mountain range and a straight merkle tree is roughly that in a mountain range there's one operation for each new update times the number of times that thing will get merged into larger hills. If there are fewer levels of hills the number of operations is less but the expense of proof of inclusion will be larger.
For raw merkle trees the number of operations per thing added is the number of levels in the tree (the log base 2 of the number of things stored), minus the log base 2 of the number of things added at once since you can do lazy evaluation. For practical Bitcoin there are (very roughly) a million things stored, or 20 levels, and there are (even more roughly) about a thousand things stored per block, so each thing forces about 20 - 10 = 10 operations. If you follow the fairly reasonable guideline that mountain range hills go up by factors of four, you instead have 20/2 = 10 operations per thing added amortized. Depending on details this comparison can go either way but it's roughly a wash, and the complexity of a mountain range is clearly not worth it, at least from the point of view of CPU costs.

But CPU costs aren't the main performance problem in merkle trees. The biggest issue is cache misses, specifically l1 and l2 cache misses. These tend to take a long time to resolve, resulting in the CPU spending most of its time sitting around doing nothing. A naive tree implementation is pretty much the worst thing you can possibly build from a cache miss standpoint, and its performance will be completely unacceptable. Mountain ranges do a fabulous job of fixing this problem, because all their updates are merges, so the metrics are more like cache misses per block instead of cache misses per transaction.

The magic pixie dust I mentioned earlier involves a bunch of subtle implementation details to improve cache coherence, which should get the number of cache misses per transaction down under one, at which point it probably isn't a bottleneck any more. There is an implementation in the works here:

https://github.com/bramcohen/MerkleSet

This implementation isn't finished yet! I'm almost there, and I'm definitely feeling time pressure now. I've spent quite a lot of time on this, mostly because of a bunch of technical reworkings which proved necessary. This is the last time I ever write a database for kicks. But this implementation is good on all important dimensions, including:

Lazy root calculation
Few l1 and l2 cache misses
Small proofs of inclusion/exclusion
Reasonably simple implementation
Reasonably efficient in memory
Reasonable defense against malicious insertion attacks

There is a bit of a false dichotomy with the mountain range approach. Mountain ranges need underlying merkle trees, and mine are semantically nearly identical to Peter's. This is not a coincidence - I adopted patricia tries at his suggestion. There are a bunch of small changes which allow a more efficient implementation. I believe that my underlying merkle tree is unambiguously superior in every way, but the question of whether a mountain range is worth it is one which can only be answered empirically, and that requires a bunch of implementation work to be done, starting with me finishing my merkle tree implementation and then somebody porting it to C and optimizing it. The Python version has details which are ridiculous and only make sense once it gets ported, and even under the best of conditions Python performance is not strongly indicative of C performance.

[-- Attachment #2: Type: text/html, Size: 6926 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
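To make the arithmetic in the message above concrete, here is a small back-of-the-envelope sketch in Python. The constants are the rough figures quoted above - about a million things stored and about a thousand updates per block - and are approximations from the post, not measurements.

    import math

    def hashes_per_item(total_items, items_per_batch):
        # With lazy evaluation the top log2(batch size) levels are shared by
        # the whole batch, so each item only pays for the levels below the
        # point where the batch's paths through the tree diverge.
        levels = math.log2(total_items)       # ~20 for a million items
        shared = math.log2(items_per_batch)   # ~10 for a thousand items
        return levels - shared

    print(hashes_per_item(1_000_000, 1_000))  # ~10 hash operations per item

Whether the factor-of-four mountain range amortization beats this depends on the constants, which is the "roughly a wash" point above.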
* Re: [bitcoin-dev] Merkle trees and mountain ranges 2016-06-15 0:14 [bitcoin-dev] Merkle trees and mountain ranges Bram Cohen @ 2016-06-16 0:10 ` Peter Todd 2016-06-16 1:16 ` Bram Cohen 2016-06-18 3:22 ` Bram Cohen 0 siblings, 2 replies; 11+ messages in thread From: Peter Todd @ 2016-06-16 0:10 UTC (permalink / raw) To: Bram Cohen, Bitcoin Protocol Discussion [-- Attachment #1: Type: text/plain, Size: 7827 bytes --] On Tue, Jun 14, 2016 at 05:14:23PM -0700, Bram Cohen via bitcoin-dev wrote: > This is in response to Peter Todd's proposal for Merkle Mountain Range > commitments in blocks: > > https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2016-May/012715.html > > I'm in strong agreement that there's a compelling need to put UTXO > commitments in blocks, and that the big barrier to getting it done is > performance, particularly latency. But I have strong disagreements (or > perhaps the right word is skepticism) about the details. > > Peter proposes that there should be both UTXO and STXO commitments, and No, that's incorrect - I'm only proposing TXO commitments, not UTXO nor STXO commitments. > they should be based on Merkle Mountain Ranges based on Patricia Tries. My > first big disagreement is about the need for STXO commitments. I think > they're unnecessary and a performance problem. The STXO set is much larger > than the utxo set and requires much more memory and horespower to maintain. Again, I'm not proposing STXO commitments precisely because the set of _spent_ transactions grows without bound. TXO commitments with committed sums of remaining unspent TXO's and with pruning of old history are special in this regard, because once spent the data associated with spent transactions can be discarded completely, and at the same time, data associated with old history can be pruned with responsibility for keeping it resting on the shoulders of those owning those coins. > Most if not all of its functionality can in practice be done using the utxo > set. Almost anything accepting proofs of inclusion and exclusion will have > a complete history of block headers, so to prove inclusion in the stxo set > you can use a utxo proof of inclusion in the past and a proof of exclusion > for the most recent block. In the case of a txo which has never been > included at all, it's generally possible to show that an ancestor of the > txo in question was at one point included but that an incompatible > descendant of it (or the ancestor itself) is part of the current utxo set. > Generating these sorts of proofs efficiently can for some applications > require a complete STXO set, but that can done with a non-merkle set, > getting the vastly better performance of an ordinary non-cryptographic > hashtable. TXO commitments allows you to do all of this without requiring miners to have unbounded storage to create new blocks. > The fundamental approach to handling the latency problem is to have the > utxo commitments trail a bit. Computing utxo commitments takes a certain > amount of time, too much to hold up block propagation but still hopefully > vastly less than the average amount of time between blocks. Trailing by a > single block is probably a bad idea because you sometimes get blocks back > to back, but you never get blocks back to back to back to back. Having the > utxo set be trailing by a fixed amount - five blocks is probably excessive > - would do a perfectly good job of keeping latency from every becoming an > issue. 
Smaller commitments for the utxos added and removed in each block > alone could be added without any significant performance penalty. That way > all blocks would have sufficient commitments for a completely up to date > proofs of inclusion and exclusion. This is not a controversial approach. Agreed - regardless of approach adding latency to commitment calculations of all kinds is something I think we all agree can work in principle, although obviously it should be a last resort technique when optimization fails. > Now I'm going to go out on a limb. My thesis is that usage of a mountain > range is unnecessary, and that a merkle tree in the raw can be made > serviceable by sprinkling magic pixie dust on the performance problem. It'd help if you specified exactly what type of merkle tree you're talking about here; remember that the certificate transparency RFC appears to have reinvented merkle mountain ranges, and they call them "merkle trees". Bitcoin meanwhile uses a so-called "merkle tree" that's broken, and Zcash uses a partially filled fixed-sized perfect tree. > There are two causes of performance problems for merkle trees: hashing > operations and memory cache misses. For hashing functions, the difference > between a mountain range and a straight merkle tree is roughly that in a > mountain range there's one operation for each new update times the number > of times that thing will get merged into larger hills. If there are fewer > levels of hills the number of operations is less but the expense of proof > of inclusion will be larger. For raw merkle trees the number of operations > per thing added is the log base 2 of the number of levels in the tree, > minus the log base 2 of the number of things added at once since you can do > lazy evaluation. For practical Bitcoin there are (very roughly) a million > things stored, or 20 levels, and there are (even more roughly) about a > thousand things stored per block, so each thing forces about 20 - 10 = 10 > operations. If you follow the fairly reasonable guideline of mountain range > hills go up by factors of four, you instead have 20/2 = 10 operations per > thing added amortized. Depending on details this comparison can go either > way but it's roughly a wash and the complexity of a mountain range is > clearly not worth it at least from the point of view of CPU costs. I'm having a hard time understanding this paragraph; could you explain what you think is happening when things are "merged into larger hills"? > But CPU costs aren't the main performance problem in merkle trees. The > biggest issues is cache misses, specifically l1 and l2 cache misses. These > tend to take a long time to do, resulting in the CPU spending most of its > time sitting around doing nothing. A naive tree implementation is pretty > much the worst thing you can possibly build from a cache miss standpoint, > and its performance will be completely unacceptable. Mountain ranges do a > fabulous job of fixing this problem, because all their updates are merges > so the metrics are more like cache misses per block instead of cache misses > per transaction. > > The magic pixie dust I mentioned earlier involves a bunch of subtle > implementation details to keep cache coherence down which should get the > number of cache misses per transaction down under one, at which point it > probably isn't a bottleneck any more. 
There is an implementation in the > works here: As UTXO/STXO/TXO sets are all enormously larger than L1/L2 cache, it's impossible to get CPU cache misses below one for update operations. The closest thing to an exception is MMR's, which due to their insertion-ordering could have good cache locality for appends, in the sense that the mountain tips required to recalculate the MMR tip digest will already be in cache from the previous append. But that's not sufficient, as you also need to modify old TXO's further back in the tree to mark them as spent - that data is going to be far larger than L1/L2 cache. > https://github.com/bramcohen/MerkleSet > > This implementation isn't finished yet! I'm almost there, and I'm > definitely feeling time pressure now. I've spent quite a lot of time on > this, mostly because of a bunch of technical reworkings which proved > necessary. This is the last time I ever write a database for kicks. But > this implementation is good on all important dimensions, including: > > Lazy root calculation > Few l1 and l2 cache misses > Small proofs of inclusion/exclusion Have you looked at the pruning system that my proofchains work implements? -- https://petertodd.org 'peter'[:-1]@petertodd.org [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
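For readers following the exchange, here is a minimal sketch of the append path of an insertion-ordered merkle mountain range, the structure Peter is describing. It is purely illustrative Python - the tagging and serialization are invented and do not follow the proofchains code - but it shows why appends only touch the mountain tips, while marking an old TXO spent is a separate random access into much older data.

    import hashlib

    def H(tag, *parts):
        # Illustrative hashing; a real design would fix serialization carefully.
        return hashlib.sha256(tag + b''.join(parts)).digest()

    class MMR:
        """Toy insertion-ordered Merkle Mountain Range (illustration only)."""

        def __init__(self):
            self.peaks = []  # (height, digest) pairs, left to right

        def append(self, leaf_data):
            height, digest = 0, H(b'leaf', leaf_data)
            # Equal-height peaks merge like binary carry propagation, so an
            # append touches at most O(log n) nodes - the cache-friendly part.
            while self.peaks and self.peaks[-1][0] == height:
                _, left = self.peaks.pop()
                height, digest = height + 1, H(b'inner', left, digest)
            self.peaks.append((height, digest))

        def root(self):
            # Fold ("bag") the peaks right to left into a single commitment.
            acc = None
            for _, peak in reversed(self.peaks):
                acc = peak if acc is None else H(b'bag', peak, acc)
            return acc

    mmr = MMR()
    for i in range(5):
        mmr.append(i.to_bytes(4, 'big'))
    print(len(mmr.peaks))  # 2 peaks for 5 leaves (101 in binary)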
* Re: [bitcoin-dev] Merkle trees and mountain ranges 2016-06-16 0:10 ` Peter Todd @ 2016-06-16 1:16 ` Bram Cohen 2016-06-16 3:26 ` Peter Todd 2016-06-18 3:22 ` Bram Cohen 1 sibling, 1 reply; 11+ messages in thread From: Bram Cohen @ 2016-06-16 1:16 UTC (permalink / raw) To: Peter Todd; +Cc: Bitcoin Protocol Discussion [-- Attachment #1: Type: text/plain, Size: 6328 bytes --]

On Wed, Jun 15, 2016 at 5:10 PM, Peter Todd <pete@petertodd.org> wrote: > On Tue, Jun 14, 2016 at 05:14:23PM -0700, Bram Cohen via bitcoin-dev wrote: > > > > Peter proposes that there should be both UTXO and STXO commitments, and > > No, that's incorrect - I'm only proposing TXO commitments, not UTXO nor > STXO > commitments. >

What do you mean by TXO commitments? If you mean that it only records insertions rather than deletions, then that can do many of the same proofs but has no way of proving that something is currently in the UTXO set, which is functionality I'd like to provide.

When I say 'merkle tree' what I mean is a patricia trie. What I assume is meant by a merkle mountain range is a series of patricia tries of decreasing size each of which is an addition to the previous one, and they're periodically consolidated into larger tries so the old ones can go away. This data structure has the nice property that it's both in sorted order and has less than one cache miss per operation because the consolidation operations can be batched and done linearly. There are a number of different things you could be describing if I misunderstood.

> I'm not proposing STXO commitments precisely because the set of _spent_ > transactions grows without bound.

I'm worried that once there's real transaction fees everyone might stop consolidating dust and the set of unspent transactions might grow without bound as well, but that's a topic for another day.

> > Now I'm going to go out on a limb. My thesis is that usage of a mountain > > range is unnecessary, and that a merkle tree in the raw can be made > > serviceable by sprinkling magic pixie dust on the performance problem. > > It'd help if you specified exactly what type of merkle tree you're talking > about here; remember that the certificate transparency RFC appears to have > reinvented merkle mountain ranges, and they call them "merkle trees". > Bitcoin > meanwhile uses a so-called "merkle tree" that's broken, and Zcash uses a > partially filled fixed-sized perfect tree. >

What I'm making is a patricia trie. Its byte level definition is very similar to the one in your MMR codebase. Each level of the tree has a single metadata byte followed by two hashes. The hashes are truncated by one byte and the hash function is a non-padding variant of sha256 (right now it's just using regular sha256, but that's a nice optimization which allows everything to fit in a single block).

The possible metadata values are: TERM0, TERM1, TERMBOTH, ONLY0, ONLY1, MIDDLE. They mean:

TERM0, TERM1: There is a single thing in the tree on the specified side. The thing hashed on that side is that thing verbatim. The other side has more than one thing and the hash of it is the root of everything below.

TERMBOTH: There are exactly two things below and they're included inline. Note that two things is a special case; if there are more you sometimes have ONLY0 or ONLY1.

ONLY0, ONLY1: There are more than two things below and they're all on the same side.
This makes proofs of inclusion and exclusion simpler, and makes some implementation details easier, for example there's always something at every level with perfect memory positioning. It doesn't cause much extra memory usage because of the TERMBOTH exception for exactly two things.

MIDDLE: There are two or more things on both sides.

There's also a SINGLETON special case for a tree which contains only one thing, and an EMPTY special value for a tree which doesn't contain anything.

The main differences from your patricia trie are the non-padding sha256, the fact that each level doesn't hash in a record of its depth, and the usage of ONLY0 and ONLY1.

> > > There are two causes of performance problems for merkle trees: hashing > > operations and memory cache misses. For hashing functions, the difference > > between a mountain range and a straight merkle tree is roughly that in a > > mountain range there's one operation for each new update times the number > > of times that thing will get merged into larger hills. If there are fewer > > levels of hills the number of operations is less but the expense of proof > > of inclusion will be larger. For raw merkle trees the number of > operations > > per thing added is the log base 2 of the number of levels in the tree, > > minus the log base 2 of the number of things added at once since you can > do > > lazy evaluation. For practical Bitcoin there are (very roughly) a million > > things stored, or 20 levels, and there are (even more roughly) about a > > thousand things stored per block, so each thing forces about 20 - 10 = 10 > > operations. If you follow the fairly reasonable guideline of mountain > range > > hills go up by factors of four, you instead have 20/2 = 10 operations per > > thing added amortized. Depending on details this comparison can go either > > way but it's roughly a wash and the complexity of a mountain range is > > clearly not worth it at least from the point of view of CPU costs. > > I'm having a hard time understanding this paragraph; could you explain > what you > think is happening when things are "merged into larger hills"? >

I'm talking about the recalculation of mountain tips, assuming we're on the same page about what 'MMR' means.

> As UTXO/STXO/TXO sets are all enormously larger than L1/L2 cache, it's > impossible to get CPU cache misses below one for update operations. The > closest > thing to an exception is MMR's, which due to their insertion-ordering could > have good cache locality for appends, in the sense that the mountain tips > required to recalculate the MMR tip digest will already be in cache from > the > previous append. But that's not sufficient, as you also need to modify old > TXO's further back in the tree to mark them as spent - that data is going > to be > far larger than L1/L2 cache. >

This makes me think we're talking about subtly different things for MMRs. The ones I described above have sub-1 cache miss per update due to the amortized merging together of old mountains.

Technically even a patricia trie utxo commitment can have sub-1 cache misses per update if some of the updates in a single block are close to each other in memory. I think I can get practical Bitcoin updates down to a little bit less than one l2 cache miss per update, but not a lot less.

[-- Attachment #2: Type: text/html, Size: 7897 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
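A rough sketch of how the node hashing described in the message above could look, assuming a straightforward reading of it. The numeric metadata codes and the exact serialization are assumptions for illustration, not the actual MerkleSet layout; the point is that a metadata byte plus two 31-byte fields is 63 bytes, which is why a non-padding sha256 variant lets every node hash fit in a single 64-byte block.

    import hashlib

    # Placeholder metadata codes; the real MerkleSet encoding may differ.
    EMPTY, TERM0, TERM1, TERMBOTH, ONLY0, ONLY1, MIDDLE = range(7)

    HASH_LEN = 31  # sha256 truncated by one byte, as described above

    def entry_hash(entry):
        # Entries are hashed to a fixed size before being inserted.
        return hashlib.sha256(entry).digest()[:HASH_LEN]

    def level_hash(metadata, side0, side1):
        # One level: a metadata byte followed by two truncated hashes. For
        # TERM0/TERM1/TERMBOTH a side holds the terminal entry's hash itself;
        # otherwise it holds the root hash of everything below on that side.
        # 1 + 31 + 31 = 63 bytes, one sha256 block with a non-padding variant.
        assert len(side0) == HASH_LEN and len(side1) == HASH_LEN
        return hashlib.sha256(bytes([metadata]) + side0 + side1).digest()[:HASH_LEN]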
* Re: [bitcoin-dev] Merkle trees and mountain ranges 2016-06-16 1:16 ` Bram Cohen @ 2016-06-16 3:26 ` Peter Todd 2016-06-16 9:07 ` Bram Cohen 0 siblings, 1 reply; 11+ messages in thread From: Peter Todd @ 2016-06-16 3:26 UTC (permalink / raw) To: Bram Cohen; +Cc: Bitcoin Protocol Discussion [-- Attachment #1: Type: text/plain, Size: 6753 bytes --] On Wed, Jun 15, 2016 at 06:16:26PM -0700, Bram Cohen wrote: > On Wed, Jun 15, 2016 at 5:10 PM, Peter Todd <pete@petertodd.org> wrote: > > > On Tue, Jun 14, 2016 at 05:14:23PM -0700, Bram Cohen via bitcoin-dev wrote: > > > > > > Peter proposes that there should be both UTXO and STXO commitments, and > > > > No, that's incorrect - I'm only proposing TXO commitments, not UTXO nor > > STXO > > commitments. > > > > What do you mean by TXO commitments? If you mean that it only records > insertions rather than deletions, then that can do many of the same proofs > but has no way of proving that something is currently in the UTXO set, > which is functionality I'd like to provide. I think you need to re-read my original post on TXO commitments, specifically where I say: # TXO Commitments A merkle tree committing to the state of __all transaction outputs, both spent and unspent__, we can provide a method of compactly proving the current state of an output. https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2016-May/012715.html > When I say 'merkle tree' what I mean is a patricia trie. What I assume is > meant by a merkle mountain range is a series of patricia tries of > decreasing size each of which is an addition to the previous one, and > they're periodically consolidated into larger tries so the old ones can go > away. This data structure has the nice property that it's both in sorted > order and has less than one cache miss per operation because the > consolidation operations can be batched and done linearly. There are a > number of different things you could be describing if I misunderstood. Nope, MMR's are completely unlike what you just described. > > I'm not proposing STXO commitments precisely because the set of _spent_ > > transactions grows without bound. > > > I'm worried that once there's real transaction fees everyone might stop > consolidating dust and the set of unspent transactions might grow without > bound as well, but that's a topic for another day. Ok, but then if you're concerned about that risk, why introduce a data structure - the STXO set - that's _guaranteed_ to grow without bound? > > > Now I'm going to go out on a limb. My thesis is that usage of a mountain > > > range is unnecessary, and that a merkle tree in the raw can be made > > > serviceable by sprinkling magic pixie dust on the performance problem. > > > > It'd help if you specified exactly what type of merkle tree you're talking > > about here; remember that the certificate transparency RFC appears to have > > reinvented merkle mountain ranges, and they call them "merkle trees". > > Bitcoin > > meanwhile uses a so-called "merkle tree" that's broken, and Zcash uses a > > partially filled fixed-sized perfect tree. > > > > What I'm making is a patricia trie. Its byte level definition is very > similar to the one in your MMR codebase. Which codebase exactly? I have both a insertion-ordered list (MMR) and a key:value mapping (referred to as a "merbinner tree" in the codebase) in the proofchains codebase. They're very different data structures. > Each level of the tree has a single metadata byte and followed by two > hashes. 
The hashes are truncated by one byte and the hash function is a > non-padding variant of sha256 (right now it's just using regular sha256, > but that's a nice optimization which allows everything to fit in a single > block). > > The possible metadata values are: TERM0, TERM1, TERMBOTH, ONLY0, ONLY1, > MIDDLE. They mean: > > TERM0, TERM1: There is a single thing in the tree on the specified side. > The thing hashed on that side is that thing verbatim. The other side has > more than one thing and the hash of it is the root of everything below. > > TERMBOTH: There are exactly two things below and they're included inline. > Note that two things is a special case, if there are more you sometimes > have ONLY0 or ONLY1. > > ONLY0, ONLY1: There are more than two things below and they're all on the > same side. This makes proofs of inclusion and exclusion simpler, and makes > some implementation details easier, for example there's always something at > every level with perfect memory positioning. It doesn't cause much extra > memory usage because of the TERMBOTH exception for exactly two things. > > MIDDLE: There two or more things on both sides. > > There's also a SINGLETON special case for a tree which contains only one > thing, and an EMPTY special value for tree which doesn't contain anything. > > The main differences to your patricia trie are the non-padding sha256 and > that each level doesn't hash in a record of its depth and the usage of > ONLY0 and ONLY1. I'm rather confused, as the above sounds nothing like what I've implemented, which only has leaf nodes, inner nodes, and the special empty node singleton, for both the MMR and merbinner trees. > > I'm having a hard time understanding this paragraph; could you explain > > what you > > think is happening when things are "merged into larger hills"? > > > > I'm talking about the recalculation of mountain tips, assuming we're on the > same page about what 'MMR' means. Yeah, we're definitely not... In MMR's append operations never need to modify mountain contents. > > As UTXO/STXO/TXO sets are all enormously larger than L1/L2 cache, it's > > impossible to get CPU cache misses below one for update operations. The > > closest > > thing to an exception is MMR's, which due to their insertion-ordering could > > have good cache locality for appends, in the sense that the mountain tips > > required to recalculate the MMR tip digest will already be in cache from > > the > > previous append. But that's not sufficient, as you also need to modify old > > TXO's further back in the tree to mark them as spent - that data is going > > to be > > far larger than L1/L2 cache. > > > > This makes me think we're talking about subtly different things for MMRs. > The ones I described above have sub-1 cache miss per update due to the > amortized merging together of old mountains. Again, see above. > Technically even a patricia trie utxo commitment can have sub-1 cache > misses per update if some of the updates in a single block are close to > each other in memory. I think I can get practical Bitcoin updates down to a > little bit less than one l2 cache miss per update, but not a lot less. I'm very confused as to why you think that's possible. When you say "practical Bitcoin updates", what exactly is the data structure you're proposing to update? How is it indexed? -- https://petertodd.org 'peter'[:-1]@petertodd.org [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
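For readers trying to keep the two structures straight, here is a deliberately generic sketch of a key:value merkle binary radix tree built only from leaf nodes, inner nodes, and an empty value, which is the shape being discussed. It is not the merbinner code from proofchains and the hashing is invented; note that this naive version creates one inner node per shared key bit when two keys collide, which is exactly the case the ONLY0/ONLY1 markers on one side and prefix-committing nodes on the other are meant to handle more cleverly.

    import hashlib

    def H(tag, *parts):
        return hashlib.sha256(tag + b''.join(parts)).digest()

    EMPTY_DIGEST = H(b'empty')

    class Leaf:
        def __init__(self, key, value):
            self.key, self.value = key, value
        def digest(self):
            return H(b'leaf', self.key, self.value)

    class Inner:
        def __init__(self, left=None, right=None):
            self.left, self.right = left, right  # child nodes or None (empty)
        def digest(self):
            l = self.left.digest() if self.left else EMPTY_DIGEST
            r = self.right.digest() if self.right else EMPTY_DIGEST
            return H(b'inner', l, r)

    def bit(key, depth):
        # Which side of an inner node a key belongs on at a given depth.
        return (key[depth // 8] >> (7 - depth % 8)) & 1

    def insert(node, leaf, depth=0):
        if node is None:
            return leaf
        if isinstance(node, Leaf):
            if node.key == leaf.key:
                return leaf  # overwrite the existing value
            # Colliding keys: push the old leaf down, then fall through, which
            # builds one inner node per shared prefix bit.
            node = insert(Inner(), node, depth)
        if bit(leaf.key, depth) == 0:
            node.left = insert(node.left, leaf, depth + 1)
        else:
            node.right = insert(node.right, leaf, depth + 1)
        return node

    root = None
    for k in (b'a' * 32, b'b' * 32):
        root = insert(root, Leaf(k, b'value'))
    print(root.digest().hex())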
* Re: [bitcoin-dev] Merkle trees and mountain ranges 2016-06-16 3:26 ` Peter Todd @ 2016-06-16 9:07 ` Bram Cohen 2016-06-17 4:34 ` Peter Todd 0 siblings, 1 reply; 11+ messages in thread From: Bram Cohen @ 2016-06-16 9:07 UTC (permalink / raw) To: Peter Todd; +Cc: Bitcoin Protocol Discussion [-- Attachment #1: Type: text/plain, Size: 5620 bytes --] On Wed, Jun 15, 2016 at 8:26 PM, Peter Todd <pete@petertodd.org> wrote: > > > > What do you mean by TXO commitments? If you mean that it only records > > insertions rather than deletions, then that can do many of the same > proofs > > but has no way of proving that something is currently in the UTXO set, > > which is functionality I'd like to provide. > > I think you need to re-read my original post on TXO commitments, > specifically > where I say: > > # TXO Commitments > > A merkle tree committing to the state of __all transaction outputs, > both spent > and unspent__, we can provide a method of compactly proving the > current state of > an output. > > > https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2016-May/012715.html Okay, clearly my assumptions about the parts of that post I didn't read carefully were way off. I'll have to look through it carefully to be able to make coherent apples to apples comparisons. > I'm worried that once there's real transaction fees everyone might stop > > consolidating dust and the set of unspent transactions might grow without > > bound as well, but that's a topic for another day. > > Ok, but then if you're concerned about that risk, why introduce a data > structure - the STXO set - that's _guaranteed_ to grow without bound? > I'm not proposing STXO set commitments either. My point was that there should be incentives for collecting dust. That has nothing to do with this thread though and should be discussed separately (also I don't feel like discussing it because I don't have a good proposal). > > What I'm making is a patricia trie. Its byte level definition is very > > similar to the one in your MMR codebase. > > Which codebase exactly? I have both a insertion-ordered list (MMR) and a > key:value mapping (referred to as a "merbinner tree" in the codebase) in > the > proofchains codebase. They're very different data structures. > I'm talking about your merbinner trees. I read through that part of your codebase carefully and got the impression that the MMR tree section used it as a building block. > > The main differences to your patricia trie are the non-padding sha256 and > > that each level doesn't hash in a record of its depth and the usage of > > ONLY0 and ONLY1. > > I'm rather confused, as the above sounds nothing like what I've > implemented, > which only has leaf nodes, inner nodes, and the special empty node > singleton, > for both the MMR and merbinner trees. > It's quite a bit like merbinner trees. I've basically taken the leaf nodes and smushed them into the inner nodes above them, thus saving a hashing operation and some memory. They're both binary radix trees. > Technically even a patricia trie utxo commitment can have sub-1 cache > > misses per update if some of the updates in a single block are close to > > each other in memory. I think I can get practical Bitcoin updates down > to a > > little bit less than one l2 cache miss per update, but not a lot less. > > I'm very confused as to why you think that's possible. When you say > "practical > Bitcoin updates", what exactly is the data structure you're proposing to > update? How is it indexed? 
My calculations are: a Bitcoin block contains about 2000 updates. The l2 cache is about 256 kilobytes, and if an update is about 32 bytes times two for the parents, grandparents, etc. then an l2 cache can contain about 4000 values. If the current utxo set size is about 2000 * 4000 = 8,000,000 then about half the pages which contain a transaction will contain a second one. I think the utxo set is currently about an order of magnitude greater than that, so the number of such collisions will be fairly small, hence my 'less than one but not a lot less' comment.

As for how it's indexed, at a crypto definition level it's just a binary radix tree. In terms of how it's indexed in memory, that involves some optimizations to avoid cache misses. Memory is allocated into blocks of about the size of an l2 cache (or maybe an l1 cache, it will require some testing and optimization). Blocks are either branch blocks, which keep everything in fixed positions, or leaf blocks, which contain fixed size entries for nodes plus indexes within the same leaf block of their children. Branch blocks can have many children which can be either branch blocks or leaf blocks, but typically are either all branch blocks or all leaf blocks. Branch blocks always have exactly one parent. Leaf blocks always have all their inputs come from a single branch block, but there can be multiple ones of those. When a branch block overflows it first tries to put stuff into the last leaf block it used, and if there's no more room it allocates a new one. It's fairly common for branches to have just a few leaf children, but they also could have a lot, depending on whether the base 2 log of the number of things currently in the set modulo the number of levels in a branch is a small number.

Usually when an update is done it consists of first checking the appropriate output of the root block (it's jumped to directly to avoid unnecessary memory lookups. If there's nothing there the algorithm will walk back until it finds something.) That leads directly to (usually) another branch whose output is jumped to directly again. At Bitcoin utxo set sizes that will usually lead to a leaf block, which is then walked down manually to find the actual terminal node, which is then updated, and the parent, grandparent, etc. is then marked invalid until something which was already marked invalid is hit, and it exits. Calculation of hash values is done lazily.

[-- Attachment #2: Type: text/html, Size: 7149 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
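Bram's estimate can be spelled out as follows (illustrative Python; the constants are the rough figures from the paragraph above and the collision model is a simple balls-into-bins approximation, not a measurement):

    updates_per_block = 2000
    l2_cache_bytes = 256 * 1024
    bytes_per_entry = 32 * 2  # an entry plus its parents, grandparents, etc., roughly
    entries_per_region = l2_cache_bytes // bytes_per_entry  # ~4000

    def misses_per_update(utxo_set_size):
        # Treat the set as split into regions that each fit in l2; only the
        # first update landing in a region pays the miss, later ones find it warm.
        regions = utxo_set_size / entries_per_region
        touched = regions * (1 - (1 - 1 / regions) ** updates_per_block)
        return touched / updates_per_block

    print(misses_per_update(2000 * 4000))       # ~0.63 at the 8,000,000 size above
    print(misses_per_update(10 * 2000 * 4000))  # ~0.95 at ten times that size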
* Re: [bitcoin-dev] Merkle trees and mountain ranges 2016-06-16 9:07 ` Bram Cohen @ 2016-06-17 4:34 ` Peter Todd 2016-06-18 2:43 ` Bram Cohen 0 siblings, 1 reply; 11+ messages in thread From: Peter Todd @ 2016-06-17 4:34 UTC (permalink / raw) To: Bram Cohen; +Cc: Bitcoin Protocol Discussion [-- Attachment #1: Type: text/plain, Size: 9279 bytes --] On Thu, Jun 16, 2016 at 02:07:26AM -0700, Bram Cohen wrote: > On Wed, Jun 15, 2016 at 8:26 PM, Peter Todd <pete@petertodd.org> wrote: > Okay, clearly my assumptions about the parts of that post I didn't read > carefully were way off. I'll have to look through it carefully to be able > to make coherent apples to apples comparisons. Thanks! > > I'm worried that once there's real transaction fees everyone might stop > > > consolidating dust and the set of unspent transactions might grow without > > > bound as well, but that's a topic for another day. > > > > Ok, but then if you're concerned about that risk, why introduce a data > > structure - the STXO set - that's _guaranteed_ to grow without bound? > > > > I'm not proposing STXO set commitments either. My point was that there > should be incentives for collecting dust. That has nothing to do with this > thread though and should be discussed separately (also I don't feel like > discussing it because I don't have a good proposal). Ah, yeah, I misunderstood you there; as expected absolutely no-one is proposing STXO set commitments. :) > > > The main differences to your patricia trie are the non-padding sha256 and > > > that each level doesn't hash in a record of its depth and the usage of > > > ONLY0 and ONLY1. > > > > I'm rather confused, as the above sounds nothing like what I've > > implemented, > > which only has leaf nodes, inner nodes, and the special empty node > > singleton, > > for both the MMR and merbinner trees. > > > > It's quite a bit like merbinner trees. I've basically taken the leaf nodes > and smushed them into the inner nodes above them, thus saving a hashing > operation and some memory. They're both binary radix trees. Ah, I see what you mean now. So above you said that in merbinner trees each node "hash[es] in a record of its depth" That's actually incorrect: each node commits to the prefix that all keys below that level start with, not just the depth. This means that in merbinner trees, cases where multiple keys share parts of the same prefix are handled efficiently, without introducing extra levels unnecessarily; there's no need for the ONLY0/1 nodes as the children of an inner node will always be on different sides. When keys are randomly distributed, this isn't a big deal; OTOH against attackers who are choosing keys, e.g. by grinding hashes, merbinner trees always have maximum depths in proportion to log2(n) of the actual number of items in the tree. Grinding is particularly annoying to deal with due to the birthday attack: creating a ground prefix 64 bits long only takes 32 bits worth of work. In my deterministic expressions work one of the ideas I've been tossing around is rather than always using hash digests directly for when you need to commit to some data, we could instead extend the idea of a digest to that of a "commitment", where a commitment is simply some short, but variable-sized, string that uniquely maps to a given set of data. Secondly, commitments do *not* always guarantee that the original data can't be recovered from the commitment itself. 
By allowing commitments to be variable sized - say 0 to ~64 bytes - we get a number of advantages: 1) Data shorter than the length of a digest (32 bytes) can be included in the commitment itself, improving efficiency. 2) Data a little longer than a digest can have hashing delayed, to better fill up blocks. In particular, case #2 handles your leaf node optimizations generically, without special cases and additional complexity. It'd also be a better way to do the ONLY0/1 cases, as if the "nothing on this side" symbol is a single byte, each additional colliding level would simply extend the commitment without hashing. In short, you'd have nearly the same level of optimization even if at the cryptography level your tree consists of only leaves, inner nodes, and nil. Another advantage of variable sized commitments is that it can help make clear to users when it's possible to brute force the message behind the commitment. For instance, digest from a hashed four byte integer can be trivially reversed by just trying all combinations. Equally, if that integer is concatenated with a 32 byte digest that the attacker knows, the value of the integer can be brute forced. > > Technically even a patricia trie utxo commitment can have sub-1 cache > > > misses per update if some of the updates in a single block are close to > > > each other in memory. I think I can get practical Bitcoin updates down > > to a > > > little bit less than one l2 cache miss per update, but not a lot less. > > > > I'm very confused as to why you think that's possible. When you say > > "practical > > Bitcoin updates", what exactly is the data structure you're proposing to > > update? How is it indexed? > > > My calculations are: a Bitcoin block contains about 2000 updates. The l2 > cache is about 256 kilobytes, and if an update is about 32 bytes times two > for the parents, grandparents, etc. then an l2 cache can contain about 4000 > values. If the current utxo size is about 2000 * 4000 = 8,000,000 in size > then about half the pages which contain a transaction will contain a second > one. I think the utxo set is currently about an order of magnitude greater > than that, so the number of such collisions will be fairly mall, hence my > 'less than one but not a lot less' comment. Your estimate of updates requiring 32 bytes of data is *way* off. Each inner node updated on the path to a leaf node will itself require 32 bytes of data to be fetched - the digest of the sibling. As of block 416,628, there are 39,167,128 unspent txouts, giving us a tree about 25 levels deep. So if I want to update a single leaf, I need to read: 25 nodes * 32 bytes/node = 800 bytes of data. Naively, that'd mean our 2,000 updates needs to read 1.6MB from RAM, which is 6.4x bigger than the L2 cache - it's just not going to fit. Taking into account the fact that this is a batched update improves things a little bit. For a node at level i with random access patterns and N accesses total our amortised cost is 1/(1 + N/2^i) Summing that over 2,000 leaf updates and 25 levels gives us ~29,000 total updates, 0.9MB, which is still a lot larger than L2 cache. While this might fit in L3 cache - usually on the order of megabytes - this is a rather optimistic scenario anyway: we're assuming no other cache pressure and 100% hit rate. Anyway hashing is pretty slow. 
The very fast BLAKE2 is about 3 cycles/byte (SHA256 is about 15 cycles/byte) so hashing that same data would take around 200 cycles, and probably quite a bit more in practice due to overheads from our short message lengths; fetching a cache line from DRAM only takes about 1,000 cycles. I'd guess that once other overheads are taken into account, even if you could eliminate L2/L3 cache-misses it wouldn't be much of an improvement. > As for how it's indexed, at a crypto definition level it's just a binary > radix tree. In terms of how it's indexed in memory, that involves some > optimizations to avoid cache misses. Memory is allocated into blocks of > about the size of an 12 cache (or maybe an l1 cache, it will require some > testing and optimization). Blocks are either branch blocks, which keep > everything in fixed positions, or leaf blocks, which contain fixed size > entries for nodes plus indexes within the same leaf block of their > children. Branch blocks can have many children which can be either branch > blocks or leaf blocks, but typically are either all branch blocks or all > leaf blocks. Branch blocks always have exactly one parent. Leaf blocks > always have all their inputs come from a single branch block, but there can > be multiple ones of those. When a branch block overflows it first tries to > put stuff into the last leaf block it used, and if there's no more room it > allocates a new one. It's fairly common for branches to have just a few > leaf children, but they also could have a lot, depending on whether the > base 2 log of the number of things currently in the set modulo the number > levels in a branch is a small number. > > Usually when an update is done it consists of first checking the > appropriate output of the root block (it's jumped to directly to avoid > unnecessary memory lookups. If there's nothing there the algorithm will > walk back until it finds something.) That leads directly to (usually) > another branch whose output is jumped to directly again. At Bitcoin utxo > set sizes that will usually lead to a leaf block, which is then walked down > manually to find the actual terminal node, which is then updated, and the > parent, grandparent, etc. is then marked invalid until something which was > already marked invalid is hit, and it exits. Calculation of hash values is > done lazily. I think it's safe to say that given our working set is significantly larger than the L2/L3 cache available, none of the above optimizations are likely to help much. Better to just keep the codebase simple and use standard techniques. -- https://petertodd.org 'peter'[:-1]@petertodd.org [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 455 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
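The batching arithmetic above can be checked directly (illustrative Python; 25 levels, 2,000 updates and 32 bytes per sibling digest are the figures Peter uses, and the 1/(1 + N/2^i) amortization is the approximation he states - the exact level-indexing convention shifts the totals slightly):

    levels = 25      # tree depth for ~39 million unspent outputs
    batch = 2000     # leaf updates per block
    node_bytes = 32  # sibling digest fetched per touched node

    naive_bytes = levels * batch * node_bytes
    print(naive_bytes / 1e6)  # 1.6 MB if every update re-reads its whole path

    # Amortised cost per access of a node at level i is 1 / (1 + batch / 2**i),
    # so the expected number of distinct nodes touched per level is
    # batch / (1 + batch / 2**i) = batch * 2**i / (2**i + batch).
    touched = sum(batch * 2**i / (2**i + batch) for i in range(1, levels + 1))
    print(round(touched))                        # ~29,000 node updates
    print(round(touched * node_bytes / 1e6, 1))  # ~0.9 MB, still well above L2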
* Re: [bitcoin-dev] Merkle trees and mountain ranges 2016-06-17 4:34 ` Peter Todd @ 2016-06-18 2:43 ` Bram Cohen 2016-06-18 23:01 ` Peter Todd 0 siblings, 1 reply; 11+ messages in thread From: Bram Cohen @ 2016-06-18 2:43 UTC (permalink / raw) To: Peter Todd; +Cc: Bitcoin Protocol Discussion [-- Attachment #1: Type: text/plain, Size: 9205 bytes --] On Thu, Jun 16, 2016 at 9:34 PM, Peter Todd <pete@petertodd.org> wrote: > So above you said that in merbinner trees each node "hash[es] in a record > of > its depth" That's actually incorrect: each node commits to the prefix that > all > keys below that level start with, not just the depth. I considered a similar trick at the implementation rather than the definition level: A node doesn't have to store the prefix which is implicit in its position. That would create a fair number of headaches though, because I'm using fixed size stuff in important ways, and it could at most save about 10% of memory, so it goes into the 'maybe later' bucket. > > This means that in merbinner trees, cases where multiple keys share parts > of > the same prefix are handled efficiently, without introducing extra levels > unnecessarily; there's no need for the ONLY0/1 nodes as the children of an > inner node will always be on different sides. > > When keys are randomly distributed, this isn't a big deal; OTOH against > attackers who are choosing keys, e.g. by grinding hashes, merbinner trees > always have maximum depths in proportion to log2(n) of the actual number of > items in the tree. Grinding is particularly annoying to deal with due to > the > birthday attack: creating a ground prefix 64 bits long only takes 32 bits > worth > of work. > Yes an attacker can force the tree to be deeper in places, but it's mitigated in several ways: (1) The way I'm using memory it won't cause a whole new block to be allocated, it will just force log(attack strength) - log(n) nodes to be used (2) logarithmic growth being what it is that isn't such a big amount (3) With the special casing of TERMBOTH an attacker needs three things with the same prefix to pull off an attack rather than two, which is quite a bit harder to pull off. That said, it wouldn't be all that hard to change how the hashing function works to do a single hash for a whole series of ONLY in a row instead of a new one at every level, which would make the attacker only able to force extra memory usage instead of extra CPU, but this is a slightly annoying thing to write to stop a fairly lame attack, so I'm at least not doing it for my initial implementation. I could likely be convinced that it's worth doing before an actual release though. There's another implementation trick to do the same thing for memory usage, which is much more in the 'do later' category because it doesn't involve changing the format and hence it can be put off. > In particular, case #2 handles your leaf node optimizations generically, > without special cases and additional complexity. It'd also be a better way > to > do the ONLY0/1 cases, as if the "nothing on this side" symbol is a single > byte, > each additional colliding level would simply extend the commitment without > hashing. In short, you'd have nearly the same level of optimization even > if at > the cryptography level your tree consists of only leaves, inner nodes, and > nil. > I'm taking pains to make all the hashing be of fixed-size things, so that a non-padding variant of a secure hashing algorithm can be used. 
The chains of ONLY nodes above would force a special exception to that, which can be done but is annoying. Making things smaller than a single block (64 bytes) won't speed up hashing time, and making things a single byte longer than that doubles it.

> Another advantage of variable sized commitments is that it can help make clear > to users when it's possible to brute force the message behind the > commitment. > For instance, digest from a hashed four byte integer can be trivially > reversed > by just trying all combinations. Equally, if that integer is concatenated > with > a 32 byte digest that the attacker knows, the value of the integer can be > brute > forced. >

I'm hashing all strings before inserting to get them to be a fixed size and avoid a few different attacks. In Bitcoin most of the strings added are longer than that so it's a form of compression. A custom hash function which 'hashes' very short strings by repeating them verbatim could be used, but that seems like not such a hot idea. I'm making extensive use of things being fixed size everywhere, which improves performance in a lot of ways.

> > > Technically even a patricia trie utxo commitment can have sub-1 cache > > > > misses per update if some of the updates in a single block are close > to > > > > each other in memory. I think I can get practical Bitcoin updates > down > > > to a > > > > little bit less than one l2 cache miss per update, but not a lot > less. > > > > > > I'm very confused as to why you think that's possible. When you say > > > "practical > > > Bitcoin updates", what exactly is the data structure you're proposing > to > > > update? How is it indexed? >

I'll re-answer this because I did a terrible job before. The entire data structure consists of nodes which contain a metadata byte (TERM0, ONLY1, etc.) followed by fixed size secure hashes, and (in some cases) pointers to where the children are. The secure hashes in parent nodes are found by hashing their children verbatim (or the stored string in the case of a TERM). This is very conventional. All of the cleverness is in where in memory these nodes are stored so that tracing down the tree causes very few cache misses.

(The alternate approach is to have each node store its own hash rather than that be stored by the parent. That approach means that when you're recalculating you have to look up siblings which doubles the number of cache misses. Not such a hot idea.)

At the root there's a branch block. It consists of all nodes up to some fixed depth - let's say 12 - with that depth set so that it roughly fits within a single memory page. Branch blocks are arranged with the nodes in fixed position defined by the prefix they correspond to, and the terminals have outpointers to other blocks. Because they're all clustered together, a lookup or update will only require a single

Below the root block are other branch blocks. Each of them has a fixed 12 bit prefix it is responsible for. When doing a lookup a second cache miss will be hit for levels 13-24, because those are all clustered in the same branch block.

Below the second level of root block (at Bitcoin utxo set scale - this varies based on how much is stored) there are leaf blocks. A leaf block consists of nodes with outpointers to its own children which must be within the same leaf block. All entry points into a leaf block are from the same branch block, and the leaf block has no out pointers to other blocks.
When a leaf block overflows the entry point into it which overflowed is moved into the active leaf for that branch, and if that's full a new one is allocated. There's some subtlety to exactly how this is done, but I've gotten everything down to simple expedient tricks with straightforward implementations. The thing which matters for now is that there's only a single cache miss for each leaf node, because they also fit in a page. So at Bitcoin scale there will probably only be 3 cache misses for a lookup, and that's a worst case scenario. The first one is probably always warm, bringing it down to 2, and if you do a bunch in sorted order they'll probably hit the same second level branches repeatedly bringing it down to 1, and might even average less than that if there are enough that the leaf block has multiple things being accessed. (These same tricks can be applied to merbinner tree implementation as well, although that would be a bit less parsimonious with memory, by a small constant factor.) > Anyway hashing is pretty slow. The very fast BLAKE2 is about 3 cycles/byte > (SHA256 is about 15 cycles/byte) so hashing that same data would take > around > 200 cycles, and probably quite a bit more in practice due to overheads > from our > short message lengths; fetching a cache line from DRAM only takes about > 1,000 > cycles. I'd guess that once other overheads are taken into account, even > if you > could eliminate L2/L3 cache-misses it wouldn't be much of an improvement. > Those numbers actually back up my claims about performance. If you're doing a single update and recalculating the root afterwards, then the amount of rehashing to be done is about 30 levels deep times 64 bytes per thing hashed times 15 cycles per byte then it's about 28,800 cycles of hashing. If you have a naive radix tree implementation which hits a cache miss at every level then that's 30,000 cycles, which is about half the performance time, certainly worth optimizing. If instead of sha256 you use blake2 (Which sounds like a very good idea!) then hashing for an update will be about 5760 cycles and performance will be completely dominated by cache misses. If a more cache coherent implementation is used, then the cost of cache misses will be 3000 cycles, which will be a non-factor with sha256 and a significant but not dominating one with blake2. It's reasonable to interpret those numbers as saying that blake2 and cache coherent implementation are both clearly worth it (probably necessary for real adoption) and that an amortized binary radix tree is tempting but not worth it. [-- Attachment #2: Type: text/html, Size: 10852 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
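The cycle counts in the last paragraph can be reproduced with the same assumed constants used in the thread - 30 levels, 64 bytes hashed per level, 15 vs 3 cycles per byte for sha256 vs blake2, and roughly 1,000 cycles per DRAM fetch (all rough figures, not benchmarks):

    levels = 30        # tree depth assumed above
    hash_input = 64    # bytes hashed per level
    dram_fetch = 1000  # rough cycles per cache miss serviced from DRAM

    def update_cost(cycles_per_byte, cache_misses):
        hashing = levels * hash_input * cycles_per_byte
        misses = cache_misses * dram_fetch
        return hashing, misses

    print(update_cost(15, 30))  # sha256, naive layout:       (28800, 30000)
    print(update_cost(3, 30))   # blake2, naive layout:       (5760, 30000)
    print(update_cost(3, 3))    # blake2, cache-aware layout: (5760, 3000)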
* Re: [bitcoin-dev] Merkle trees and mountain ranges 2016-06-18 2:43 ` Bram Cohen @ 2016-06-18 23:01 ` Peter Todd 2016-07-15 23:00 ` Bram Cohen 0 siblings, 1 reply; 11+ messages in thread From: Peter Todd @ 2016-06-18 23:01 UTC (permalink / raw) To: Bram Cohen; +Cc: Bitcoin Protocol Discussion [-- Attachment #1: Type: text/plain, Size: 10002 bytes --] On Fri, Jun 17, 2016 at 07:43:47PM -0700, Bram Cohen wrote: > On Thu, Jun 16, 2016 at 9:34 PM, Peter Todd <pete@petertodd.org> wrote: > > > So above you said that in merbinner trees each node "hash[es] in a record > > of > > its depth" That's actually incorrect: each node commits to the prefix that > > all > > keys below that level start with, not just the depth. > > > I considered a similar trick at the implementation rather than the > definition level: A node doesn't have to store the prefix which is implicit > in its position. That would create a fair number of headaches though, > because I'm using fixed size stuff in important ways, and it could at most > save about 10% of memory, so it goes into the 'maybe later' bucket. Wait, are you saying you think committing to the prefix is a "trick"? It's just a very simple - and possibly not-optimal - way of committing to what data should be accessible under a given node. An alternative would have been ensure that in terms of _cryptographic_ tree position. By "position", are you talking about position within RAM? That may or may not be a viable optimization, but it's quite separate from the question of the cryptographic structure of the data. > > This means that in merbinner trees, cases where multiple keys share parts > > of > > the same prefix are handled efficiently, without introducing extra levels > > unnecessarily; there's no need for the ONLY0/1 nodes as the children of an > > inner node will always be on different sides. > > > > When keys are randomly distributed, this isn't a big deal; OTOH against > > attackers who are choosing keys, e.g. by grinding hashes, merbinner trees > > always have maximum depths in proportion to log2(n) of the actual number of > > items in the tree. Grinding is particularly annoying to deal with due to > > the > > birthday attack: creating a ground prefix 64 bits long only takes 32 bits > > worth > > of work. > > > > Yes an attacker can force the tree to be deeper in places, but it's > mitigated in several ways: (1) The way I'm using memory it won't cause a > whole new block to be allocated, it will just force log(attack strength) - > log(n) nodes to be used (2) logarithmic growth being what it is that isn't > such a big amount (3) With the special casing of TERMBOTH an attacker needs > three things with the same prefix to pull off an attack rather than two, > which is quite a bit harder to pull off. > That said, it wouldn't be all that hard to change how the hashing function > works to do a single hash for a whole series of ONLY in a row instead of a > new one at every level, which would make the attacker only able to force > extra memory usage instead of extra CPU, but this is a slightly annoying > thing to write to stop a fairly lame attack, so I'm at least not doing it > for my initial implementation. I could likely be convinced that it's worth > doing before an actual release though. There's another implementation trick > to do the same thing for memory usage, which is much more in the 'do later' > category because it doesn't involve changing the format and hence it can be > put off. 
> > In particular, case #2 handles your leaf node optimizations generically, without special cases and additional complexity. It'd also be a better way to do the ONLY0/1 cases, as if the "nothing on this side" symbol is a single byte, each additional colliding level would simply extend the commitment without hashing. In short, you'd have nearly the same level of optimization even if at the cryptography level your tree consists of only leaves, inner nodes, and nil.
>
> I'm taking pains to make all the hashing be of fixed-size things, so that a non-padding variant of a secure hashing algorithm can be used. The chains of ONLY things above would force a special exception to that, which can be done but is annoying. Making things smaller than a single block (64 bytes) won't speed up hashing time, and making things a single byte longer than that doubles it.

Have you seen how BLAKE2 omits padding when the data to be hashed happens to be exactly one block in size? It's significantly faster than SHA256, and that's a standard part of the algorithm already.

> > > Technically even a patricia trie utxo commitment can have sub-1 cache misses per update if some of the updates in a single block are close to each other in memory. I think I can get practical Bitcoin updates down to a little bit less than one l2 cache miss per update, but not a lot less.
> >
> > I'm very confused as to why you think that's possible. When you say "practical Bitcoin updates", what exactly is the data structure you're proposing to update? How is it indexed?
>
> I'll re-answer this because I did a terrible job before. The entire data structure consists of nodes which contain a metadata byte (TERM0, ONLY1, etc.) followed by fixed-size secure hashes, and (in some cases) pointers to where the children are. The secure hashes in parent nodes are found by hashing their children verbatim (or the stored string in the case of a TERM). This is very conventional. All of the cleverness is in where in memory these nodes are stored, so that tracing down the tree causes very few cache misses.
>
> (The alternate approach is to have each node store its own hash rather than have that stored by the parent. That approach means that when you're recalculating you have to look up siblings, which doubles the number of cache misses. Not such a hot idea.)

Have you benchmarked the cost of a hash operation vs. the cost of a cache miss? What are the actual numbers?

> At the root there's a branch block. It consists of all nodes up to some fixed depth - let's say 12 - with that depth set so that it roughly fits within a single memory page. Branch blocks are arranged with the nodes in fixed positions defined by the prefix they correspond to, and the terminals have outpointers to other blocks. Because they're all clustered together, a lookup or update will only require a single

A single....?

> Below the root block are other branch blocks. Each of them has a fixed 12-bit prefix it is responsible for. When doing a lookup a second cache miss will be hit for levels 13-24, because those are all clustered in the same branch block.

So, is this also how the data structure looks cryptographically, or is the way it's hashed separate from the above description?

> Below the second level of root block (at Bitcoin utxo set scale - this varies based on how much is stored) there are leaf blocks.
> A leaf block consists of nodes with outpointers to its own children, which must be within the same leaf block. All entry points into a leaf block are from the same branch block, and the leaf block has no outpointers to other blocks. When a leaf block overflows, the entry point which overflowed is moved into the active leaf for that branch, and if that's full a new one is allocated. There's some subtlety to exactly how this is done, but I've gotten everything down to simple, expedient tricks with straightforward implementations. The thing which matters for now is that there's only a single cache miss for each leaf node, because they also fit in a page.

Page as in 4096 bytes? But L1/L2/L3 cache is arranged in terms of 64 byte cache lines - where do pages come in here?

At Bitcoin UTXO set scale, how large do you think these data structures are?

> So at Bitcoin scale there will probably be only 3 cache misses for a lookup, and that's a worst-case scenario. The first one is probably always warm, bringing it down to 2, and if you do a bunch in sorted order they'll probably hit the same second-level branches repeatedly, bringing it down to 1, and it might even average less than that if there are enough lookups that a leaf block has multiple things being accessed.

"Sorted order" - what exact type of sorting do you mean here?

> > Anyway hashing is pretty slow. The very fast BLAKE2 is about 3 cycles/byte (SHA256 is about 15 cycles/byte) so hashing that same data would take around 200 cycles, and probably quite a bit more in practice due to overheads from our short message lengths; fetching a cache line from DRAM only takes about 1,000 cycles. I'd guess that once other overheads are taken into account, even if you could eliminate L2/L3 cache-misses it wouldn't be much of an improvement.
>
> Those numbers actually back up my claims about performance. If you're doing a single update and recalculating the root afterwards, then the amount of rehashing to be done is about 30 levels deep times 64 bytes per thing hashed times 15 cycles per byte, or about 28,800 cycles of hashing. If you have a naive radix tree implementation which hits a cache miss at every level, then that's another 30,000 cycles - about half of the total time, and certainly worth optimizing. If instead of sha256 you use blake2 (which sounds like a very good idea!) then hashing for an update will be about 5,760 cycles and performance will be completely dominated by cache misses. If a more cache-coherent implementation is used, then the cost of cache misses will be about 3,000 cycles, which will be a non-factor with sha256 and a significant but not dominating one with blake2.

But that's assuming the dataset in question fits in cache; I don't see how it does. Since it doesn't, I'm arguing the total % improvement by _any_ cache optimization on the subset that does fit in cache will be relatively small. Again, how large a dataset do you think you're working with here?

-- 
https://petertodd.org
'peter'[:-1]@petertodd.org
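(For a sense of scale on that last question, a rough sizing sketch. The 100-million-item and ~10 GB figures are the ones Bram uses in his reply below; the cache size is a typical 2016-era part, not anything specific to either implementation.)

    # Rough sizing: how the structure compares to CPU caches.
    items = 100_000_000                 # figure used in the reply below
    total_bytes = 10 * 1024**3          # "will fit within 10 gigabytes"
    print(total_bytes / items)          # ~107 bytes per item on average

    l3 = 8 * 1024**2                    # typical shared L3 cache
    print(total_bytes / l3)             # ~1280: the set is three orders of
                                        # magnitude bigger than L3, so only the
                                        # top few levels of branch blocks can
                                        # plausibly stay cache-resident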
* Re: [bitcoin-dev] Merkle trees and mountain ranges
  2016-06-18 23:01 ` Peter Todd
@ 2016-07-15 23:00 ` Bram Cohen
  0 siblings, 0 replies; 11+ messages in thread

From: Bram Cohen @ 2016-07-15 23:00 UTC (permalink / raw)
To: Peter Todd; +Cc: Bitcoin Protocol Discussion

[-- Attachment #1: Type: text/plain, Size: 4273 bytes --]

On Sat, Jun 18, 2016 at 4:01 PM, Peter Todd <pete@petertodd.org> wrote:
> Have you seen how BLAKE2 omits padding when the data to be hashed happens to be exactly one block in size? It's significantly faster than SHA256, and that's a standard part of the algorithm already.

That's very convenient! I didn't know it, but had 'look up how blake2 does padding' in my list of stuff to do. I'm leaning heavily towards using blake2b at this point, at least for internal hashing.

> > At the root there's a branch block. It consists of all nodes up to some fixed depth - let's say 12 - with that depth set so that it roughly fits within a single memory page. Branch blocks are arranged with the nodes in fixed positions defined by the prefix they correspond to, and the terminals have outpointers to other blocks. Because they're all clustered together, a lookup or update will only require a single
>
> A single....?

Okay, I've figured out the root cause of the general confusion here. It's mostly my fault.

There are a few different media on which data can be stored, with different properties in terms of how long it takes to retrieve data from them and how much of a readahead they typically have. I was misreading the l2 cache size as the main memory readahead amount, which is... probably wrong? The readahead properties of memory aren't well documented and apparently vary a lot. On SSDs the readahead typically pulls down a kilobyte at once, and those are called pages, hence my use of that term above. But since my real point is that my implementation should act as a silver bullet which can have acceptable performance even on extremely bad devices, I'll give an analysis of how well it works when everything is stored on a regular spinning hard drive.

Let's say you're storing 100 million items, which will fit within 10 gigabytes. If you set the block depths to about 10 bits they'll be about 32K, and if you set the size of leaf blocks to be about the same then memory efficiency will be good, because the leaf blocks will store twigs of about 2^7 in size while having 2^10 space, so they'll fit reasonably. Almost everything will be three blocks from root, so updates will generally require two disk seeks (plus one more for a write, but those are generally faster because they get batched).

For latency numbers, I'm going off these: https://gist.github.com/jboner/2841832

If the blockchain is very full of simple transactions and a disk seek takes 15 milliseconds, then - going with the assumptions that a block is about 600 seconds, that the blockchain can handle 4 transactions per second, and that each of them is 3 updates (one utxo spent plus two new ones created) - that's 15 * 600 * 4 * 3 * 2 milliseconds per block, or about 200 seconds per block, so if the utxo roots trail by a few blocks even a ludicrously underpowered node could keep up. On an SSD keeping up is completely trivial; the problem becomes one of how quickly you can validate an entire blockchain history.
On an SSD, a read takes about 0.15 milliseconds and you have to do 5 of them per update instead of 2 because the natural memory block size is 4k, so it's about 1 millisecond per update, or 600 * 4 * 3 milliseconds total for each block, which is about 7 seconds. That's large enough that making the utxo root trail by two blocks is still a good idea, but small enough that it isn't a big factor in the cost of running a node. It is enough that validating a complete block history might take a while though, and even validating just the last year would take a few days. This is very conservative, and it's assuming that *everything* is kept on an SSD. If the numbers work out better and a few layers are kept in regular memory, validating a whole year of history might only take a few hours.

Hopefully that all makes a fairly good case that a raw merkle tree utxo root trailing by a few blocks is a viable strategy. The data structures in the MMR proposal are fairly complicated, and the analysis of them talks in somewhat vague terms about things being fast and slow. A similar analysis of the MMR proposal, specifying storage media and expected latency numbers, would clarify the reasoning a lot.

(By the way, sorry for the slow response - I got preempted by a bunch of other work duties.)
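Checking that arithmetic explicitly, with the same assumptions as above (15 ms per disk seek, 0.15 ms per SSD read, 600-second blocks, 4 transactions per second, 3 updates per transaction) - a rough sketch, with names of my own choosing:

    # Back-of-envelope check of the two per-block estimates above.
    block_interval_s = 600
    tx_per_s = 4
    updates_per_tx = 3
    updates_per_block = block_interval_s * tx_per_s * updates_per_tx  # 7,200

    # Spinning disk: 2 seeks per update at ~15 ms each.
    hdd_seconds = updates_per_block * 2 * 15 / 1000
    print(hdd_seconds)   # 216.0 -> the "about 200 seconds per block" above

    # SSD: 5 reads per update at ~0.15 ms each is 0.75 ms, rounded up to
    # ~1 ms per update in the text, which gives the "about 7 seconds" figure.
    print(updates_per_block * 5 * 0.15 / 1000)  # 5.4
    print(updates_per_block * 1 / 1000)         # 7.2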
* Re: [bitcoin-dev] Merkle trees and mountain ranges
  2016-06-16  0:10 ` Peter Todd
  2016-06-16  1:16 ` Bram Cohen
@ 2016-06-18  3:22 ` Bram Cohen
  2016-06-18 22:09 ` Peter Todd
  1 sibling, 1 reply; 11+ messages in thread

From: Bram Cohen @ 2016-06-18 3:22 UTC (permalink / raw)
To: Peter Todd; +Cc: Bitcoin Protocol Discussion

[-- Attachment #1: Type: text/plain, Size: 2323 bytes --]

On Wed, Jun 15, 2016 at 5:10 PM, Peter Todd <pete@petertodd.org> wrote:
> On Tue, Jun 14, 2016 at 05:14:23PM -0700, Bram Cohen via bitcoin-dev wrote:
>
> > The fundamental approach to handling the latency problem is to have the utxo commitments trail a bit. Computing utxo commitments takes a certain amount of time, too much to hold up block propagation but still hopefully vastly less than the average amount of time between blocks. Trailing by a single block is probably a bad idea because you sometimes get blocks back to back, but you never get blocks back to back to back to back. Having the utxo set be trailing by a fixed amount - five blocks is probably excessive - would do a perfectly good job of keeping latency from ever becoming an issue. Smaller commitments for the utxos added and removed in each block alone could be added without any significant performance penalty. That way all blocks would have sufficient commitments for completely up to date proofs of inclusion and exclusion. This is not a controversial approach.
>
> Agreed - regardless of approach, adding latency to commitment calculations of all kinds is something I think we all agree can work in principle, although obviously it should be a last resort technique when optimization fails.

An important point: adding latency to utxo commitments does not imply latency to proofs of inclusion and exclusion! If roots of what's added and deleted in each block are included as well, then a proof of inclusion can be done by a proof of inclusion in the trailing utxo set followed by proofs of exclusion from all the following deletion sets, or by a proof of inclusion in one of the single-block addition sets followed by proofs of exclusion from all the more recent deletion sets. Likewise, a proof of exclusion can be a proof of exclusion from the trailing utxo set followed by proofs of exclusion from all the more recent addition sets, or a single proof of inclusion in a recent deletion set.

This does make proofs larger (except in the case of recent deletions and maybe recent additions) and adds complexity, so it shouldn't be done unless necessary. But validation before block propagation needs to be extremely fast, so for utxo roots this trick is probably both necessary and sufficient.
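To make that proof composition concrete, here is a toy model of the logic just described. Plain Python sets stand in for the committed merkle sets, and set membership stands in for the corresponding proofs of inclusion and exclusion; this is an illustration of the rules above, not an actual proof format from either proposal.

    def currently_unspent(txo, trailing_utxo_set, later_blocks):
        """later_blocks: list of (additions, deletions) set pairs, oldest first,
        covering every block after the one the trailing utxo set commits to."""
        if txo in trailing_utxo_set:
            # Inclusion in the trailing set, plus exclusion from every
            # later deletion set.
            return all(txo not in deleted for _, deleted in later_blocks)
        for i, (added, _) in enumerate(later_blocks):
            if txo in added:
                # Inclusion in one single-block addition set, plus exclusion
                # from every deletion set from that block onward.
                return all(txo not in deleted for _, deleted in later_blocks[i:])
        return False

    # Example: txoA was created two blocks ago and not spent; txoB was spent.
    blocks = [({"txoA"}, set()), (set(), {"txoB"})]
    print(currently_unspent("txoA", {"txoB", "txoC"}, blocks))  # True
    print(currently_unspent("txoB", {"txoB", "txoC"}, blocks))  # False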
* Re: [bitcoin-dev] Merkle trees and mountain ranges
  2016-06-18  3:22 ` Bram Cohen
@ 2016-06-18 22:09 ` Peter Todd
  0 siblings, 0 replies; 11+ messages in thread

From: Peter Todd @ 2016-06-18 22:09 UTC (permalink / raw)
To: Bram Cohen; +Cc: Bitcoin Protocol Discussion

[-- Attachment #1: Type: text/plain, Size: 3340 bytes --]

On Fri, Jun 17, 2016 at 08:22:04PM -0700, Bram Cohen wrote:
> On Wed, Jun 15, 2016 at 5:10 PM, Peter Todd <pete@petertodd.org> wrote:
> > Agreed - regardless of approach, adding latency to commitment calculations of all kinds is something I think we all agree can work in principle, although obviously it should be a last resort technique when optimization fails.
>
> An important point: adding latency to utxo commitments does not imply latency to proofs of inclusion and exclusion! If roots of what's added and deleted in each block are included as well, then a proof of inclusion can be done by a proof of inclusion in the trailing utxo set followed by proofs of exclusion from all the following deletion sets, or by a proof of inclusion in one of the single-block addition sets followed by proofs of exclusion from all the more recent deletion sets. Likewise, a proof of exclusion can be a proof of exclusion from the trailing utxo set followed by proofs of exclusion from all the more recent addition sets, or a single proof of inclusion in a recent deletion set.
>
> This does make proofs larger (except in the case of recent deletions and maybe recent additions) and adds complexity, so it shouldn't be done unless necessary.

So, to be clear, you're assuming that blocks commit to key:value maps of the block contents, specifically a pre-block "UTXO deletion/things that this block spent" set?

First of all, it's interesting how the much smaller dataset of a pre-block key:value map would make L2/L3 caching optimizations much more likely to be relevant. :)

That type of solution would be very similar to the solutions treechains would need to prove coins haven't been doublespent. Basically, in treechains the system as a whole is a datastructure indexed by time and prefix. So, if you want to prove a valid spend you need to convince me of three things:

1. The coin existed as of time t1 at prefix p

2. At t2, p, a valid spend was published.

3. Between t1 and t2 at prefix p no other valid spend was published.

Paths to any prefix p as of time t will have about log2(len(p)) size (beyond the top-level chain), similar to your above suggestion. Of course, unlike your above suggestion, in treechains it's not clear if step #1 can be done without another n*log(N)-ish sized proof in a truly trustless environment!

> But validation before block propagation needs to be extremely fast, so for utxo roots this trick is probably both necessary and sufficient.

I'm _not_ of the opinion that validation before propagation needs to be done at all - I think it's perfectly reasonable to propagate blocks that you have not validated at all (beyond checking PoW as an anti-DoS measure). The time it takes miners to start mining the next block - and collecting fees - is however very important.

In practice, I think we're mostly in agreement here, but because I'm happy to propagate prior to validating, I'd be OK with protocol designs that required miners to have relatively large amounts of RAM - say 32GB - dedicated to UTXO lookup, because that wouldn't require relay nodes to also have those kinds of resources available to them once validationless propagation was implemented.
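(A toy rendering of the three conditions above, just to pin down their logical structure; a dictionary of published messages stands in for the real time-and-prefix-indexed datastructure, and this says nothing about how the proofs themselves would be sized or verified.)

    # published: maps (time, prefix) -> set of messages seen there.
    # "Existed as of t1" is modelled as a creation record at t1 for simplicity.
    def spend_is_valid(published, coin, p, t1, t2):
        existed = ("create", coin) in published.get((t1, p), set())   # condition 1
        spent = ("spend", coin) in published.get((t2, p), set())      # condition 2
        no_double = all(("spend", coin) not in published.get((t, p), set())
                        for t in range(t1, t2))                       # condition 3
        return existed and spent and no_double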
-- 
https://petertodd.org
'peter'[:-1]@petertodd.org