* [bitcoin-dev] UHS: Full-node security without maintaining a full UTXO set
@ 2018-05-16 16:36 Cory Fields
2018-05-17 15:28 ` Matt Corallo
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Cory Fields @ 2018-05-16 16:36 UTC (permalink / raw)
To: Bitcoin Dev
Tl;dr: Rather than storing all unspent outputs, store their hashes. Untrusted
peers can supply the full outputs when needed, with very little overhead.
Any attempt to spoof those outputs would be apparent, as their hashes would not
be present in the hash set. There are many advantages to this, most apparently
in disk and memory savings, as well as a validation speedup. The primary
disadvantage is a small increase in network traffic. I believe that the
advantages outweigh the disadvantages.
--
Bitcoin’s unspent transaction output set (usually referred to as “The UTXO
set”) has two primary roles: providing proof that previous outputs exist to be
spent, and providing the actual previous output data for verification when new
transactions attempts to spend them. These roles are not usually discussed
independently, but as Bram Cohen's TXO Bitfield [0] idea hints, there are
compelling reasons to consider them this way.
To see why, consider running a node with the following changes:
- For each new output, gather all extra data that will be needed for
verification when spending it later as an input: the amount, scriptPubKey,
creation height, coinbaseness, and output type (p2pkh, p2sh, p2wpkh, etc.).
Call this the Dereferenced Prevout data.
- Create a hash from the concatenation of the new outpoint and the dereferenced
prevout data. Call this a Unspent Transaction Output Hash.
- Rather than storing the full dereferenced prevout entries in a UTXO set as is
currently done, instead store their hashes to an Unspent Transaction Output
Hash Set, or UHS.
- When relaying a transaction, append the dereferenced prevout for each input.
Now when a transaction is received, it contains everything needed for
verification, including the input amount, height, and coinbaseness, which would
have otherwise required a lookup the UTXO set.
To verify an input's unspentness, again create a hash from the concatenation of
the referenced outpoint and the provided dereferenced prevout, and check for
its presence in the UHS. The hash will only be present if a hash of the exact
same data was previously added to (and not since removed from) the UHS. As
such, we are protected from a peer attempting to lie about the dereferenced
prevout data.
### Some benefits of the UHS model
- Requires no consensus changes, purely a p2p/implementation change.
- UHS is substantially smaller than a full UTXO set (just over half for the
main chain, see below). In-memory caching can be much more effective as a
result.
- A block’s transactions can be fully verified before doing a potentially
expensive database lookup for the previous output data. The UHS can be
queried afterwards (or in parallel) to verify previous output inclusion.
- Entire blocks could potentially be verified out-of-order because all input
data is provided; only the inclusion checks have to be in-order. Admittedly
this is likely too complicated to be realistic.
- pay-to-pubkey outputs are less burdensome on full nodes, since they use no
more space on-disk than pay-to-pubkey-hash or pay-to-script-hash. Taproot and
Graftroot outputs may share the same benefits.
- The burden of holding UTXO data is technically shifted from the verifiers to
the spender. In reality, full nodes will likely always have a copy as well,
but conceptually it's a slight improvement to the incentive model.
- Block data from peers can also be used to roll backwards during a reorg. This
potentially enables an even more aggressive pruning mode.
- UTXO storage size grows exactly linearly with UTXO count, as opposed to
growing linearly with UTXO data size. This may be relevant when considering
new larger output types which would otherwise cause the UTXO Set size to
increase more quickly.
- The UHS is a simple set, no need for a key-value database. LevelDB could
potentially be dropped as a dependency in some distant future.
- Potentially integrates nicely with Pieter Wuille's Rolling UTXO set hashes
[1]. Unspent Transaction Output Hashes would simply be mapped to points on a
curve before adding them to the set.
- With the help of inclusion proofs and rolling hashes, libbitcoinconsensus
could potentially safely verify entire blocks. The size of the required
proofs would be largely irrelevant as they would be consumed locally.
- Others?
### TxIn De-duplication
Setting aside the potential benefits, the obvious drawback of using a UHS is a
significant network traffic increase. Fortunately, some properties of
transactions can be exploited to offset most of the difference.
For quick reference:
p2pkh scriptPubKey: DUP HASH160 [pubkey hash] EQUALVERIFY CHECKSIG
p2pkh scriptSig: [signature] [pubkey]
p2sh scriptPubKey: HASH160 [script hash] EQUAL
p2sh scriptSig: [signature(s)] [script]
Notice that if a peer is sending a scriptPubKey and a scriptSig together, as
they would when using a UHS, there would likely be some redundancy. Using a
p2sh output for example, the scriptPubKey contains the script hash, and the
scriptSig contains the script itself. Therefore when sending dereferenced
prevouts over the wire, any hash which can be computed can be omitted and only
the preimages sent.
Non-standard output scripts must be sent in-full, though because they account
for only ~1% of all current UTXOs, they are rare enough to be ignored here.
### Intra-block Script De-duplication
When transactions are chained together in the same block, dereferenced prevout
data for these inputs would be redundant, as the full output data is already
present. For that reason, these dereferenced prevouts can be omitted when
sending over the wire.
The downside would be a new reconstruction pass requirement prior to
validation.
### Data
Here's some preliminary testing with a naive POC implementation patched into
Bitcoin Core. Note that the final sizes will depend on optimization of the
serialization format. The format used for these tests is suboptimal for sure.
Syncing mainnet to block 516346:
UTXO Node UHS Node
IBD Network Data: 153G 157G
Block disk space: 175G 157G
UTXO disk space : 2.8G 1.6G
Total disk space: 177.8G 158.6G
The on-disk block-space reduction comes from the elimination of the Undo data
that Bitcoin Core uses to roll back orphaned blocks. For UHS Nodes, this data
is merged into to the block data and de-duplicated.
Compared to the UXTO model, using a UHS reduces disk space by ~12%, yet only
requires ~5% more data over the wire.
Experimentation shows intra-block de-duplication to be of little help in
practice, as it only reduces overhead by ~0.2% on mainnet. It could become more
useful if, for example, CPFP usage increases substantially in the future.
### Safety
Assuming sha256 for the UHS's hash function, I don't believe any fundamental
changes to Bitcoin's security model are introduced. Because the unspent
transaction output hashes commit to all necessary data, including output types,
it should not be possible to be tricked into validating using mutated or forged
inputs.
### Compatibility
Transitioning from the current UTXO model would be annoying, but not
particularly painful. I'll briefly describe my current preferred approach, but
it makes sense to largely ignore this until there's been some discussion about
UHS in general.
A new service-bit should be allocated to indicate that a node is willing to
serve blocks and transactions with dereferenced prevout data appended. Once
connected to a node advertising this feature, nodes would use a new getdata
flag, creating MSG_PREVDATA_BLOCK and MSG_PREVDATA_TX.
Because current full nodes already have this data readily available in the
block-undo files, it is trivial to append on-the-fly. For that reason, it would
be easy to backport patches to the current stable version of Bitcoin Core in
order to enable serving these blocks even before they could be consumed. This
would avoid an awkward bootstrapping phase where there may only be a few nodes
available to serve to all new nodes.
Admittedly I haven't put much thought into the on-disk format, I'd rather leave
that to a database person. Though it does seem like a reasonable excuse to
consider moving away from LevelDB.
Wallets would begin holding full prevout data for their unspent outputs, though
they could probably back-into the data as-is.
### Serialization
I would prefer to delay this discussion until a more high-level discussion has
been had, otherwise this would be a magnet for nits. The format used to gather
the data above can be seen in the implementation below.
It should be noted, though, that the size of a UHS is directly dependent on the
chosen hash function. Smaller hashes could potentially be used, but I believe
that to be unwise.
### Drawbacks
The primary drawback of this approach is the ~5% network ovhead.
Additionally, there is the possibility that a few "bridge nodes" may be needed
for some time. In a future with a network of only pruned UHS nodes, an old
wallet with no knowledge of its dereferenced prevout data would need to
broadcast an old-style transaction, and have a bridge node append the extra
data before forwarding it along the network.
I won't speculate further there, except to say that I can't imagine a
transition problem that wouldn't have a straightforward solution.
Migration issues aside, am I missing any obvious drawbacks?
### Implementation
This code [2] was a quick hack-job, just enough to gather some initial data. It
builds a UHS in memory and never flushes to disk. Only a single run works,
nasty things will happen upon restart. It should only be viewed in order to get
an idea of what changes are needed. Only enough for IBD is implemented,
mempool/wallet/rpc are likely all broken. It is definitely not consensus-safe.
### Acknowledgement
I consider the UHS concept to be an evolution of Bram Cohen's TXO bitfield
idea. Bram also provided invaluable input when initially walking through the
feasibility of a UHS.
Pieter Wuille's work on Rolling UTXO set hashes served as a catalyst for
rethinking how the UTXO set may be represented.
Additional thanks to those at at Financial Crypto and the CoreDev event
afterwards who helped to flesh out the idea:
Tadge Dryja
Greg Sanders
John Newbery
Neha Narula
Jeremy Rubin
Jim Posen
...and everyone else who has chimed in.
[0] https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2017-March/013928.html
[1] https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2017-May/014337.html
[2] https://github.com/theuni/bitcoin/tree/utxo-set-hash3
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [bitcoin-dev] UHS: Full-node security without maintaining a full UTXO set
2018-05-16 16:36 [bitcoin-dev] UHS: Full-node security without maintaining a full UTXO set Cory Fields
@ 2018-05-17 15:28 ` Matt Corallo
2018-05-17 16:56 ` Gregory Maxwell
2018-05-18 15:42 ` Alex Mizrahi
2 siblings, 0 replies; 7+ messages in thread
From: Matt Corallo @ 2018-05-17 15:28 UTC (permalink / raw)
To: lists, Bitcoin Protocol Discussion
Hey Cory,
I'm generally a fan of having an option to "prove a block is valid when
relaying it" instead of "just relay it", but I am concerned that this
proposal is overfitting the current UTXO set. Specifically, because UTXO
entries are (roughly) 32 bytes per output plus 32 bytes per transaction
on disk today, a material increase in batching and many-output
transactions may significantly reduce the UTXO-set-size gain in this
proposal while adding complexity to block relay as well as increase the
size of block data relayed, which can have adverse effects on
propagation. I'd love to see your tests re-run on simulated transaction
data with more batching of sends.
Matt
On 05/16/18 12:36, Cory Fields via bitcoin-dev wrote:
> Tl;dr: Rather than storing all unspent outputs, store their hashes. Untrusted
> peers can supply the full outputs when needed, with very little overhead.
> Any attempt to spoof those outputs would be apparent, as their hashes would not
> be present in the hash set. There are many advantages to this, most apparently
> in disk and memory savings, as well as a validation speedup. The primary
> disadvantage is a small increase in network traffic. I believe that the
> advantages outweigh the disadvantages.
>
> --
>
> Bitcoin’s unspent transaction output set (usually referred to as “The UTXO
> set”) has two primary roles: providing proof that previous outputs exist to be
> spent, and providing the actual previous output data for verification when new
> transactions attempts to spend them. These roles are not usually discussed
> independently, but as Bram Cohen's TXO Bitfield [0] idea hints, there are
> compelling reasons to consider them this way.
>
> To see why, consider running a node with the following changes:
>
> - For each new output, gather all extra data that will be needed for
> verification when spending it later as an input: the amount, scriptPubKey,
> creation height, coinbaseness, and output type (p2pkh, p2sh, p2wpkh, etc.).
> Call this the Dereferenced Prevout data.
> - Create a hash from the concatenation of the new outpoint and the dereferenced
> prevout data. Call this a Unspent Transaction Output Hash.
> - Rather than storing the full dereferenced prevout entries in a UTXO set as is
> currently done, instead store their hashes to an Unspent Transaction Output
> Hash Set, or UHS.
> - When relaying a transaction, append the dereferenced prevout for each input.
>
> Now when a transaction is received, it contains everything needed for
> verification, including the input amount, height, and coinbaseness, which would
> have otherwise required a lookup the UTXO set.
>
> To verify an input's unspentness, again create a hash from the concatenation of
> the referenced outpoint and the provided dereferenced prevout, and check for
> its presence in the UHS. The hash will only be present if a hash of the exact
> same data was previously added to (and not since removed from) the UHS. As
> such, we are protected from a peer attempting to lie about the dereferenced
> prevout data.
>
> ### Some benefits of the UHS model
>
> - Requires no consensus changes, purely a p2p/implementation change.
>
> - UHS is substantially smaller than a full UTXO set (just over half for the
> main chain, see below). In-memory caching can be much more effective as a
> result.
>
> - A block’s transactions can be fully verified before doing a potentially
> expensive database lookup for the previous output data. The UHS can be
> queried afterwards (or in parallel) to verify previous output inclusion.
>
> - Entire blocks could potentially be verified out-of-order because all input
> data is provided; only the inclusion checks have to be in-order. Admittedly
> this is likely too complicated to be realistic.
>
> - pay-to-pubkey outputs are less burdensome on full nodes, since they use no
> more space on-disk than pay-to-pubkey-hash or pay-to-script-hash. Taproot and
> Graftroot outputs may share the same benefits.
>
> - The burden of holding UTXO data is technically shifted from the verifiers to
> the spender. In reality, full nodes will likely always have a copy as well,
> but conceptually it's a slight improvement to the incentive model.
>
> - Block data from peers can also be used to roll backwards during a reorg. This
> potentially enables an even more aggressive pruning mode.
>
> - UTXO storage size grows exactly linearly with UTXO count, as opposed to
> growing linearly with UTXO data size. This may be relevant when considering
> new larger output types which would otherwise cause the UTXO Set size to
> increase more quickly.
>
> - The UHS is a simple set, no need for a key-value database. LevelDB could
> potentially be dropped as a dependency in some distant future.
>
> - Potentially integrates nicely with Pieter Wuille's Rolling UTXO set hashes
> [1]. Unspent Transaction Output Hashes would simply be mapped to points on a
> curve before adding them to the set.
>
> - With the help of inclusion proofs and rolling hashes, libbitcoinconsensus
> could potentially safely verify entire blocks. The size of the required
> proofs would be largely irrelevant as they would be consumed locally.
>
> - Others?
>
> ### TxIn De-duplication
>
> Setting aside the potential benefits, the obvious drawback of using a UHS is a
> significant network traffic increase. Fortunately, some properties of
> transactions can be exploited to offset most of the difference.
>
> For quick reference:
>
> p2pkh scriptPubKey: DUP HASH160 [pubkey hash] EQUALVERIFY CHECKSIG
> p2pkh scriptSig: [signature] [pubkey]
>
> p2sh scriptPubKey: HASH160 [script hash] EQUAL
> p2sh scriptSig: [signature(s)] [script]
>
> Notice that if a peer is sending a scriptPubKey and a scriptSig together, as
> they would when using a UHS, there would likely be some redundancy. Using a
> p2sh output for example, the scriptPubKey contains the script hash, and the
> scriptSig contains the script itself. Therefore when sending dereferenced
> prevouts over the wire, any hash which can be computed can be omitted and only
> the preimages sent.
>
> Non-standard output scripts must be sent in-full, though because they account
> for only ~1% of all current UTXOs, they are rare enough to be ignored here.
>
> ### Intra-block Script De-duplication
>
> When transactions are chained together in the same block, dereferenced prevout
> data for these inputs would be redundant, as the full output data is already
> present. For that reason, these dereferenced prevouts can be omitted when
> sending over the wire.
>
> The downside would be a new reconstruction pass requirement prior to
> validation.
>
> ### Data
>
> Here's some preliminary testing with a naive POC implementation patched into
> Bitcoin Core. Note that the final sizes will depend on optimization of the
> serialization format. The format used for these tests is suboptimal for sure.
> Syncing mainnet to block 516346:
>
> UTXO Node UHS Node
> IBD Network Data: 153G 157G
> Block disk space: 175G 157G
> UTXO disk space : 2.8G 1.6G
> Total disk space: 177.8G 158.6G
>
> The on-disk block-space reduction comes from the elimination of the Undo data
> that Bitcoin Core uses to roll back orphaned blocks. For UHS Nodes, this data
> is merged into to the block data and de-duplicated.
>
> Compared to the UXTO model, using a UHS reduces disk space by ~12%, yet only
> requires ~5% more data over the wire.
>
> Experimentation shows intra-block de-duplication to be of little help in
> practice, as it only reduces overhead by ~0.2% on mainnet. It could become more
> useful if, for example, CPFP usage increases substantially in the future.
>
> ### Safety
>
> Assuming sha256 for the UHS's hash function, I don't believe any fundamental
> changes to Bitcoin's security model are introduced. Because the unspent
> transaction output hashes commit to all necessary data, including output types,
> it should not be possible to be tricked into validating using mutated or forged
> inputs.
>
> ### Compatibility
>
> Transitioning from the current UTXO model would be annoying, but not
> particularly painful. I'll briefly describe my current preferred approach, but
> it makes sense to largely ignore this until there's been some discussion about
> UHS in general.
>
> A new service-bit should be allocated to indicate that a node is willing to
> serve blocks and transactions with dereferenced prevout data appended. Once
> connected to a node advertising this feature, nodes would use a new getdata
> flag, creating MSG_PREVDATA_BLOCK and MSG_PREVDATA_TX.
>
> Because current full nodes already have this data readily available in the
> block-undo files, it is trivial to append on-the-fly. For that reason, it would
> be easy to backport patches to the current stable version of Bitcoin Core in
> order to enable serving these blocks even before they could be consumed. This
> would avoid an awkward bootstrapping phase where there may only be a few nodes
> available to serve to all new nodes.
>
> Admittedly I haven't put much thought into the on-disk format, I'd rather leave
> that to a database person. Though it does seem like a reasonable excuse to
> consider moving away from LevelDB.
>
> Wallets would begin holding full prevout data for their unspent outputs, though
> they could probably back-into the data as-is.
>
> ### Serialization
>
> I would prefer to delay this discussion until a more high-level discussion has
> been had, otherwise this would be a magnet for nits. The format used to gather
> the data above can be seen in the implementation below.
>
> It should be noted, though, that the size of a UHS is directly dependent on the
> chosen hash function. Smaller hashes could potentially be used, but I believe
> that to be unwise.
>
> ### Drawbacks
>
> The primary drawback of this approach is the ~5% network ovhead.
>
> Additionally, there is the possibility that a few "bridge nodes" may be needed
> for some time. In a future with a network of only pruned UHS nodes, an old
> wallet with no knowledge of its dereferenced prevout data would need to
> broadcast an old-style transaction, and have a bridge node append the extra
> data before forwarding it along the network.
>
> I won't speculate further there, except to say that I can't imagine a
> transition problem that wouldn't have a straightforward solution.
>
> Migration issues aside, am I missing any obvious drawbacks?
>
> ### Implementation
>
> This code [2] was a quick hack-job, just enough to gather some initial data. It
> builds a UHS in memory and never flushes to disk. Only a single run works,
> nasty things will happen upon restart. It should only be viewed in order to get
> an idea of what changes are needed. Only enough for IBD is implemented,
> mempool/wallet/rpc are likely all broken. It is definitely not consensus-safe.
>
> ### Acknowledgement
>
> I consider the UHS concept to be an evolution of Bram Cohen's TXO bitfield
> idea. Bram also provided invaluable input when initially walking through the
> feasibility of a UHS.
>
> Pieter Wuille's work on Rolling UTXO set hashes served as a catalyst for
> rethinking how the UTXO set may be represented.
>
> Additional thanks to those at at Financial Crypto and the CoreDev event
> afterwards who helped to flesh out the idea:
>
> Tadge Dryja
> Greg Sanders
> John Newbery
> Neha Narula
> Jeremy Rubin
> Jim Posen
> ...and everyone else who has chimed in.
>
>
> [0] https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2017-March/013928.html
> [1] https://lists.linuxfoundation.org/pipermail/bitcoin-dev/2017-May/014337.html
> [2] https://github.com/theuni/bitcoin/tree/utxo-set-hash3
> _______________________________________________
> bitcoin-dev mailing list
> bitcoin-dev@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [bitcoin-dev] UHS: Full-node security without maintaining a full UTXO set
2018-05-16 16:36 [bitcoin-dev] UHS: Full-node security without maintaining a full UTXO set Cory Fields
2018-05-17 15:28 ` Matt Corallo
@ 2018-05-17 16:56 ` Gregory Maxwell
2018-05-17 17:16 ` Cory Fields
2018-06-07 9:39 ` Sjors Provoost
2018-05-18 15:42 ` Alex Mizrahi
2 siblings, 2 replies; 7+ messages in thread
From: Gregory Maxwell @ 2018-05-17 16:56 UTC (permalink / raw)
To: Cory Fields, Bitcoin Protocol Discussion
On Wed, May 16, 2018 at 4:36 PM, Cory Fields via bitcoin-dev
<bitcoin-dev@lists.linuxfoundation.org> wrote:
> Tl;dr: Rather than storing all unspent outputs, store their hashes.
My initial thoughts are it's not _completely_ obvious to me that a 5%
ongoing bandwidth increase is actually a win to get something like a
40% reduction in the size of a pruned node (and less than a 1%
reduction in an archive node) primarily because I've not seen size of
a pruned node cited as a usage limiting factor basically anywhere. I
would assume it is a win but wouldn't be shocked to see a careful
analysis that concluded it wasn't.
But perhaps more interestingly, I think the overhead is not really 5%,
but it's 5% measured in the context of the phenomenally inefficient tx
mechanisms ( https://bitcointalk.org/index.php?topic=1377345.0 ).
Napkin math on the size of a txn alone tells me it's more like a 25%
increase if you just consider size of tx vs size of
tx+scriptpubkeys,amounts. If I'm not missing something there, I think
that would get in into a very clear not-win range.
On the positive side is that it doesn't change the blockchain
datastructure, so it's something implementations could do without
marrying the network to it forever.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [bitcoin-dev] UHS: Full-node security without maintaining a full UTXO set
2018-05-17 16:56 ` Gregory Maxwell
@ 2018-05-17 17:16 ` Cory Fields
2018-06-07 9:39 ` Sjors Provoost
1 sibling, 0 replies; 7+ messages in thread
From: Cory Fields @ 2018-05-17 17:16 UTC (permalink / raw)
To: Gregory Maxwell; +Cc: Bitcoin Dev
[-- Attachment #1: Type: text/plain, Size: 1914 bytes --]
Matt: That's a good point. I'll do up a chart comparing utxo size at each
block, as well as comparing over-the-wire size for each block. I think the
period of coalescing earlier this year should be a good example of what
you're describing.
Greg: heh, I was wondering how long it would take for someone to point out
that I'm cheating. I avoided using the word "compression", mostly to
side-step having the discussion turning into reworking the wire
serialization. But you're absolutely right, the de-duping benefits are
independent of the UHS use-case.
Cory
On Thu, May 17, 2018, 12:56 PM Gregory Maxwell <greg@xiph.org> wrote:
> On Wed, May 16, 2018 at 4:36 PM, Cory Fields via bitcoin-dev
> <bitcoin-dev@lists.linuxfoundation.org> wrote:
> > Tl;dr: Rather than storing all unspent outputs, store their hashes.
>
> My initial thoughts are it's not _completely_ obvious to me that a 5%
> ongoing bandwidth increase is actually a win to get something like a
> 40% reduction in the size of a pruned node (and less than a 1%
> reduction in an archive node) primarily because I've not seen size of
> a pruned node cited as a usage limiting factor basically anywhere. I
> would assume it is a win but wouldn't be shocked to see a careful
> analysis that concluded it wasn't.
>
> But perhaps more interestingly, I think the overhead is not really 5%,
> but it's 5% measured in the context of the phenomenally inefficient tx
> mechanisms ( https://bitcointalk.org/index.php?topic=1377345.0 ).
> Napkin math on the size of a txn alone tells me it's more like a 25%
> increase if you just consider size of tx vs size of
> tx+scriptpubkeys,amounts. If I'm not missing something there, I think
> that would get in into a very clear not-win range.
>
> On the positive side is that it doesn't change the blockchain
> datastructure, so it's something implementations could do without
> marrying the network to it forever.
>
[-- Attachment #2: Type: text/html, Size: 2643 bytes --]
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [bitcoin-dev] UHS: Full-node security without maintaining a full UTXO set
2018-05-16 16:36 [bitcoin-dev] UHS: Full-node security without maintaining a full UTXO set Cory Fields
2018-05-17 15:28 ` Matt Corallo
2018-05-17 16:56 ` Gregory Maxwell
@ 2018-05-18 15:42 ` Alex Mizrahi
2 siblings, 0 replies; 7+ messages in thread
From: Alex Mizrahi @ 2018-05-18 15:42 UTC (permalink / raw)
To: lists, Bitcoin Protocol Discussion
[-- Attachment #1: Type: text/plain, Size: 11847 bytes --]
You should read this:
https://bitcointalk.org/index.php?topic=153662.10
On Wed, May 16, 2018 at 7:36 PM, Cory Fields via bitcoin-dev <
bitcoin-dev@lists.linuxfoundation.org> wrote:
> Tl;dr: Rather than storing all unspent outputs, store their hashes.
> Untrusted
> peers can supply the full outputs when needed, with very little overhead.
> Any attempt to spoof those outputs would be apparent, as their hashes
> would not
> be present in the hash set. There are many advantages to this, most
> apparently
> in disk and memory savings, as well as a validation speedup. The primary
> disadvantage is a small increase in network traffic. I believe that the
> advantages outweigh the disadvantages.
>
> --
>
> Bitcoin’s unspent transaction output set (usually referred to as “The UTXO
> set”) has two primary roles: providing proof that previous outputs exist
> to be
> spent, and providing the actual previous output data for verification when
> new
> transactions attempts to spend them. These roles are not usually discussed
> independently, but as Bram Cohen's TXO Bitfield [0] idea hints, there are
> compelling reasons to consider them this way.
>
> To see why, consider running a node with the following changes:
>
> - For each new output, gather all extra data that will be needed for
> verification when spending it later as an input: the amount,
> scriptPubKey,
> creation height, coinbaseness, and output type (p2pkh, p2sh, p2wpkh,
> etc.).
> Call this the Dereferenced Prevout data.
> - Create a hash from the concatenation of the new outpoint and the
> dereferenced
> prevout data. Call this a Unspent Transaction Output Hash.
> - Rather than storing the full dereferenced prevout entries in a UTXO set
> as is
> currently done, instead store their hashes to an Unspent Transaction
> Output
> Hash Set, or UHS.
> - When relaying a transaction, append the dereferenced prevout for each
> input.
>
> Now when a transaction is received, it contains everything needed for
> verification, including the input amount, height, and coinbaseness, which
> would
> have otherwise required a lookup the UTXO set.
>
> To verify an input's unspentness, again create a hash from the
> concatenation of
> the referenced outpoint and the provided dereferenced prevout, and check
> for
> its presence in the UHS. The hash will only be present if a hash of the
> exact
> same data was previously added to (and not since removed from) the UHS. As
> such, we are protected from a peer attempting to lie about the dereferenced
> prevout data.
>
> ### Some benefits of the UHS model
>
> - Requires no consensus changes, purely a p2p/implementation change.
>
> - UHS is substantially smaller than a full UTXO set (just over half for the
> main chain, see below). In-memory caching can be much more effective as a
> result.
>
> - A block’s transactions can be fully verified before doing a potentially
> expensive database lookup for the previous output data. The UHS can be
> queried afterwards (or in parallel) to verify previous output inclusion.
>
> - Entire blocks could potentially be verified out-of-order because all
> input
> data is provided; only the inclusion checks have to be in-order.
> Admittedly
> this is likely too complicated to be realistic.
>
> - pay-to-pubkey outputs are less burdensome on full nodes, since they use
> no
> more space on-disk than pay-to-pubkey-hash or pay-to-script-hash.
> Taproot and
> Graftroot outputs may share the same benefits.
>
> - The burden of holding UTXO data is technically shifted from the
> verifiers to
> the spender. In reality, full nodes will likely always have a copy as
> well,
> but conceptually it's a slight improvement to the incentive model.
>
> - Block data from peers can also be used to roll backwards during a reorg.
> This
> potentially enables an even more aggressive pruning mode.
>
> - UTXO storage size grows exactly linearly with UTXO count, as opposed to
> growing linearly with UTXO data size. This may be relevant when
> considering
> new larger output types which would otherwise cause the UTXO Set size to
> increase more quickly.
>
> - The UHS is a simple set, no need for a key-value database. LevelDB could
> potentially be dropped as a dependency in some distant future.
>
> - Potentially integrates nicely with Pieter Wuille's Rolling UTXO set
> hashes
> [1]. Unspent Transaction Output Hashes would simply be mapped to points
> on a
> curve before adding them to the set.
>
> - With the help of inclusion proofs and rolling hashes, libbitcoinconsensus
> could potentially safely verify entire blocks. The size of the required
> proofs would be largely irrelevant as they would be consumed locally.
>
> - Others?
>
> ### TxIn De-duplication
>
> Setting aside the potential benefits, the obvious drawback of using a UHS
> is a
> significant network traffic increase. Fortunately, some properties of
> transactions can be exploited to offset most of the difference.
>
> For quick reference:
>
> p2pkh scriptPubKey: DUP HASH160 [pubkey hash] EQUALVERIFY CHECKSIG
> p2pkh scriptSig: [signature] [pubkey]
>
> p2sh scriptPubKey: HASH160 [script hash] EQUAL
> p2sh scriptSig: [signature(s)] [script]
>
> Notice that if a peer is sending a scriptPubKey and a scriptSig together,
> as
> they would when using a UHS, there would likely be some redundancy. Using a
> p2sh output for example, the scriptPubKey contains the script hash, and the
> scriptSig contains the script itself. Therefore when sending dereferenced
> prevouts over the wire, any hash which can be computed can be omitted and
> only
> the preimages sent.
>
> Non-standard output scripts must be sent in-full, though because they
> account
> for only ~1% of all current UTXOs, they are rare enough to be ignored here.
>
> ### Intra-block Script De-duplication
>
> When transactions are chained together in the same block, dereferenced
> prevout
> data for these inputs would be redundant, as the full output data is
> already
> present. For that reason, these dereferenced prevouts can be omitted when
> sending over the wire.
>
> The downside would be a new reconstruction pass requirement prior to
> validation.
>
> ### Data
>
> Here's some preliminary testing with a naive POC implementation patched
> into
> Bitcoin Core. Note that the final sizes will depend on optimization of the
> serialization format. The format used for these tests is suboptimal for
> sure.
> Syncing mainnet to block 516346:
>
> UTXO Node UHS Node
> IBD Network Data: 153G 157G
> Block disk space: 175G 157G
> UTXO disk space : 2.8G 1.6G
> Total disk space: 177.8G 158.6G
>
> The on-disk block-space reduction comes from the elimination of the Undo
> data
> that Bitcoin Core uses to roll back orphaned blocks. For UHS Nodes, this
> data
> is merged into to the block data and de-duplicated.
>
> Compared to the UXTO model, using a UHS reduces disk space by ~12%, yet
> only
> requires ~5% more data over the wire.
>
> Experimentation shows intra-block de-duplication to be of little help in
> practice, as it only reduces overhead by ~0.2% on mainnet. It could become
> more
> useful if, for example, CPFP usage increases substantially in the future.
>
> ### Safety
>
> Assuming sha256 for the UHS's hash function, I don't believe any
> fundamental
> changes to Bitcoin's security model are introduced. Because the unspent
> transaction output hashes commit to all necessary data, including output
> types,
> it should not be possible to be tricked into validating using mutated or
> forged
> inputs.
>
> ### Compatibility
>
> Transitioning from the current UTXO model would be annoying, but not
> particularly painful. I'll briefly describe my current preferred approach,
> but
> it makes sense to largely ignore this until there's been some discussion
> about
> UHS in general.
>
> A new service-bit should be allocated to indicate that a node is willing to
> serve blocks and transactions with dereferenced prevout data appended. Once
> connected to a node advertising this feature, nodes would use a new getdata
> flag, creating MSG_PREVDATA_BLOCK and MSG_PREVDATA_TX.
>
> Because current full nodes already have this data readily available in the
> block-undo files, it is trivial to append on-the-fly. For that reason, it
> would
> be easy to backport patches to the current stable version of Bitcoin Core
> in
> order to enable serving these blocks even before they could be consumed.
> This
> would avoid an awkward bootstrapping phase where there may only be a few
> nodes
> available to serve to all new nodes.
>
> Admittedly I haven't put much thought into the on-disk format, I'd rather
> leave
> that to a database person. Though it does seem like a reasonable excuse to
> consider moving away from LevelDB.
>
> Wallets would begin holding full prevout data for their unspent outputs,
> though
> they could probably back-into the data as-is.
>
> ### Serialization
>
> I would prefer to delay this discussion until a more high-level discussion
> has
> been had, otherwise this would be a magnet for nits. The format used to
> gather
> the data above can be seen in the implementation below.
>
> It should be noted, though, that the size of a UHS is directly dependent
> on the
> chosen hash function. Smaller hashes could potentially be used, but I
> believe
> that to be unwise.
>
> ### Drawbacks
>
> The primary drawback of this approach is the ~5% network ovhead.
>
> Additionally, there is the possibility that a few "bridge nodes" may be
> needed
> for some time. In a future with a network of only pruned UHS nodes, an old
> wallet with no knowledge of its dereferenced prevout data would need to
> broadcast an old-style transaction, and have a bridge node append the extra
> data before forwarding it along the network.
>
> I won't speculate further there, except to say that I can't imagine a
> transition problem that wouldn't have a straightforward solution.
>
> Migration issues aside, am I missing any obvious drawbacks?
>
> ### Implementation
>
> This code [2] was a quick hack-job, just enough to gather some initial
> data. It
> builds a UHS in memory and never flushes to disk. Only a single run works,
> nasty things will happen upon restart. It should only be viewed in order
> to get
> an idea of what changes are needed. Only enough for IBD is implemented,
> mempool/wallet/rpc are likely all broken. It is definitely not
> consensus-safe.
>
> ### Acknowledgement
>
> I consider the UHS concept to be an evolution of Bram Cohen's TXO bitfield
> idea. Bram also provided invaluable input when initially walking through
> the
> feasibility of a UHS.
>
> Pieter Wuille's work on Rolling UTXO set hashes served as a catalyst for
> rethinking how the UTXO set may be represented.
>
> Additional thanks to those at at Financial Crypto and the CoreDev event
> afterwards who helped to flesh out the idea:
>
> Tadge Dryja
> Greg Sanders
> John Newbery
> Neha Narula
> Jeremy Rubin
> Jim Posen
> ...and everyone else who has chimed in.
>
>
> [0] https://lists.linuxfoundation.org/pipermail/bitcoin-dev/
> 2017-March/013928.html
> [1] https://lists.linuxfoundation.org/pipermail/bitcoin-dev/
> 2017-May/014337.html
> [2] https://github.com/theuni/bitcoin/tree/utxo-set-hash3
> _______________________________________________
> bitcoin-dev mailing list
> bitcoin-dev@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev
>
[-- Attachment #2: Type: text/html, Size: 13309 bytes --]
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [bitcoin-dev] UHS: Full-node security without maintaining a full UTXO set
2018-05-17 16:56 ` Gregory Maxwell
2018-05-17 17:16 ` Cory Fields
@ 2018-06-07 9:39 ` Sjors Provoost
2018-06-10 23:07 ` Jim Posen
1 sibling, 1 reply; 7+ messages in thread
From: Sjors Provoost @ 2018-06-07 9:39 UTC (permalink / raw)
To: Bitcoin Protocol Discussion, Gregory Maxwell
eMMC storage, which low end devices often use, come in 2x increments. Running a pruned full node on 8 GB is difficult if not impossible (the UTXO set peaked at 3.5 GB in January, but a full node stores additional stuff).
However, 16 GB is only €10 more expensive and presumably standard by the time this would be rolled out.
On AWS every GB of SSD storage avoided saves $1 per year, not end of the world stuff, but not negligible either. Outbound traffic costs $0.10 / GB (ignoring free allowance), so when uploading 200 GB per year, the 5% would offset $1 of storage cost savings.
The above seems marginal, probably not worth it unless there’s really no downside.
What I find attractive about this proposal is the ability to squeeze more out of limited RAM (typically only 1 or 2 GB on these low end devices). I’d have to test Cory’s branch to see if that actually matters in practice.
It’s also useful to distinguish benefits during initial sync from ongoing operation. The former I’ve almost given up on for low end devices (can take weeks), in favor of doing it on a faster computer and copying the result. The latter needs far less RAM, so perhaps this proposal doesn’t help much there, but that would be useful to measure.
Did you try the recent SHA256 optimizations on your branch?
Sjors
> Op 17 mei 2018, om 18:56 heeft Gregory Maxwell via bitcoin-dev <bitcoin-dev@lists.linuxfoundation.org> het volgende geschreven:
>
> On Wed, May 16, 2018 at 4:36 PM, Cory Fields via bitcoin-dev
> <bitcoin-dev@lists.linuxfoundation.org> wrote:
>> Tl;dr: Rather than storing all unspent outputs, store their hashes.
>
> My initial thoughts are it's not _completely_ obvious to me that a 5%
> ongoing bandwidth increase is actually a win to get something like a
> 40% reduction in the size of a pruned node (and less than a 1%
> reduction in an archive node) primarily because I've not seen size of
> a pruned node cited as a usage limiting factor basically anywhere. I
> would assume it is a win but wouldn't be shocked to see a careful
> analysis that concluded it wasn't.
>
> But perhaps more interestingly, I think the overhead is not really 5%,
> but it's 5% measured in the context of the phenomenally inefficient tx
> mechanisms ( https://bitcointalk.org/index.php?topic=1377345.0 ).
> Napkin math on the size of a txn alone tells me it's more like a 25%
> increase if you just consider size of tx vs size of
> tx+scriptpubkeys,amounts. If I'm not missing something there, I think
> that would get in into a very clear not-win range.
>
> On the positive side is that it doesn't change the blockchain
> datastructure, so it's something implementations could do without
> marrying the network to it forever.
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [bitcoin-dev] UHS: Full-node security without maintaining a full UTXO set
2018-06-07 9:39 ` Sjors Provoost
@ 2018-06-10 23:07 ` Jim Posen
0 siblings, 0 replies; 7+ messages in thread
From: Jim Posen @ 2018-06-10 23:07 UTC (permalink / raw)
To: Bitcoin Protocol Discussion
[-- Attachment #1: Type: text/plain, Size: 4225 bytes --]
I generally like the direction of this proposal in terms of allowing full
nodes to run with a different storage/bandwidth tradeoff. Cory, were this
implemented, would you expect Core to support both operating modes (full
UTXO set and UHS) depending on user configuration, or would UHS be
mandatory?
Also, given that Bram Cohen's TXO bitfield proposal was an inspiration for
this, could you comment on why the UHS is preferable to that approach? An
alternative that goes even further in the direction of more bandwidth, less
storage, would be for nodes to simply maintain a Merkle Mountain Range over
all TXOs in order of creation and a spentness bitfield. Blocks could be
requested with the prev outputs and a Merkle proof linking them into the
MMR root. Since the Merkle proof is deterministic, it could be computed by
archive nodes and miners and saved alongside the block data for relay.
Another benefit of this is the TXO MMR root may be independently useful if
committed into the coinbase transaction.
On Thu, Jun 7, 2018 at 7:02 AM Sjors Provoost via bitcoin-dev <
bitcoin-dev@lists.linuxfoundation.org> wrote:
> eMMC storage, which low end devices often use, come in 2x increments.
> Running a pruned full node on 8 GB is difficult if not impossible (the UTXO
> set peaked at 3.5 GB in January, but a full node stores additional stuff).
>
> However, 16 GB is only €10 more expensive and presumably standard by the
> time this would be rolled out.
>
> On AWS every GB of SSD storage avoided saves $1 per year, not end of the
> world stuff, but not negligible either. Outbound traffic costs $0.10 / GB
> (ignoring free allowance), so when uploading 200 GB per year, the 5% would
> offset $1 of storage cost savings.
>
> The above seems marginal, probably not worth it unless there’s really no
> downside.
>
> What I find attractive about this proposal is the ability to squeeze more
> out of limited RAM (typically only 1 or 2 GB on these low end devices). I’d
> have to test Cory’s branch to see if that actually matters in practice.
>
> It’s also useful to distinguish benefits during initial sync from ongoing
> operation. The former I’ve almost given up on for low end devices (can
> take weeks), in favor of doing it on a faster computer and copying the
> result. The latter needs far less RAM, so perhaps this proposal doesn’t
> help much there, but that would be useful to measure.
>
> Did you try the recent SHA256 optimizations on your branch?
>
> Sjors
>
> > Op 17 mei 2018, om 18:56 heeft Gregory Maxwell via bitcoin-dev <
> bitcoin-dev@lists.linuxfoundation.org> het volgende geschreven:
> >
> > On Wed, May 16, 2018 at 4:36 PM, Cory Fields via bitcoin-dev
> > <bitcoin-dev@lists.linuxfoundation.org> wrote:
> >> Tl;dr: Rather than storing all unspent outputs, store their hashes.
> >
> > My initial thoughts are it's not _completely_ obvious to me that a 5%
> > ongoing bandwidth increase is actually a win to get something like a
> > 40% reduction in the size of a pruned node (and less than a 1%
> > reduction in an archive node) primarily because I've not seen size of
> > a pruned node cited as a usage limiting factor basically anywhere. I
> > would assume it is a win but wouldn't be shocked to see a careful
> > analysis that concluded it wasn't.
> >
> > But perhaps more interestingly, I think the overhead is not really 5%,
> > but it's 5% measured in the context of the phenomenally inefficient tx
> > mechanisms ( https://bitcointalk.org/index.php?topic=1377345.0 ).
> > Napkin math on the size of a txn alone tells me it's more like a 25%
> > increase if you just consider size of tx vs size of
> > tx+scriptpubkeys,amounts. If I'm not missing something there, I think
> > that would get in into a very clear not-win range.
> >
> > On the positive side is that it doesn't change the blockchain
> > datastructure, so it's something implementations could do without
> > marrying the network to it forever.
> >
>
> _______________________________________________
> bitcoin-dev mailing list
> bitcoin-dev@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev
>
[-- Attachment #2: Type: text/html, Size: 5116 bytes --]
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2018-06-10 23:07 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-16 16:36 [bitcoin-dev] UHS: Full-node security without maintaining a full UTXO set Cory Fields
2018-05-17 15:28 ` Matt Corallo
2018-05-17 16:56 ` Gregory Maxwell
2018-05-17 17:16 ` Cory Fields
2018-06-07 9:39 ` Sjors Provoost
2018-06-10 23:07 ` Jim Posen
2018-05-18 15:42 ` Alex Mizrahi
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox