* [Bitcoin-development] Privacy and blockchain data
@ 2014-01-06 2:53 Peter Todd
2014-01-08 6:34 ` Jeremy Spilman
0 siblings, 1 reply; 3+ messages in thread
From: Peter Todd @ 2014-01-06 2:53 UTC (permalink / raw)
To: bitcoin-development
[-- Attachment #1: Type: text/plain, Size: 12997 bytes --]
* Summary
CoinJoin, CoinSwap and similar technologies improve your privacy by
making sure information about what coins you own doesn't make it into
the blockchain, but syncing your wallet is a privacy risk in itself and
can easily leak that same info. Here's an overview of that risk, how to
quantify it, and how to reduce it efficiently.
* Background
In the most general sense a Bitcoin wallet is a collection of one or
more scriptPubKeys, often known as addresses.(*) The basic purpose of
the wallet is maintain the set of all transaction outputs (txouts)
matching the scriptPubKeys in the wallet. Secondary to that purpose is
to maintain the set of all transactions associated with scriptPubKeys in
the wallet; almost all (all?) wallet software maintains transaction
information rather than only txout data. Usually, but not always, the
wallet will have some mechanism to spend transaction outputs, creating
new transactions. (if the wallet doesn't it is referred to as a
watch-only wallet)
Given a full set of blockchain data the task of keeping the set of all
relevant transactions and txouts up-to-date is simple: scan the
blockchain for the relevant data. The challenge is to devise systems
where wallets can be kept up to date without this requirement in a way
that is secure, efficient, scalable, and meets the user's privacy
requirements.
*) Alternatively addresses can be thought of as instructions to the
payor as to how to generate a scriptPubKey that the payee can spend,
a subtlety different concept.
* Threat Model and Goals
Currently the Bitcoin network consists of a large (low thousands) number
of allegedly independent nodes. There is no mechanism to prevent an
attacker from sybil attacking the network other than the availability of
IP addresses. This protection is made even weaker by the difficulty of
being sure you have a non-sybilled list of nodes to connect too; IP
addresses are passed gossip-style with no authentication.
From a privacy perspective we are conservative and assume an active,
internal, and global attacker - using the terminology of Diaz et al.(1)
- that controls up to 100% of the nodes you are connected too. With
regard to retrieval of blockchain data we can use the Sweeney's notion
of k-anonymity(2) where the privacy-sensitive data for an individual is
obscured by it's inclusion in a data of a large set of individuals, the
anonymity set.
* Basic Functionality
With regard to blockchain data we have the following basic functions:
** Spending funds
The user creates a transaction and gets it to miners by some method,
usually the P2P network although also possibly by direct submission.
Either way privacy can be achieved through a mix network such as Tor
and/or relaying other users' transactions so as to embed yours within a
larger anonymity set. In some cases payment protocols can shift the
problem to the recipient of the funds. Using CoinJoin also helps
increase the anonymity set.
Usually the sender will want to determine when the transaction confirms;
once the transaction has confirmed modulo a reorganization the
confirmation count can only increase. Transaction mutability and
double-spends by malicious CoinJoin participants complicate the task of
detecting confirmation: ideally we could simply query for the presence
of a given txid in each new block, however the transaction could be
mutated, changing the txid. The most simple way to detect confirmation
is then to query for spends of the txouts spend by the transaction.
** Receiving new funds
While in the future payment protocols may give recipients transaction
information directly it is most likely that wallets will continue to
have to query peers for new transactions paying scriptPubKey's under the
user's control for the forseeable future.
** Detection of unauthorized spends
Users' want early detection of private key compromise, accomplished by
querying blockchain data for spends from txouts in their wallets. This
has implications for how change must be handled, discussed below.
* Scalability/Efficiency
The total work done by the system as a whole for all queries given some
number of transactions n is the scalability of the scheme. In addition
scalability, and privacy in some cases, is improved if work can be
easily spread out across multiple nodes both at a per-block and
within-block level.
* Reliability/Robustness
Deterministic wallets using BIP32 or similar, where all private keys are
derived from a fixed seed, have proven to be extremely popular with
users for their simple backup model. While losing transaction metadata
after a data-loss event is unfortunate, losing access to all funds is a
disaster. Any address generation scheme must take this into account and
make it possible for all funds to be recovered quickly and efficiently
from blockchain data. Preserving privacy during this recovery is a
consideration, but 100% recovery of funds should not be sacrificed for
that goal.
* Query schemes
** Bloom filters
BIP37 bloom filters are currently implemented by the Bitcoin reference
implementation and used by bitcoinj-based SPV clients. Bloom filters
achieve a privacy-bandwidth tradeoff by having probabalistic
false-positives; the false-positives comprise the anonymity set.
Boom filters have a number of problems, both in the specific BIP37
implementation, as well as fundemental to the idea. Scalability is a
serious problem: the client sends asks a peer with a copy of all
blockchain data to filter data sent to the client, limiting the client's
bandwidth to only the data they are interested in. In the typical case
of a SPV wallet syncronizing against m new blocks this requires the peer
to read those m blocks from disk in their entirety, apply the filter,
and send the client the subset of matching transactions. Obviously this
results in poor O(n^2) scaling for n clients each making some fixed
number of transactions.
Of course bloom filters are attractive in that they have very good
performance per match, but this performance is only really relevant for
the most recent blockchain information where the data is in RAM. For
older information they make possible the Bloom IO attack where an
attacker uses an inordinant amount of disk IO bandwidth at little cost
to themselves.(3)
The actual BIP37 standard, and existing implementations of it, have a
number of other flaws that reduce privacy. For instance the standard
lets the seed value of the hash function be tweaked with a 32-bit
integer, nTweak. However on the one hand if randomly chosen and rarely
changed, as suggested by BIP37, the 32-bit integer can be used by an
attacker to correlate multiple connections from the same wallet. On the
other hand if nTweak is changed an attacker that can link multiple bloom
filters can AND those filters together to greatly decrease the
false-positive rate and determine exactly what funds are in the user's
wallet.
** Prefix filters
With a randomly distributed keyspace - common in cryptographic
applications - clients can query using variable length prefixes that
partially match the desired keys. A very simple format for a query of n
prefixes will look like the following:
<1 byte length in bits> <1 to 256/8 bytes of prefix>
...
...
0x00
The anonymity set is then the blockchain data whose key is the same
prefix, usually H(scriptPubKey) or scriptPubKey directly. An important
advantage of prefix filters is compatibility with the proposed (U)TXO
commitment schemes: the prefix maps directly to the committed
scriptPubKey lookup trees, and nodes simply return all entries matching
the prefix, as well the the rest of the merkle path to the blockchain
headers proving the data is valid.
While bloom filters have O(n) cost per lookup, or O(n^2) scalability
system-wide, prefix filters have significantly better O(log n) cost per
lookup, or O(n log n) system-wide. It's also worth noting that a naive
implementation can achieve very similar performance to bloom filters
without bothering to build key-value indexes by just scanning blockchain
data; once the data is hashed testing the hash against a prefix has a
minimal cost.
** Cryptographically blinded schemes
There are many blinded database query schemes in existence. While we do
not reject such schemes completely, technologies that rely on simple and
easy-to-understand cryptography have a significant advantage in their
simplicity. In addition such complex schemes are unlikely to ever be
made into miner commitments and thus are less trustworthy in the long
run.
* Correlation attacks
It is often advantageous if blockchain queries can be efficiently spread
across multiple servers to avoid allowing the attacker to correllate the
information into a whole. If you have n addresses that need to be
watched for new transactions, splitting the queries across m nodes
reduces the information any one node may learn. With bloom filters doing
this is extremely costly as the full blockchain data needs to be read
from disk to apply the filter; with prefix filters if the nodes have
appropriate indexes there is little overhead to splitting the queries
and no performance loss.
* DoS attacks
A possible DoS attack on bandwidth is to insert a large amount of
blockchain data matching the target's filter; the BIP37 nTweak parameter
was an attempt to avoid this problem, although with privacy tradeoffs.
Blockchain data is an extremely expensive communications channel so we
do not consider this a serious issue. Implementations may wish to give
clients the ability to specify a filter for information they do not want
to avoid unintentional collisions, although hopefully in the future the
address reuse making this a potential problem will become less common.
* Address use, management, and generation
If privacy was not a consideration the most efficient mode of operation
would be to use a single address, as is done by many existing wallets,
notably the bitcoinj-derived Multibit and Android Wallet, both of which
use bloom filters. In addition to strongly encouraging address re-use,
neither provide the user any control over the privacy/bandwidth tradeoff
given by bloom filters; the default settings have an extremely low
false-positive rate that is a significant privacy risk.
Taking privacy into account better clients such as Electrum, Armory, and
Bitcoin Core discourage the re-use of addresses in their UIs, and send
change to new addresses. However this leads to problem with user
expectations: users expect it to be possible to be notified quickly of
new transactions paying any address ever generated by their wallet, as
well as unauthorized spends from any txout, yet for privacy each query
for transactions related to the address/txout must match false-positives
that consume bandwidth; for a fixed bandwidth budget the specificity and
size of the filter must increase over time.
We have two main avenues to solve this problem:
1) Txin-reuse: Continue to reinforce the idea that transaction inputs
have no particular relationship to outputs. Using them for refunds or
other purposes implying "ownership" must be strongly discouraged.
CoinJoin will help here. If addresses associated with change txouts
are truly one-time-use, we can reduce or eliminate queries associated
with them. In particular, while the set of all change addresses ever
used will grow linearly with time, the set of all change addresses
with funds in them will remain roughly stable - it's this set that
needs to be queried to detect unauthorized spends.
2) Common prefixes: Generate addresses such that for a given wallet they
all share a fixed prefix. The length of that prefix determines the
anonymity set and associated privacy/bandwidth tradeoff, which
remainds a fixed ratio of all transactions for the life of the
wallet.
With this approach change addresses continue to be generated randomly, a
requirement for CoinJoin privacy. There is some statistical information
leaked if many non-change txouts are spent in a single transaction in a
CoinJoin, but even that leak can be avoided with the authors
OP_RETURN-based stealth addresses proposal. (to be published)
The actual prefix-forcing scheme in many cases will have to be
brute-force search; fortunately the search space involved is reasonably
small, ~2 to ~16 bits, and can be done as a background task.
1) Towards Measuring Anonymity, Claudia Diaz and Stefaan Seys and Joris
Claessens and Bart Preneel (April 2002)
2) k-Anonymity: A Model for Protecting Privacy, Latanya Sweeney, May
2002
3) Private discussions with developers.
--
'peter'[:-1]@petertodd.org
000000000000000f9102d27cfd61ea9e8bb324593593ca3ce6ba53153ff251b3
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 685 bytes --]
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [Bitcoin-development] Privacy and blockchain data
2014-01-06 2:53 [Bitcoin-development] Privacy and blockchain data Peter Todd
@ 2014-01-08 6:34 ` Jeremy Spilman
2014-01-10 10:10 ` Peter Todd
0 siblings, 1 reply; 3+ messages in thread
From: Jeremy Spilman @ 2014-01-08 6:34 UTC (permalink / raw)
To: bitcoin-development, Peter Todd
>
> 2) Common prefixes: Generate addresses such that for a given wallet they
> all share a fixed prefix. The length of that prefix determines the
> anonymity set and associated privacy/bandwidth tradeoff, which
> remainds a fixed ratio of all transactions for the life of the
> wallet.
>
Interesting thought to make the privacy/bandwidth trade-off using
vanitygen and prefix filters.
But doesn't this effectively expand the universe of potential spies from
'the global attacker' who is watching your SPV queries, to simply 'the
globe' -- anyone with a copy of the blockchain?
Some stats on UTXO set size: (slightly stale -- as of block 270733)
7.4m unspent outputs
2.2m transactions with unspent outputs
2.1m unique unspent scriptPubKeys
Side note: the top 1,000 scriptPubKeys have 10% of all unspent outputs.
Let's say you use an 8-bit prefix (1/256) that would be ~10,000
transactions in the UTXO you would be monitoring. But if I knew a few
different days / time-periods you transacted, I could figure out your
prefix.
Of course, anyone you transact with would know your prefix outright.
Wouldn't this also allow obvious identification of spend versus change
addresses in a transaction?
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [Bitcoin-development] Privacy and blockchain data
2014-01-08 6:34 ` Jeremy Spilman
@ 2014-01-10 10:10 ` Peter Todd
0 siblings, 0 replies; 3+ messages in thread
From: Peter Todd @ 2014-01-10 10:10 UTC (permalink / raw)
To: Jeremy Spilman; +Cc: bitcoin-development
[-- Attachment #1: Type: text/plain, Size: 3843 bytes --]
On Tue, Jan 07, 2014 at 10:34:46PM -0800, Jeremy Spilman wrote:
> >
> >2) Common prefixes: Generate addresses such that for a given wallet they
> > all share a fixed prefix. The length of that prefix determines the
> > anonymity set and associated privacy/bandwidth tradeoff, which
> > remainds a fixed ratio of all transactions for the life of the
> > wallet.
> >
>
> Interesting thought to make the privacy/bandwidth trade-off using
> vanitygen and prefix filters.
>
> But doesn't this effectively expand the universe of potential spies
> from 'the global attacker' who is watching your SPV queries, to
> simply 'the globe' -- anyone with a copy of the blockchain?
It's a trade-off. Most people are going to use public peers for their
SPV nodes - they're not going to run full nodes. They also are going to
want to limit how much bandwidth they use to sync their wallets; if they
don't care the can use a very short, or no, prefix and the problem goes
away.
If you make that bandwidth/privacy trade-off by using very specific
filters and non-specific addresses then you have a situation where those
public peers are learning a lot of potentially valuable data. It's easy
to imagine, say, the IRS being willing to pay for data on how many
Bitcoins people have in their wallets to try to catch tax cheats for
instance, and that can easily fund a lot of fast and high-quality peers
that don't advertise the fact that they're selling data on the contents
of your wallet.
On the other hand if you use non-specific filters, and prefixed
addresses for incoming payments, then you're not leaking high-quality
information to anyone. I think this makes for a more robust Bitcoin
system, especially as we need things like CoinJoin for privacy that make
*everyones* privacy matter to you; CoinJoin could easily be defeated by
aquiring lots of good info on the contents of wallets through SPV
queries.
> Some stats on UTXO set size: (slightly stale -- as of block 270733)
>
> 7.4m unspent outputs
> 2.2m transactions with unspent outputs
> 2.1m unique unspent scriptPubKeys
> Side note: the top 1,000 scriptPubKeys have 10% of all unspent outputs.
>
> Let's say you use an 8-bit prefix (1/256) that would be ~10,000
> transactions in the UTXO you would be monitoring. But if I knew a
> few different days / time-periods you transacted, I could figure out
> your prefix.
Actually UTXO isn't the right way to look at this; prefix filters would
be almost certainly matched against all txouts in blocks. Or put another
way, UTXO isn't the right way to look at it because the attacker will
have some rough idea of the time period, and wants to know about
transactions made.
> Of course, anyone you transact with would know your prefix outright.
Well what good, in your example, is it for the attacker to go from "I
know my target gets a paycheck every two weeks for $x" to "His wallet
prefix is abcd with y% probability"? Even once you learn the prefix of
your target's wallet, what funds they actually own is still embedded in
a much larger anonymity set of hundreds to thousands of transactions
that had nothing to do with them.
> Wouldn't this also allow obvious identification of spend versus
> change addresses in a transaction?
No, I specifically said that you don't want to use prefixes with change
txouts for that reason. Fortunately while the set of all scriptPubKey's
ever used for change txouts will grow over time, as long as you are not
watching for new payments on any key in that set you only need to query
for the ones that still have funds on them, and that's only because you
want to be able to detect unauthorized spends of them.
--
'peter'[:-1]@petertodd.org
00000000000000028a5c9edabc9697fe96405f667be1d6d558d1db21d49b8857
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 685 bytes --]
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2014-01-10 10:11 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-06 2:53 [Bitcoin-development] Privacy and blockchain data Peter Todd
2014-01-08 6:34 ` Jeremy Spilman
2014-01-10 10:10 ` Peter Todd
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox