public inbox for bitcoindev@googlegroups.com
 help / color / mirror / Atom feed
* [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions
@ 2015-11-30 23:12 Peter Tschipper
  2015-12-01  5:28 ` Matt Corallo
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Tschipper @ 2015-11-30 23:12 UTC (permalink / raw)
  To: gmaxwell, Bitcoin Dev


@gmaxwell Bip Editor, and the Bitcoin Dev Community,

After several weeks of experimenting and testing with various
compression libraries, I think there is enough evidence to show that
compressing blocks and transactions is not only beneficial in reducing
network bandwidth but also provides a small performance boost when
there is latency on the network.

The following is a BIP draft document for your review.
(The column alignment of the tables doesn't come out looking right in
this email, but if you cut and paste into a text document they are
just fine.)


<pre>
  BIP: ?
  Title: Datastream compression of Blocks and Tx's
  Author: Peter Tschipper <peter.tschipper@gmail.com>
  Status: Draft
  Type: Standards Track
  Created: 2015-11-30
</pre>

==Abstract==

To compress blocks and transactions, and to concatenate them together
when possible, before sending.

==Motivation==

Bandwidth is an issue for users who run nodes in regions where
bandwidth is expensive and subject to caps; in addition, network latency
in some regions can be quite high. By compressing data we can
significantly reduce daily bandwidth use while at the same time speeding
up the transmission of data throughout the network. This should
encourage users to keep their nodes running longer and allow for more
peer connections with less need for bandwidth throttling, and it may
also encourage users in areas of marginal internet connectivity to run
nodes where in the past they would not have been able to.

==Specification==

Advertise compression using a service bit.  Both peers must have
compression turned on in order for data to be compressed, sent, and
decompressed.

Blocks will be sent compressed.

Transactions will be sent compressed with the exception of those less
than 500 bytes.

Blocks will be concatenated when possible.

Transactions will be concatenated when possible or when a
MSG_FILTERED_BLOCK is requested.

Compression levels to be specified in "bitcoin.conf".

Compression and decompression can be completely turned off.

Although unlikely, if compression should fail then data will be sent
uncompressed.

The code for compressing and decompressing will be located in class
CDataStream.

Compression library LZO1x will be used.
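
The per-message decision logic implied by the points above can be
sketched as follows. This is an illustrative sketch only: it uses
Python's stdlib zlib in place of LZO (which has no standard binding),
and the helper name and argument shapes are assumptions, not the
reference implementation.

```python
import zlib

MIN_TX_COMPRESS_SIZE = 500  # transactions below this are sent uncompressed


def maybe_compress(payload: bytes, level: int = 1, is_tx: bool = False):
    """Return (data, is_compressed).  Falls back to the raw bytes when
    compression is disabled (level 0), the payload is a small
    transaction, or the compressor fails."""
    if level == 0 or (is_tx and len(payload) < MIN_TX_COMPRESS_SIZE):
        return payload, False
    try:
        compressed = zlib.compress(payload, level)
    except zlib.error:
        # Unlikely, but per the specification the data is then sent
        # uncompressed rather than dropping the message.
        return payload, False
    return compressed, True
```

The receiver would apply the inverse dispatch: decompress when the
is_compressed flag is set, otherwise consume the bytes directly.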

==Rationale==

By using a service bit, compression and decompression can be turned
on/off completely at both ends with a simple configuration setting. It
is important to be able to easily turn off compression/decompression as
a fall back mechanism.  Using a service bit also makes the code fully
compatible with any node that does not currently support compression. A
node that does not present the correct service bit will simply receive
data in standard uncompressed format.

All blocks will be compressed. Even small blocks have been found to
benefit from compression.
 
Multiple block requests that are in queue will be concatenated together
when possible to increase compressibility of smaller blocks.
Concatenation will happen only if there are multiple block requests from
the same remote peer.  For example, if peer1 requests two blocks and
both are in queue, those two blocks will be concatenated. However, if
peer1 requests one block and peer2 another, and both are in queue, each
peer is sent only its own block and no concatenation occurs. Up to 16
blocks (the maximum blocks in flight) can be concatenated, not exceeding
the MAX_PROTOCOL_MESSAGE_LENGTH.
Concatenated blocks compress better and further reduce bandwidth.
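
The batching rule described above can be sketched as follows. This is a
sketch under stated assumptions: the constant values are placeholders
for illustration, not the actual protocol limits, and the function name
is hypothetical.

```python
MAX_BLOCKS_IN_FLIGHT = 16
MAX_PROTOCOL_MESSAGE_LENGTH = 2 * 1024 * 1024  # placeholder value


def batch_blocks(queued):
    """Group one peer's queued blocks into batches of at most 16 blocks
    whose combined size stays within the message-length limit; each
    batch would then be compressed and sent as a single message."""
    batches, current, size = [], [], 0
    for blk in queued:
        if current and (len(current) == MAX_BLOCKS_IN_FLIGHT
                        or size + len(blk) > MAX_PROTOCOL_MESSAGE_LENGTH):
            batches.append(current)   # flush the full batch
            current, size = [], 0
        current.append(blk)
        size += len(blk)
    if current:
        batches.append(current)
    return batches
```

Note that batching is per-peer: blocks queued for different peers would
never be passed to the same call.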

Transactions below 500 bytes do not compress well and will be sent
uncompressed unless they can be concatenated (see Table 3).

Multiple transaction requests that are in queue will be concatenated
when possible.  This further reduces bandwidth needs and speeds the
transfer of large requests for many transactions, such as with
MSG_FILTERED_BLOCK requests, or when the system gets busy and is flooded
with transactions.  Concatenation happens in the same way as for blocks,
described above.

By allowing for differing compression levels which can be specified in
the bitcoin.conf file, a node operator can tailor their compression to a
level suitable for their system.

Although unlikely, if compression fails for any reason then blocks and
transactions will be sent uncompressed.  Therefore, even with
compression turned on, a node will be able to handle both compressed and
uncompressed data from another peer.

By abstracting the compression/decompression code into class
"CDataStream", compression can be easily applied to any datastream.

The compression library LZO1x-1 does not compress to the extent that
Zlib does, but it is clearly the better performer (particularly as data
sizes get larger) while still providing very good compression (see
Tables 1 and 2).  Furthermore, LZO1x-999 can provide almost Zlib-like
compression for those who wish to have more compression, although at a
higher CPU cost.

==Test Results==

With the LZO library, current test results show up to 20% compression
using LZO1x-1 and up to 27% using LZO1x-999.  In addition there is
a marked performance improvement when there is latency on the network.
From the test results, with a latency of 60ms there is an almost 30%
improvement in performance when comparing LZO1x-1 compressed blocks with
uncompressed blocks (see Table 5).

The following table shows the percentage by which blocks were
compressed, using two Zlib and two LZO1x compression level settings.

TABLE 1:
range = data size range
range           Zlib-1  Zlib-6  LZO1x-1 LZO1x-999
-----------     ------  ------  ------- --------
0-250           12.44   12.86   10.79   14.34
250-500         19.33   12.97   10.34   11.11   
600-700         16.72   n/a     12.91   17.25
700-800         6.37    7.65    4.83    8.07
900-1KB         6.54    6.95    5.64    7.9
1KB-10KB        25.08   25.65   21.21   22.65
10KB-100KB      19.77   21.57   4.37    19.02
100KB-200KB     21.49   23.56   15.37   21.55
200KB-300KB     23.66   24.18   16.91   22.76
300KB-400KB     23.4    23.7    16.5    21.38
400KB-500KB     24.6    24.85   17.56   22.43
500KB-600KB     25.51   26.55   18.51   23.4
600KB-700KB     27.25   28.41   19.91   25.46
700KB-800KB     27.58   29.18   20.26   27.17
800KB-900KB     27      29.11   20      27.4
900KB-1MB       28.19   29.38   21.15   26.43
1MB -2MB        27.41   29.46   21.33   27.73

The following table shows the time in seconds that a block of data takes
to compress at different compression levels.  One can clearly see
that LZO1x-1 is the fastest and is least affected as data sizes get
larger.

TABLE 2:
range = data size range
range           Zlib-1  Zlib-6  LZO1x-1 LZO1x-999
-----------     ------  ------  ------- ---------
0-250           0.001   0       0       0
250-500         0       0       0       0.001
500-1KB         0       0       0       0.001
1KB-10KB        0.001   0.001   0       0.002
10KB-100KB      0.004   0.006   0.001   0.017
100KB-200KB     0.012   0.017   0.002   0.054
200KB-300KB     0.018   0.024   0.003   0.087
300KB-400KB     0.022   0.03    0.003   0.121
400KB-500KB     0.027   0.037   0.004   0.151
500KB-600KB     0.031   0.044   0.004   0.184
600KB-700KB     0.035   0.051   0.006   0.211
700KB-800KB     0.039   0.057   0.006   0.243
800KB-900KB     0.045   0.064   0.006   0.27
900KB-1MB       0.049   0.072   0.006   0.307

TABLE 3:
Compression of Transactions (without concatenation)
range = block size range
ubytes = average size of uncompressed transactions
cbytes = average size of compressed transactions
cmp% = the percentage amount that the transaction was compressed
datapoints = number of datapoints taken

range       ubytes    cbytes    cmp%    datapoints
----------  ------    ------    ------  ----------    
0-250       220       227       -3.16   23780
250-500     356       354       0.68    20882
500-600     534       505       5.29    2772
600-700     653       608       6.95    1853
700-800     757       649       14.22   578
800-900     822       758       7.77    661
900-1KB     954       862       9.69    906
1KB-10KB    2698      2222      17.64   3370
10KB-100KB  15463     12092     21.80   15429

The above table shows that transactions don't compress well below 500
bytes but do very well beyond 1KB, where a great deal of the large
spam-type transactions are found.  However, most transactions fall in
the < 500 byte range, so the next step was to apply concatenation to
those smaller transactions.  Doing that yielded some very good
compression results.  Some examples follow:

The best case observed was 175 transactions concatenated before being
compressed.  That yielded a 20% compression ratio, and that does not
take into account the savings from the unneeded 174 message headers
(24 bytes each) as well as 174 TCP ACKs of 52 bytes each, which yields
an additional 76*174 = 13224 bytes saved, making for an overall
bandwidth savings of 32%:

     2015-11-18 01:09:09.002061 compressed data from 79890 to 67426
txcount:175

However, that was an extreme example.  Most transaction aggregates were
in the 2 to 10 transaction range.  Such as the following:

     2015-11-17 21:08:28.469313 compressed data from 3199 to 2876 txcount:10

But even here the 10% savings was far better than the nothing we would
get without concatenation; add to that the 76 bytes saved for each of
the 9 eliminated messages and we have a total 20% savings in bandwidth
for transactions that otherwise would not be compressible.  Therefore,
concatenating small transactions can also save bandwidth and speed up
the transmission of those transactions through the network while keeping
network and message queue chatter to a minimum.
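
The 32% figure from the 175-transaction example above can be reproduced
directly (a small sketch; the 24-byte header and 52-byte ACK sizes are
the figures quoted in the text, and the helper name is illustrative):

```python
def overall_savings_pct(uncompressed, compressed, tx_count,
                        per_msg_overhead=24 + 52):
    """Payload bytes saved by compression, plus the header/ACK overhead
    of the (tx_count - 1) messages that concatenation eliminates, as a
    percentage of the uncompressed payload size."""
    overhead_saved = (tx_count - 1) * per_msg_overhead
    total_saved = (uncompressed - compressed) + overhead_saved
    return 100.0 * total_saved / uncompressed
```

For the log line above, overall_savings_pct(79890, 67426, 175) gives
roughly 32, matching the quoted overall savings.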

==Choice of Compression library==

LZO was chosen over Zlib.  LZO is the fastest and most scalable option
when used at its lowest compression setting, which will be a performance
boost for users who prefer speed over bandwidth savings. At the higher
end, LZO provides good compression (although at a higher CPU cost)
which approaches that of Zlib.

Other compression libraries investigated were Snappy, LZOf, fastZlib and
LZ4; however, none of these were found to be suitable, either because
they were not portable, lacked the flexibility to set compression
levels, or did not provide a useful compression ratio.

The following two tables show results in seconds for syncing the first
200,000 blocks. Tests were run on a high-speed wireless LAN with very
little latency, and also with 60ms of latency induced using
"Netbalancer".
               
TABLE 4:
Results shown in seconds on highspeed wireless LAN (no induced latency)
Num blks sync'd  Uncmp  Zlib-1  Zlib-6  LZO1x-1  LZO1x-999
---------------  -----  ------  ------  -------  ---------
10000            255    232     233     231      257      
20000            464    414     420     407      453      
30000            677    594     611     585      650      
40000            887    787     795     760      849     
50000            1099   961     977     933      1048   
60000            1310   1145    1167    1110     1259  
70000            1512   1330    1362    1291     1470  
80000            1714   1519    1552    1469     1679   
90000            1917   1707    1747    1650     1882  
100000           2122   1905    1950    1843     2111    
110000           2333   2107    2151    2038     2329  
120000           2560   2333    2376    2256     2580   
130000           2835   2656    2679    2558     2921 
140000           3274   3259    3161    3051     3466   
150000           3662   3793    3547    3440     3919   
160000           4040   4172    3937    3767     4416   
170000           4425   4625    4379    4215     4958   
180000           4860   5149    4895    4781     5560    
190000           5855   6160    5898    5805     6557    
200000           7004   7234    7051    6983     7770   

TABLE 5:
Results shown in seconds with 60ms of induced latency
Num blks sync'd  Uncmp  Zlib-1  Zlib-6  LZO1x-1  LZO1x-999
---------------  -----  ------  ------  -------  ---------
10000            219    299     296     294      291
20000            432    568     565     558      548
30000            652    835     836     819      811
40000            866    1106    1107    1081     1071
50000            1082   1372    1381    1341     1333
60000            1309   1644    1654    1605     1600
70000            1535   1917    1936    1873     1875
80000            1762   2191    2210    2141     2141
90000            1992   2463    2486    2411     2411
100000           2257   2748    2780    2694     2697
110000           2627   3034    3076    2970     2983
120000           3226   3416    3397    3266     3302
130000           4010   3983    3773    3625     3703
140000           4914   4503    4292    4127     4287
150000           5806   4928    4719    4529     4821
160000           6674   5249    5164    4840     5314
170000           7563   5603    5669    5289     6002
180000           8477   6054    6268    5858     6638
190000           9843   7085    7278    6868     7679
200000           11338  8215    8433    8044     8795
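
As a sanity check, the "almost 30%" improvement cited in the Test
Results section can be reproduced from the final row of Table 5:

```python
# Final row of Table 5: 200,000 blocks synced with 60ms induced latency.
uncompressed_s = 11338   # wall-clock seconds, uncompressed
lzo1x1_s = 8044          # wall-clock seconds, LZO1x-1

improvement_pct = 100.0 * (uncompressed_s - lzo1x1_s) / uncompressed_s
```

This evaluates to just over 29%, consistent with the claim.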

==Backward compatibility==

Older clients, which cannot present the correct service bit, will
continue to receive standard uncompressed data and will remain fully
compatible with this change.

==Fallback==

It is important to be able to entirely and easily turn off compression
and decompression as a fall back mechanism. This can be done with a
simple bitcoin.conf setting of "compressionlevel=0". Only one of the two
connected peers needs to set compressionlevel=0 in order to turn off
compression and decompression completely.
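
One way this negotiation could work is sketched below. This assumes a
node with compressionlevel=0 simply withholds the service bit; the bit
number is the one used during testing, and the mechanism and names are
assumptions, not part of the reference code.

```python
NODE_COMPRESSED = 1 << 28   # service bit 28, as used during testing


def advertised_services(compressionlevel: int) -> int:
    # A peer with compressionlevel=0 does not set the bit, so either
    # side can unilaterally disable compression on the link.
    return NODE_COMPRESSED if compressionlevel > 0 else 0


def link_uses_compression(level_a: int, level_b: int) -> bool:
    # Compression is active only when both peers advertise the bit.
    return bool(advertised_services(level_a) & advertised_services(level_b))
```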

==Deployment==

This enhancement does not require a hard or soft fork.

==Service Bit==

During the testing of this implementation, service bit 28 was used;
however, this enhancement will require a permanently assigned service bit.

==Implementation==

This implementation depends on the LZO compression library: lzo-2.09

     https://github.com/ptschip/bitcoin/tree/compress

==Copyright==

This document is placed in the public domain.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions
  2015-11-30 23:12 [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions Peter Tschipper
@ 2015-12-01  5:28 ` Matt Corallo
  2015-12-01 20:06   ` Pavel Janík
                     ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Matt Corallo @ 2015-12-01  5:28 UTC (permalink / raw)
  To: Peter Tschipper, Bitcoin Dev

I'm really not a fan of this at all. To start with, adding a compression library that is directly accessible to the network on financial software is a really, really scary idea. If there were a massive improvement, I'd find it acceptable, but the improvement you've shown really isn't all that much. The numbers you recently posted show it improving the very beginning of IBD somewhat over high-latency connections, but if we're throughput-limited after the very beginning of IBD, we should fix that, not compress the blocks. Additionally, I'd be very surprised if this had any significant effect on the speed at which new blocks traverse the network (do you have any simulations or other thoughts on this?).

All that said, I'd love a proposal that allows clients to download compressed blocks via an external daemon, especially during IBD. This could help people with very restrictive data caps do IBD instead of being pushed to revert to SPV. Additionally, I think we need more chain sync protocols so that the current P2P protocol isn't consensus-critical anymore.

On November 30, 2015 4:12:24 PM MST, Peter Tschipper via bitcoin-dev <bitcoin-dev@lists.linuxfoundation.org> wrote:
>[full quote of the BIP draft above trimmed]



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions
  2015-12-01  5:28 ` Matt Corallo
@ 2015-12-01 20:06   ` Pavel Janík
       [not found]     ` <565E30C6.1010002@bitcartel.com>
  2015-12-02 18:57   ` Emin Gün Sirer
  2015-12-02 23:05   ` Peter Tschipper
  2 siblings, 1 reply; 15+ messages in thread
From: Pavel Janík @ 2015-12-01 20:06 UTC (permalink / raw)
  To: Bitcoin Dev


> On 01 Dec 2015, at 06:28, Matt Corallo via bitcoin-dev <bitcoin-dev@lists.linuxfoundation.org> wrote:
> 
> I'm really not a fan of this at all. To start with, adding a compression library that is directly accessible to the network on financial software is a really, really scary idea.

I have the same opinion.

On the other hand, I can imagine using compression on local blocks storage (be it compressed filesystem, or compression in the user space/in the application - compare with https://github.com/bitcoin/bitcoin/issues/2278). Now that we support pruning and obfuscating, this could be another option. Saving ~20% can be interesting in some usecases.
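
App-level storage compression along those lines could be as simple as
wrapping block file I/O; a minimal sketch using zlib from the Python
stdlib (function names are illustrative, not from any existing patch):

```python
import zlib


def write_block(path, raw_block, level=1):
    # Store the block compressed on disk; per the tables in the draft
    # this trades a little CPU for roughly 20% less storage.
    with open(path, "wb") as f:
        f.write(zlib.compress(raw_block, level))


def read_block(path):
    with open(path, "rb") as f:
        return zlib.decompress(f.read())
```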
--  
Pavel Janík






^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions
       [not found]     ` <565E30C6.1010002@bitcartel.com>
@ 2015-12-02  6:47       ` Pavel Janík
  2015-12-02  7:33         ` Simon Liu
  0 siblings, 1 reply; 15+ messages in thread
From: Pavel Janík @ 2015-12-02  6:47 UTC (permalink / raw)
  To: Simon Liu; +Cc: Bitcoin Dev


> On 02 Dec 2015, at 00:44, Simon Liu <simon@bitcartel.com> wrote:
> 
> Hi Matt/Pavel,
> 
> Why is it scary/undesirable?  Thanks.

Select your preferable compression library and google for it with +CVE.

E.g. in zlib:

http://www.cvedetails.com/vulnerability-list/vendor_id-72/product_id-1820/GNU-Zlib.html

…allows remote attackers to cause a denial of service (crash) via a crafted compressed stream…
…allows remote attackers to cause a denial of service (application crash)…
etc.

Do you want to expose such lib to the potential attacker?
--  
Pavel Janík






^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions
  2015-12-02  6:47       ` Pavel Janík
@ 2015-12-02  7:33         ` Simon Liu
  2015-12-02 18:45           ` Patrick Strateman
  0 siblings, 1 reply; 15+ messages in thread
From: Simon Liu @ 2015-12-02  7:33 UTC (permalink / raw)
  To: Pavel Janík; +Cc: Bitcoin Dev

Hi Pavel,

(my earlier email was moderated, so the list can only see it via your
reply),

Yes, an attacker could try and send malicious data to take advantage of
a compression library vulnerability...  but is it that much worse than
existing attack vectors which might also result in denial of service,
crashes, remote execution?

Peter, perhaps your BIP can look at possible ways to isolate the
decompression phase, such as having incoming compressed blocks be saved
to a quarantine folder and an external process/daemon decompress and
verify the block's hash?
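One minimal way to sketch that quarantine idea (illustrative only: zlib stands in for whatever codec is chosen, and the helper names are invented here) is to run the decompressor in a throwaway child process and verify the double-SHA256 before the node ever touches the bytes:

```python
# Sketch of isolating decompression from the node process: the untrusted
# payload is decompressed in a separate Python child process, so a crash or
# decompression bomb kills the child, not the node. zlib is a stand-in codec.
import hashlib
import subprocess
import sys
from typing import Optional

DECOMPRESS_CHILD = (
    "import sys, zlib; "
    "sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))"
)

def safe_decompress_block(payload: bytes, expected_hash: bytes,
                          timeout: float = 5.0) -> Optional[bytes]:
    """Decompress in a child process, then verify the block's double-SHA256."""
    try:
        proc = subprocess.run([sys.executable, "-c", DECOMPRESS_CHILD],
                              input=payload, capture_output=True,
                              timeout=timeout, check=True)
    except (subprocess.SubprocessError, OSError):
        return None              # child crashed, timed out, or stream corrupt
    raw = proc.stdout
    if hashlib.sha256(hashlib.sha256(raw).digest()).digest() != expected_hash:
        return None              # decompressed bytes don't match the header
    return raw
```

A real implementation would use a long-lived daemon rather than one process per block, but the trust boundary is the same.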

Regards,
Simon


On 12/01/2015 10:47 PM, Pavel Janík wrote:
> 
>> On 02 Dec 2015, at 00:44, Simon Liu <simon@bitcartel.com> wrote:
>>
>> Hi Matt/Pavel,
>>
>> Why is it scary/undesirable?  Thanks.
> 
> Select your preferable compression library and google for it with +CVE.
> 
> E.g. in zlib:
> 
> http://www.cvedetails.com/vulnerability-list/vendor_id-72/product_id-1820/GNU-Zlib.html
> 
> …allows remote attackers to cause a denial of service (crash) via a crafted compressed stream…
> …allows remote attackers to cause a denial of service (application crash)…
> etc.
> 
> Do you want to expose such lib to the potential attacker?
> --  
> Pavel Janík
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions
  2015-12-02  7:33         ` Simon Liu
@ 2015-12-02 18:45           ` Patrick Strateman
  0 siblings, 0 replies; 15+ messages in thread
From: Patrick Strateman @ 2015-12-02 18:45 UTC (permalink / raw)
  To: bitcoin-dev

If compression is to be used, a custom compression algorithm should be
written.

Bitcoin data is largely incompressible outside of a tiny subset of fields.
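That claim is easy to sanity-check with a generic compressor (zlib here, purely as a stand-in): the hash and signature fields that dominate transaction data are high-entropy and yield essentially nothing:

```python
# Quick check of the "largely incompressible" claim: txid-like data (random
# 32-byte hashes) barely compresses at all with a generic DEFLATE coder.
import os
import zlib

hashes = b"".join(os.urandom(32) for _ in range(100))  # 100 txid-sized values
ratio = len(zlib.compress(hashes, 9)) / len(hashes)
assert ratio > 0.95   # generic compression gains essentially nothing here
```

The measured 15-27% gains elsewhere in the thread come from the structured parts of the stream (headers, scripts, repeated prefixes), not from the hashes themselves.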

On 12/01/2015 11:33 PM, Simon Liu via bitcoin-dev wrote:
> Hi Pavel,
>
> (my earlier email was moderated, so the list can only see it via your
> reply),
>
> Yes, an attacker could try and send malicious data to take advantage of
> a compression library vulnerability...  but is it that much worse than
> existing attack vectors which might also result in denial of service,
> crashes, remote execution?
>
> Peter, perhaps your BIP can look at possible ways to isolate the
> decompression phase, such as having incoming compressed blocks be saved
> to a quarantine folder and an external process/daemon decompress and
> verify the block's hash?
>
> Regards,
> Simon
>
>
> On 12/01/2015 10:47 PM, Pavel Janík wrote:
>> [...]



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions
  2015-12-01  5:28 ` Matt Corallo
  2015-12-01 20:06   ` Pavel Janík
@ 2015-12-02 18:57   ` Emin Gün Sirer
  2015-12-02 20:16     ` Peter Tschipper
  2015-12-03 19:14     ` Gavin Andresen
  2015-12-02 23:05   ` Peter Tschipper
  2 siblings, 2 replies; 15+ messages in thread
From: Emin Gün Sirer @ 2015-12-02 18:57 UTC (permalink / raw)
  To: Matt Corallo; +Cc: Bitcoin Dev

[-- Attachment #1: Type: text/plain, Size: 4747 bytes --]

Thanks Peter for the careful, quantitative work.

I want to bring one additional issue to everyone's consideration, related
to the choice of the Lempel-Ziv family of compressors.

While I'm not familiar with every single compression engine tested, the
Lempel-Ziv family of compressors are generally based on "compression
tables." Essentially, they assign a short unique number to every new
subsequence they encounter, and when they re-encounter a sequence like "ab"
in "abcdfdcdabcdfabcdf" they replace it with that short integer (say, in
this case, 9-bit constant 256). So this example sequence may turn into
"abcdfd<258 for cd><256 for ab><258 for cd>f<261 for abc><259 for df>"
which is slightly shorter than the original (I'm doing this off the top of
my head so the counts may be off, but it's meant to be illustrative). Note
that the sequence "abc" got added into the table only after it was
encountered twice in the input.
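The table-growth behaviour described above can be made concrete with a toy LZW coder (a simple member of the LZ family; this is an illustration of the principle, not any codec proposed in the BIP):

```python
# Toy LZW compressor: multi-byte phrases earn a code only after being seen,
# growing one extra byte per encounter -- the slow-learning property the
# email describes.

def lzw_compress(data: bytes):
    table = {bytes([i]): i for i in range(256)}   # start with single bytes
    next_code, out, phrase = 256, [], b""
    for b in data:
        candidate = phrase + bytes([b])
        if candidate in table:
            phrase = candidate             # keep extending a known phrase
        else:
            out.append(table[phrase])      # emit code for longest known phrase
            table[candidate] = next_code   # learn the new, longer phrase
            next_code += 1
            phrase = bytes([b])
    if phrase:
        out.append(table[phrase])
    return out, table

codes, table = lzw_compress(b"abcdfdcdabcdfabcdf")
# 18 input symbols become 12 output codes; the table now holds learned
# phrases such as b"ab", b"cd", and b"abc" -- each acquired only after its
# prefix had already been emitted once, exactly as described above.
```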

This is nice and generic and works well for English text where certain
letter sequences (e.g. "it" "th" "the" "this" "are" "there" etc) are
repeated often, but it is nowhere as compact as it could possibly be for
mostly binary data -- there are opportunities for much better compression,
made possible by the structured reuse of certain byte sequences in the
Bitcoin wire protocol.

On a Bitcoin wire connection, we might see several related transactions
reorganizing cash in a set of addresses, and therefore, several reuses of a
20-byte address. Or we might see a 200-byte transaction get transmitted,
followed by the same transaction, repeated in a block. Ideally, we'd learn
the sequence that may be repeated later on, all at once (e.g. a Bitcoin
address or a transaction), and replace it with a short number, referring
back to the long sequence. In the example above, if we knew that "abcdf"
was a UNIT that would likely be repeated, we would put it into the
compression table as a whole, instead of relying on repetition to get it
into the table one extra byte at a time. That may let us compress the
original sequence down to "abcdfd<257 for cd><256 for abcdf><256 for
abcdf>" from the get go.

Yet the LZ variants I know of will need to see a 200-byte sequence repeated
**199 times** in order to develop a single, reusable, 200-byte long
subsequence in the compression table.
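The headroom being described is easy to demonstrate with zlib's preset-dictionary hook: seeding the compressor with a known transaction mimics a protocol-aware coder that has the whole unit in its table from the start. The byte values below are made up for illustration:

```python
# A generic coder that has never seen a 200-byte tx versus an "aware" coder
# with that tx preloaded as a preset dictionary (zlib zdict). The dictionary
# lets DEFLATE replace the repeated tx with a short back-reference.
import os
import zlib

tx = os.urandom(200)          # stand-in for a high-entropy 200-byte tx
block = b"blockheader" + tx   # the same tx later reappears inside a "block"

def deflate_len(data: bytes, zdict: bytes = b"") -> int:
    if zdict:
        c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, 8,
                             zlib.Z_DEFAULT_STRATEGY, zdict)
    else:
        c = zlib.compressobj(9)
    return len(c.compress(data) + c.flush())

cold = deflate_len(block)        # generic coder: has never seen the tx
warm = deflate_len(block, tx)    # "aware" coder: tx known as a unit
assert warm < cold               # the known unit becomes a back-reference
```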

So, a Bitcoin-specific compressor can perhaps do significantly better, but
is it a good idea? Let's argue both sides.

Cons:

On the one hand, Bitcoin-specific compressors will be closely tied to the
contents of messages, which might make it difficult to change the wire
format later on -- changes to the wire format may need corresponding
changes to the compressor.  If the compressor cannot be implemented
cleanly, then the protocol-agnostic, off-the-shelf compressors have a
maintainability edge, which comes at the expense of the compression ratio.

Another argument is that compression algorithms of any kind should be
tested thoroughly before inclusion, and brand new code may lack the
maturity required. While this argument has some merit, all outputs are
verified separately later on during processing, so
compression/decompression errors can potentially be detected. If the
compressor/decompressor can be structured in a way that isolates bitcoind
from failure (e.g. as a separate process for starters), this concern can be
remedied.

Pros:

The nature of LZ compressors leads me to believe that much higher
compression ratios are possible by building a custom, Bitcoin-aware
compressor. If I had to guess, I would venture that compression ratios of
2X or more are possible in some cases. In some sense, the "O(1) block
propagation" idea that Gavin proposed a while ago can be seen as an
extreme example of a Bitcoin-specific compressor, albeit one that
constrains the order of transactions in a block.

Compression can buy us some additional throughput at zero cost, modulo
code complexity. Given the amount of acrimonious debate over the block
size we have all had to endure, it seems criminal to leave potentially
free improvements on the table. Even if the resulting code is deemed
too complex to include in the production client right now, it would be
good to understand the potential for improvement.

How to Do It

If we want to compress Bitcoin, a programming challenge/contest would be
one of the best ways to find the best possible, Bitcoin-specific
compressor. This is the kind of self-contained exercise that bright young
hackers love to tackle. It'd bring new programmers into the
and many of us would love to discover the limits of compressibility for
Bitcoin bits on a wire. And the results would be interesting even if the
final compression engine is not enabled by default, or not even merged.

[-- Attachment #2: Type: text/html, Size: 6598 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions
  2015-12-02 18:57   ` Emin Gün Sirer
@ 2015-12-02 20:16     ` Peter Tschipper
  2015-12-02 22:23       ` Matt Corallo
  2015-12-03 19:14     ` Gavin Andresen
  1 sibling, 1 reply; 15+ messages in thread
From: Peter Tschipper @ 2015-12-02 20:16 UTC (permalink / raw)
  To: Emin Gün Sirer; +Cc: Bitcoin Dev

[-- Attachment #1: Type: text/plain, Size: 5548 bytes --]

Building a compressor from scratch may yield better compression
ratios, or it may not, but having trust and faith in whether it will
stand up against attack vectors is another matter.  LZO has been
around for 20 years with very few problems and no current issues.
Maybe something better can be built, but when, and how much testing
will need to be done before it can be trusted?  Right now there is
something that provides a benefit, and if something better is found in
the future it's not that difficult to add it.  We could easily support
multiple compression libraries.


On 02/12/2015 10:57 AM, Emin Gün Sirer wrote:
> Thanks Peter for the careful, quantitative work.
> [...]


[-- Attachment #2: Type: text/html, Size: 9853 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions
  2015-12-02 20:16     ` Peter Tschipper
@ 2015-12-02 22:23       ` Matt Corallo
  2015-12-02 23:02         ` Peter Tschipper
  0 siblings, 1 reply; 15+ messages in thread
From: Matt Corallo @ 2015-12-02 22:23 UTC (permalink / raw)
  To: Peter Tschipper, Emin Gün Sirer; +Cc: Bitcoin Dev

My issue is more that it's additional complexity and attack surface,
and for a very minor gain which should disappear with further
optimization elsewhere, and less that we absolutely shouldn't add
compression because we're definitely gonna have issues.

On 12/02/15 20:16, Peter Tschipper via bitcoin-dev wrote:
> Building a compressor from scratch may yeild some better compression
> ratios, or not, but having trust and faith in whether it will stand up
> against attack vectors another matter.  LZO has been around for 20 years
> with very few problems and no current issues.  Maybe something better
> can be built, but when and how much testing will need to be done before
> it can be trusted?  Right now there is something that provides a benefit
> and in the future if something better is found it's not that difficult
> to add it.  We could easily support multiple compression libraries.
>
>
> On 02/12/2015 10:57 AM, Emin Gün Sirer wrote:
>> [...]


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions
  2015-12-02 22:23       ` Matt Corallo
@ 2015-12-02 23:02         ` Peter Tschipper
  2015-12-04 13:30           ` Matt Corallo
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Tschipper @ 2015-12-02 23:02 UTC (permalink / raw)
  To: Matt Corallo; +Cc: Bitcoin Dev

On 02/12/2015 2:23 PM, Matt Corallo wrote:
> My issue is more that it's additional complexity and attack surface,
> and for a very minor gain
What is a minor gain?  15 to 27% compression sounds good to me, and
the larger the data, the better the compression.  And although there
is a decent performance gain in proportion to the % of compression,
the original motivation of the BIP was to reduce bandwidth for users
in regions where they are subject to caps.
> which should disappear with further optimization elsewhere 
Why would the benefit of compressing data disappear with further
optimizations elsewhere?  I'm not following you.  The compression of
data mainly has benefit in the sending of packets over the network.  I
would think the performance gain would be cumulative.  Why would this
go away by optimizing elsewhere?

> and less that we absolutely shouldn't add compression because we're
> definitely gonna have issues.
It's not that difficult to add compression.  Even if there was an issue,
the compression feature can be completely turned off. 

>
> On 12/02/15 20:16, Peter Tschipper via bitcoin-dev wrote:
>> [...]



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions
  2015-12-01  5:28 ` Matt Corallo
  2015-12-01 20:06   ` Pavel Janík
  2015-12-02 18:57   ` Emin Gün Sirer
@ 2015-12-02 23:05   ` Peter Tschipper
  2015-12-03  5:52     ` Dave Scotese
  2 siblings, 1 reply; 15+ messages in thread
From: Peter Tschipper @ 2015-12-02 23:05 UTC (permalink / raw)
  To: Matt Corallo, Bitcoin Dev

[-- Attachment #1: Type: text/plain, Size: 3164 bytes --]


On 30/11/2015 9:28 PM, Matt Corallo wrote:
> I'm really not a fan of this at all. To start with, adding a compression library that is directly accessible to the network on financial software is a really, really scary idea. 
Why scary?  LZO has no current security issues, and it will be
configurable by each node operator, so it can be turned off completely
if needed or desired.
> If there were a massive improvement, I'd find it acceptable, but the improvement you've shown really isn't all that much.
Why is 15% at the low end to 27% at the high end not good?  It sounds
like a very good boost.
>  The numbers you recently posted show it improving the very beginning of IBD somewhat over high-latency connections, but if we're throughput-limited after the very beginning of IBD, we should fix that, not compress the blocks. 
I only did the compression up to block 200,000 to better isolate the
transmission of data from the post-processing of blocks and determine
whether the compressing of data was adding too much to the total
transmission time.

I think it's clear from the data that as the data (blocks,
transactions) increase in size, (1) they compress better and (2) they
have a bigger and more positive impact on performance when compressed.

> Additionally, I'd be very surprised if this had any significant effect on the speed at which new blocks traverse the network (do you have any simulations or other thoughts on this?).
From the table below, at 120000 blocks the time to sync the chain was
roughly the same for compressed vs. uncompressed; however, after that
point, as block sizes start increasing, all compression libraries
performed much faster than uncompressed. The data provided in this
testing clearly shows that as block size increases, the performance
improvement from compressing data also increases.

TABLE 5:
Results shown in seconds with 60ms of induced latency
Num blks sync'd  Uncmp  Zlib-1  Zlib-6  LZO1x-1  LZO1x-999
---------------  -----  ------  ------  -------  ---------
120000           3226   3416    3397    3266     3302
130000           4010   3983    3773    3625     3703
140000           4914   4503    4292    4127     4287
150000           5806   4928    4719    4529     4821
160000           6674   5249    5164    4840     5314
170000           7563   5603    5669    5289     6002
180000           8477   6054    6268    5858     6638
190000           9843   7085    7278    6868     7679
200000           11338  8215    8433    8044     8795


As for what happens after the block is received: obviously compression
isn't going to help in post-processing and validating the block, but
in the pure transmission of the object it most certainly and logically
does, in fairly direct proportion to the file size (a file that is 20%
smaller will be transmitted "at least" 20% faster; you can use any
data transfer time calculator
<http://www.calctool.org/CALC/prof/computing/transfer_time> for that).
The only issues I could see that required testing were how much
compression there would be, and how much time compressing the data
would add to sending it.
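The proportionality claim above can be checked with a little arithmetic. The sketch below uses a hypothetical block size and link speed purely to illustrate that wire time scales with payload size, and then relates that to the Table 5 numbers:

```python
# A payload's serialization time on a link is size / bandwidth, so a payload
# that compresses to 80% of its original size needs at least 20% less wire
# time -- before accounting for compression CPU cost or propagation latency.
# The block size and link speed below are hypothetical, chosen only to
# illustrate the proportionality.

def transfer_time_s(size_bytes: float, bandwidth_bps: float) -> float:
    """Idealized time to push size_bytes through a link, ignoring latency."""
    return size_bytes * 8 / bandwidth_bps

block_bytes = 750_000              # hypothetical ~0.75 MB block
link_bps = 10_000_000              # hypothetical 10 Mbit/s link

t_uncompressed = transfer_time_s(block_bytes, link_bps)
t_compressed = transfer_time_s(block_bytes * 0.80, link_bps)  # 20% smaller
assert abs(1 - t_compressed / t_uncompressed - 0.20) < 1e-12

# The sync numbers in Table 5 are in the same ballpark: at 200,000 blocks,
# LZO1x-1 finished in 8044 s vs. 11338 s uncompressed.
savings_pct = (11338 - 8044) / 11338 * 100
assert 29 < savings_pct < 30
```

In practice the measured savings are somewhat below the raw compression ratio because compression itself costs CPU time, which is exactly what the sync tests were designed to capture.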

 

[-- Attachment #2: Type: text/html, Size: 4499 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions
  2015-12-02 23:05   ` Peter Tschipper
@ 2015-12-03  5:52     ` Dave Scotese
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Scotese @ 2015-12-03  5:52 UTC (permalink / raw)
  To: Peter Tschipper; +Cc: Bitcoin Dev

[-- Attachment #1: Type: text/plain, Size: 7579 bytes --]

Emin's email presents to me the idea of dictionaries that already contain
the data we'd want to compress.  With 8 bytes of indexing data, we can
refer to a TxID or a Public Key or any existing part of the blockchain.
There are also data sequences like scripts that contain a few variable
chunks and are otherwise identical.  Often, the receiver has the
blockchain, which contains a lot of the data that is in the message being
transmitted.

First, the receiver must indicate that compressed data is preferred and
the height of the latest valid block it holds, and the sender must
express the ability to send compressed data.  From this state, the
sender sends messages that are compressed.  Compressed messages are the
same as uncompressed messages except that:

   1. Data read is copied into the decompressed message until the first
   occurrence of 0x00, which is discarded and is followed by compressed data.
   2. Compressed data can use as a dictionary the first 16,777,215 blocks,
   or the last 4,244,635,647 ending with the block at the tip of the
   receiver's chain, or it can specify a run of zero bytes.  The sender and
   receiver must agree on the *receiver's* current block height in order to
   use the last 4B blocks as the dictionary.
   3. Within compressed data, the first byte identifies how to decompress:
      1. 0xFF indicates that the following three bytes are a block height
      with most significant byte 0x00 in network byte order.
      2. 0xFE indicates that the following byte indicates how many zero
      bytes to add to the decompressed data.
      3. 0xFD is an error, so compressed messages are turned off and the
      recipient fails the decompression process.
      4. 0x00 indicates that the zero byte by itself should be added to the
      decompressed data, and the data following is not compressed (return
      to step 1).
      5. All other values represent the most significant byte of a number
      to be subtracted from the receiver's current block height to identify
      a block height (not available until there are at least 16,777,216
      blocks, so that this byte can be at least 0x01, since 0x00 would
      indicate a single zero byte, end compressed data, and return to
      step 1).
   4. If decompression has identified a block height (previous byte was not
   0xFD, 0x00, or 0xFE), then the next four bytes identify a *size* (one
   byte) and a byte index into the block's data (three bytes), and *size*
   bytes from that block are added to the decompressed data.
   5. Steps 3 and 4 process a chunk of compressed data.  If the next byte
   is 0x00, then decompression goes back to step 1 as described in step
   3.4 (add raw bytes until it hits a 0x00).  Otherwise, it proceeds
   through steps 3 (and maybe 4) again.

In Step 3.3, 0xFD causes an error, but it could be used to indicate a
parameterized dictionary entry; for example, 0xFD, 0x01 followed by
eight more bytes to be interpreted according to steps 3.1 or 3.5 could
mean OP_DUP OP_HASH160 (20 bytes from the blockchain dictionary)
OP_EQUALVERIFY OP_CHECKSIG, replacing that very common occurrence of 24
bytes with 10 bytes.  Well, 11 if you include the 0x00 required by step
5.  But that only works on addresses that have spent inputs.  Or 0xFD,
0x02 could be shorthand for the four zeroes of lock_time, followed by
Version (1), followed by 0x01 (for one-input transactions), turning
nine bytes into two for the data at the end of a normal (lock_time = 0)
Txn and the beginning of a single-input Txn.  But I left 0xFD as an
error because those gains didn't seem as frequent as the others.
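For concreteness, the steps above can be sketched as a decoder. This is only an illustrative reading of the proposal, with some assumptions: the relative-height form (step 3.5) is taken to be the tag byte plus three following bytes forming a 32-bit offset below the receiver's tip, a lone 0x00 tag is treated as the exit from compressed mode per step 3.4, and `get_block(height)` is a hypothetical helper returning a block's raw bytes:

```python
def decompress(msg: bytes, get_block, tip_height: int) -> bytes:
    """Illustrative decoder for the sketch above; get_block is a
    hypothetical callback returning a block's raw bytes by height."""
    out = bytearray()
    i, n = 0, len(msg)
    while i < n:
        # Step 1: copy raw bytes until the first 0x00, which is discarded
        # and marks the start of compressed data.
        while i < n and msg[i] != 0x00:
            out.append(msg[i])
            i += 1
        i += 1
        # Steps 3-4: decode compressed chunks until a lone 0x00 tag
        # returns us to raw mode (step 3.4).
        while i < n:
            tag = msg[i]
            i += 1
            if tag == 0x00:              # step 3.4: literal zero, back to raw
                out.append(0x00)
                break
            if tag == 0xFD:              # step 3.3: reserved -> error
                raise ValueError("0xFD is reserved in this sketch")
            if tag == 0xFE:              # step 3.2: run of zero bytes
                out.extend(b"\x00" * msg[i])
                i += 1
                continue
            if tag == 0xFF:              # step 3.1: absolute 24-bit height
                height = int.from_bytes(msg[i:i + 3], "big")
            else:                        # step 3.5: 32-bit offset below tip
                height = tip_height - int.from_bytes(
                    bytes([tag]) + msg[i:i + 3], "big")
            i += 3
            # Step 4: one size byte, then a 3-byte index into that block.
            size = msg[i]
            index = int.from_bytes(msg[i + 1:i + 4], "big")
            i += 4
            out.extend(get_block(height)[index:index + size])
    return bytes(out)

# Toy example: "ab", then 3 bytes copied from offset 6 of block 5, a run of
# two zeros, a literal zero, and raw "cd".
blocks = {5: b"hello world block data"}
msg = (b"ab\x00"
       + bytes([0xFF, 0x00, 0x00, 0x05])   # block height 5
       + bytes([0x03, 0x00, 0x00, 0x06])   # 3 bytes at offset 6 -> b"wor"
       + bytes([0xFE, 0x02])               # two zero bytes
       + b"\x00"                           # literal zero, back to raw mode
       + b"cd")
assert decompress(msg, lambda h: blocks[h], tip_height=20_000_000) \
    == b"abwor\x00\x00\x00cd"
```

The scheme's bandwidth win comes entirely from how often the receiver's chain actually contains the byte runs being sent, which is the open question a prototype like this would help measure.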

Dave.

On Wed, Dec 2, 2015 at 3:05 PM, Peter Tschipper via bitcoin-dev <
bitcoin-dev@lists.linuxfoundation.org> wrote:

>
> On 30/11/2015 9:28 PM, Matt Corallo wrote:
>
> I'm really not a fan of this at all. To start with, adding a compression library that is directly accessible to the network on financial software is a really, really scary idea.
>
> Why scary?  LZO has no current security issues, and it will be
> configurable by each node operator so it can be turned off completely if
> needed or desired.
>
> If there were a massive improvement, I'd find it acceptable, but the improvement you've shown really isn't all that much.
>
> Why is 15% at the low end to 27% at the high end not good?  It sounds
> like a very good boost.
>
>  The numbers you recently posted show it improving the very beginning of IBD somewhat over high-latency connections, but if we're throughput-limited after the very beginning of IBD, we should fix that, not compress the blocks.
>
> I only did the compression up to block 200,000 to better isolate the
> transmission of data from the post-processing of blocks and determine
> whether compressing the data was adding too much to the total
> transmission time.
>
> I think it's clear from the data that as the objects (blocks,
> transactions) increase in size, (1) they compress better and (2)
> compression has a bigger, positive impact on performance.
>
> Additionally, I'd be very surprised if this had any significant effect on the speed at which new blocks traverse the network (do you have any simulations or other thoughts on this?).
>
> From the table below, at 120000 blocks the time to sync the chain was
> roughly the same for compressed vs. uncompressed; however, after that
> point, as block sizes start increasing, all compression libraries
> performed much faster than uncompressed. The data provided in this
> testing clearly shows that as block size increases, the performance
> improvement from compressing data also increases.
>
> TABLE 5:
> Results shown in seconds with 60ms of induced latency
> Num blks sync'd  Uncmp  Zlib-1  Zlib-6  LZO1x-1  LZO1x-999
> ---------------  -----  ------  ------  -------  ---------
> 120000           3226   3416    3397    3266     3302
> 130000           4010   3983    3773    3625     3703
> 140000           4914   4503    4292    4127     4287
> 150000           5806   4928    4719    4529     4821
> 160000           6674   5249    5164    4840     5314
> 170000           7563   5603    5669    5289     6002
> 180000           8477   6054    6268    5858     6638
> 190000           9843   7085    7278    6868     7679
> 200000           11338  8215    8433    8044     8795
>
>
> As for what happens after the block is received: obviously compression
> isn't going to help in post-processing and validating the block, but in
> the pure transmission of the object it most certainly and logically
> does, in fairly direct proportion to the file size (a file that is 20%
> smaller will be transmitted "at least" 20% faster; you can use any data
> transfer time calculator
> <http://www.calctool.org/CALC/prof/computing/transfer_time> for that).
> The only issues I could see that required testing were how much
> compression there would be, and how much time compressing the data
> would add to sending it.
>
>
>
> _______________________________________________
> bitcoin-dev mailing list
> bitcoin-dev@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev
>
>


-- 
I like to provide some work at no charge to prove my value. Do you need a
techie?
I own Litmocracy <http://www.litmocracy.com> and Meme Racing
<http://www.memeracing.net> (in alpha).
I'm the webmaster for The Voluntaryist <http://www.voluntaryist.com> which
now accepts Bitcoin.
I also code for The Dollar Vigilante <http://dollarvigilante.com/>.
"He ought to find it more profitable to play by the rules" - Satoshi
Nakamoto

[-- Attachment #2: Type: text/html, Size: 9306 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions
  2015-12-02 18:57   ` Emin Gün Sirer
  2015-12-02 20:16     ` Peter Tschipper
@ 2015-12-03 19:14     ` Gavin Andresen
  2015-12-03 23:07       ` Rusty Russell
  1 sibling, 1 reply; 15+ messages in thread
From: Gavin Andresen @ 2015-12-03 19:14 UTC (permalink / raw)
  To: Emin Gün Sirer; +Cc: Bitcoin Dev

[-- Attachment #1: Type: text/plain, Size: 1304 bytes --]

On Wed, Dec 2, 2015 at 1:57 PM, Emin Gün Sirer <
bitcoin-dev@lists.linuxfoundation.org> wrote:

> How to Do It
>
> If we want to compress Bitcoin, a programming challenge/contest would be
> one of the best ways to find the best possible, Bitcoin-specific
> compressor. This is the kind of self-contained exercise that bright young
> hackers love to tackle. It'd bring in new programmers into the ecosystem,
> and many of us would love to discover the limits of compressibility for
> Bitcoin bits on a wire. And the results would be interesting even if the
> final compression engine is not enabled by default, or not even merged.
>

I love this idea. Let's build a standardized data set to test against
using real data from the network (has anybody done this yet?).

Something like:

Starting network topology:
list of:  nodeid, nodeid, network latency between the two peers

Changes to network topology:
list of:  nodeid, add/remove nodeid, time of change

Transaction broadcasts:
list of :  transaction, node id that first broadcast, time first broadcast

Block broadcasts:
list of :  block, node id that first broadcast, time first broadcast

Proposed transaction/block optimizations could then be measured against
this standard data set.
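Gavin's four record types could be sketched as simple schema definitions. All field names and units below are illustrative assumptions, not a settled format:

```python
from dataclasses import dataclass

# One possible flat record layout for the proposed standardized data set.
# Field names, units, and types are illustrative only.

@dataclass
class Link:
    node_a: int
    node_b: int
    latency_ms: float      # measured network latency between the two peers

@dataclass
class TopologyChange:
    node: int
    peer: int
    added: bool            # True = connection added, False = removed
    time: float            # seconds since the start of the capture

@dataclass
class TxBroadcast:
    tx_hex: str            # raw transaction
    origin_node: int       # node that first broadcast it
    time: float            # time first broadcast

@dataclass
class BlockBroadcast:
    block_hex: str         # raw block
    origin_node: int       # node that first broadcast it
    time: float            # time first broadcast

# A candidate compressor or relay optimization would be replayed against
# these records and scored on bytes sent and propagation delay.
link = Link(node_a=1, node_b=2, latency_ms=60.0)
assert link.latency_ms == 60.0
```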


-- 
--
Gavin Andresen

[-- Attachment #2: Type: text/html, Size: 2468 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions
  2015-12-03 19:14     ` Gavin Andresen
@ 2015-12-03 23:07       ` Rusty Russell
  0 siblings, 0 replies; 15+ messages in thread
From: Rusty Russell @ 2015-12-03 23:07 UTC (permalink / raw)
  To: Gavin Andresen, Emin Gün Sirer; +Cc: Bitcoin Dev

Gavin Andresen via bitcoin-dev <bitcoin-dev@lists.linuxfoundation.org>
writes:
> On Wed, Dec 2, 2015 at 1:57 PM, Emin Gün Sirer <
> bitcoin-dev@lists.linuxfoundation.org> wrote:
>
>> How to Do It
>>
>> If we want to compress Bitcoin, a programming challenge/contest would be
>> one of the best ways to find the best possible, Bitcoin-specific
>> compressor. This is the kind of self-contained exercise that bright young
>> hackers love to tackle. It'd bring in new programmers into the ecosystem,
>> and many of us would love to discover the limits of compressibility for
>> Bitcoin bits on a wire. And the results would be interesting even if the
>> final compression engine is not enabled by default, or not even merged.
>>
>
> I love this idea. Lets build a standardized data set to test against using
> real data from the network (has anybody done this yet?).

https://github.com/rustyrussell/bitcoin-corpus

It includes mempool contents and tx receipt logs for 1 week across 4
nodes.  I vaguely plan to update it every year.

A more ambitious version would add some topology information, but we
need to figure out some anonymization strategy for the data.

Cheers,
Rusty.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions
  2015-12-02 23:02         ` Peter Tschipper
@ 2015-12-04 13:30           ` Matt Corallo
  0 siblings, 0 replies; 15+ messages in thread
From: Matt Corallo @ 2015-12-04 13:30 UTC (permalink / raw)
  To: Peter Tschipper; +Cc: Bitcoin Dev



On December 3, 2015 7:02:20 AM GMT+08:00, Peter Tschipper <peter.tschipper@gmail.com> wrote:
>On 02/12/2015 2:23 PM, Matt Corallo wrote:
>> My issue is more that its additional complexity and attack surface,
>> and for a very minor gain 
>What is a minor gain?  15 to 27% compression sounds good to me, and the
>larger the data the better the compression.  And although there is a
>decent performance gain in proportion to the % of compression, the
>original motivation of the BIP was to reduce bandwidth for users in
>regions where they are subject to caps.

Ok. It wasn't clear to me that you weren't also claiming at latency reduction as a result. In any case, the point I was making is that the p2p protocol isn't for every use-case. Indeed, I agree (as noted previously) that we should support people who have very restrictive data usage limits, but I don't think we need to do this in the p2p protocol. Considering we're in desperate need of more ways to sync, supporting syncing over slow and/or very restrictive connections is something maybe better addressed by a sync-over-http-via-cdn protocol than the p2p protocol.

>> which should disappear with further optimization elsewhere 
>Why would the benefit of compressing data disappear with further
>optimizations elsewhere?  I'm not following you.  The compression of
>data mainly has benefit in the sending of packets over the network.  I
>would think the performance gain would be cumulative.  Why would this
>go away by optimizing elsewhere?

My point is that, with limited further optimization, and especially after the first hundred thousand blocks, block download should nearly never be the thing limiting IBD speed.

>> and less that we absolutely shouldn't add compression because we're
>> definitely gonna have issues.
>It's not that difficult to add compression.  Even if there were an
>issue, the compression feature can be completely turned off.

No matter how easily you can implement something, complexity always has cost. This is especially true in complicated, incredibly security critical applications exposed to the internet.

>>
>> On 12/02/15 20:16, Peter Tschipper via bitcoin-dev wrote:
>>> Building a compressor from scratch may yield some better compression
>>> ratios, or not, but having trust and faith in whether it will stand up
>>> against attack vectors is another matter.  LZO has been around for 20
>>> years with very few problems and no current issues.  Maybe something
>>> better can be built, but when, and how much testing will need to be
>>> done before it can be trusted?  Right now there is something that
>>> provides a benefit, and in the future if something better is found
>>> it's not that difficult to add it.  We could easily support multiple
>>> compression libraries.
>>>
>>>
>>> On 02/12/2015 10:57 AM, Emin Gün Sirer wrote:
>>>> Thanks Peter for the careful, quantitative work.
>>>>
>>>> I want to bring one additional issue to everyone's consideration,
>>>> related to the choice of the Lempel-Ziv family of compressors.
>>>>
>>>> While I'm not familiar with every single compression engine tested,
>>>> the Lempel-Ziv family of compressors are generally based on
>>>> "compression tables." Essentially, they assign a short unique number
>>>> to every new subsequence they encounter, and when they re-encounter a
>>>> sequence like "ab" in "abcdfdcdabcdfabcdf" they replace it with that
>>>> short integer (say, in this case, 9-bit constant 256). So this example
>>>> sequence may turn into "abcdfd<258 for cd><256 for ab><258 for
>>>> cd>f<261 for abc><259 for df>" which is slightly shorter than the
>>>> original (I'm doing this off the top of my head so the counts may be
>>>> off, but it's meant to be illustrative). Note that the sequence "abc"
>>>> got added into the table only after it was encountered twice in the
>>>> input.
>>>>
>>>> This is nice and generic and works well for English text where
>>>> certain letter sequences (e.g. "it" "th" "the" "this" "are" "there"
>>>> etc) are repeated often, but it is nowhere as compact as it could
>>>> possibly be for mostly binary data -- there are opportunities for
>>>> much better compression, made possible by the structured reuse of
>>>> certain byte sequences in the Bitcoin wire protocol.
>>>>
>>>> On a Bitcoin wire connection, we might see several related
>>>> transactions reorganizing cash in a set of addresses, and therefore,
>>>> several reuses of a 20-byte address. Or we might see a 200-byte
>>>> transaction get transmitted, followed by the same transaction,
>>>> repeated in a block. Ideally, we'd learn the sequence that may be
>>>> repeated later on, all at once (e.g. a Bitcoin address or a
>>>> transaction), and replace it with a short number, referring back to
>>>> the long sequence. In the example above, if we knew that "abcdf" was
>>>> a UNIT that would likely be repeated, we would put it into the
>>>> compression table as a whole, instead of relying on repetition to get
>>>> it into the table one extra byte at a time. That may let us compress
>>>> the original sequence down to "abcdfd<257 for cd><256 for abcdf><256
>>>> for abcdf>" from the get go.
>>>>
>>>> Yet the LZ variants I know of will need to see a 200-byte sequence
>>>> repeated **199 times** in order to develop a single, reusable,
>>>> 200-byte long subsequence in the compression table.
>>>>
>>>> So, a Bitcoin-specific compressor can perhaps do significantly
>>>> better, but is it a good idea? Let's argue both sides.
>>>>
>>>> Cons:
>>>>
>>>> On the one hand, Bitcoin-specific compressors will be closely tied to
>>>> the contents of messages, which might make it difficult to change the
>>>> wire format later on -- changes to the wire format may need
>>>> corresponding changes to the compressor.  If the compressor cannot be
>>>> implemented cleanly, then the protocol-agnostic, off-the-shelf
>>>> compressors have a maintainability edge, which comes at the expense
>>>> of the compression ratio.
>>>>
>>>> Another argument is that compression algorithms of any kind should
>>>> be tested thoroughly before inclusion, and brand new code may lack
>>>> the maturity required. While this argument has some merit, all
>>>> outputs are verified separately later on during processing, so
>>>> compression/decompression errors can potentially be detected. If the
>>>> compressor/decompressor can be structured in a way that isolates
>>>> bitcoind from failure (e.g. as a separate process for starters), this
>>>> concern can be remedied.
>>>>
>>>> Pros:
>>>>
>>>> The nature of LZ compressors leads me to believe that much higher
>>>> compression ratios are possible by building a custom, Bitcoin-aware
>>>> compressor. If I had to guess, I would venture that compression
>>>> ratios of 2X or more are possible in some cases. In some sense, the
>>>> "O(1) block propagation" idea that Gavin proposed a while ago can be
>>>> seen as an extreme example of a Bitcoin-specific compressor, albeit
>>>> one that constrains the order of transactions in a block.
>>>>
>>>> Compression can buy us some additional throughput at zero cost,
>>>> modulo code complexity.  Given the amount of acrimonious debate over
>>>> the block size we have all had to endure, it seems criminal to leave
>>>> potentially free improvements on the table. Even if the resulting
>>>> code is deemed too complex to include in the production client right
>>>> now, it would be good to understand the potential for improvement.
>>>>
>>>> How to Do It
>>>>
>>>> If we want to compress Bitcoin, a programming challenge/contest
>>>> would be one of the best ways to find the best possible,
>>>> Bitcoin-specific compressor. This is the kind of self-contained
>>>> exercise that bright young hackers love to tackle. It'd bring in new
>>>> programmers into the ecosystem, and many of us would love to discover
>>>> the limits of compressibility for Bitcoin bits on a wire. And the
>>>> results would be interesting even if the final compression engine is
>>>> not enabled by default, or not even merged.
>>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> bitcoin-dev mailing list
>>> bitcoin-dev@lists.linuxfoundation.org
>>> https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev
>>>
>>



^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-12-04 13:30 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-30 23:12 [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and Transactions Peter Tschipper
2015-12-01  5:28 ` Matt Corallo
2015-12-01 20:06   ` Pavel Janík
     [not found]     ` <565E30C6.1010002@bitcartel.com>
2015-12-02  6:47       ` Pavel Janík
2015-12-02  7:33         ` Simon Liu
2015-12-02 18:45           ` Patrick Strateman
2015-12-02 18:57   ` Emin Gün Sirer
2015-12-02 20:16     ` Peter Tschipper
2015-12-02 22:23       ` Matt Corallo
2015-12-02 23:02         ` Peter Tschipper
2015-12-04 13:30           ` Matt Corallo
2015-12-03 19:14     ` Gavin Andresen
2015-12-03 23:07       ` Rusty Russell
2015-12-02 23:05   ` Peter Tschipper
2015-12-03  5:52     ` Dave Scotese

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox