From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <lf-lists@mattcorallo.com>
Received: from smtp1.linuxfoundation.org (smtp1.linux-foundation.org
	[172.17.192.35])
	by mail.linuxfoundation.org (Postfix) with ESMTPS id AB62F892
	for <bitcoin-dev@lists.linuxfoundation.org>;
	Wed,  2 Dec 2015 22:23:51 +0000 (UTC)
X-Greylist: from auto-whitelisted by SQLgrey-1.7.6
Received: from mail.bluematt.me (mail.bluematt.me [192.241.179.72])
	by smtp1.linuxfoundation.org (Postfix) with ESMTPS id E8F44152
	for <bitcoin-dev@lists.linuxfoundation.org>;
	Wed,  2 Dec 2015 22:23:50 +0000 (UTC)
Received: from [172.17.0.2] (gw.vpn.bluematt.me [162.243.132.6])
	by mail.bluematt.me (Postfix) with ESMTPSA id 3CC655A6B7;
	Wed,  2 Dec 2015 22:23:49 +0000 (UTC)
To: Peter Tschipper <peter.tschipper@gmail.com>,
	=?UTF-8?Q?Emin_G=c3=bcn_Sirer?= <el33th4x0r@gmail.com>
References: <565CD7D8.3070102@gmail.com>
	<90EF4E6C-9A71-4A35-A938-EAFC1A24DD24@mattcorallo.com>
	<CAPkFh0t9SwVOLrPnL7z80s-Rriezhqxn_3vXKYRxr6JVGNiUZQ@mail.gmail.com>
	<565F5193.1070802@gmail.com>
From: Matt Corallo <lf-lists@mattcorallo.com>
Message-ID: <565F6F73.5050906@mattcorallo.com>
Date: Wed, 2 Dec 2015 22:23:47 +0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
	Thunderbird/38.3.0
MIME-Version: 1.0
In-Reply-To: <565F5193.1070802@gmail.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	smtp1.linux-foundation.org
Cc: Bitcoin Dev <bitcoin-dev@lists.linuxfoundation.org>
Subject: Re: [bitcoin-dev] [BIP Draft] Datastream compression of Blocks and
 Transactions
X-BeenThere: bitcoin-dev@lists.linuxfoundation.org
X-Mailman-Version: 2.1.12
Precedence: list
List-Id: Bitcoin Development Discussion <bitcoin-dev.lists.linuxfoundation.org>
List-Unsubscribe: <https://lists.linuxfoundation.org/mailman/options/bitcoin-dev>,
	<mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=unsubscribe>
List-Archive: <http://lists.linuxfoundation.org/pipermail/bitcoin-dev/>
List-Post: <mailto:bitcoin-dev@lists.linuxfoundation.org>
List-Help: <mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=help>
List-Subscribe: <https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev>,
	<mailto:bitcoin-dev-request@lists.linuxfoundation.org?subject=subscribe>
X-List-Received-Date: Wed, 02 Dec 2015 22:23:51 -0000

My issue is more that its additional complexity and attack surface, and 
for a very minor gain which should disappear with further optimization 
elsewhere and less that we absolutely shouldn't add compression because 
we're definitely gonna have issues.

On 12/02/15 20:16, Peter Tschipper via bitcoin-dev wrote:
> Building a compressor from scratch may yeild some better compression
> ratios, or not, but having trust and faith in whether it will stand up
> against attack vectors another matter.  LZO has been around for 20 years
> with very few problems and no current issues.  Maybe something better
> can be built, but when and how much testing will need to be done before
> it can be trusted?  Right now there is something that provides a benefit
> and in the future if something better is found it's not that difficult
> to add it.  We could easily support multiple compression libraries.
>
>
> On 02/12/2015 10:57 AM, Emin Gün Sirer wrote:
>> Thanks Peter for the careful, quantitative work.
>>
>> I want to bring one additional issue to everyone's consideration,
>> related to the choice of the Lempel-Ziv family of compressors.
>>
>> While I'm not familiar with every single compression engine tested,
>> the Lempel-Ziv family of compressors are generally based on
>> "compression tables." Essentially, they assign a short unique number
>> to every new subsequence they encounter, and when they re-encounter a
>> sequence like "ab" in "abcdfdcdabcdfabcdf" they replace it with that
>> short integer (say, in this case, 9-bit constant 256). So this example
>> sequence may turn into "abcdfd<258 for cd><256 for ab><258 for
>> cd>f<261 for abc><259 for df>" which is slightly shorter than the
>> original (I'm doing this off the top of my head so the counts may be
>> off, but it's meant to be illustrative). Note that the sequence "abc"
>> got added into the table only after it was encountered twice in the
>> input.
>>
>> This is nice and generic and works well for English text where certain
>> letter sequences (e.g. "it" "th" "the" "this" "are" "there" etc) are
>> repeated often, but it is nowhere as compact as it could possibly be
>> for mostly binary data -- there are opportunities for much better
>> compression, made possible by the structured reuse of certain byte
>> sequences in the Bitcoin wire protocol.
>>
>> On a Bitcoin wire connection, we might see several related
>> transactions reorganizing cash in a set of addresses, and therefore,
>> several reuses of a 20-byte address. Or we might see a 200-byte
>> transaction get transmitted, followed by the same transaction,
>> repeated in a block. Ideally, we'd learn the sequence that may be
>> repeated later on, all at once (e.g. a Bitcoin address or a
>> transaction), and replace it with a short number, referring back to
>> the long sequence. In the example above, if we knew that "abcdf" was a
>> UNIT that would likely be repeated, we would put it into the
>> compression table as a whole, instead of relying on repetition to get
>> it into the table one extra byte at a time. That may let us compress
>> the original sequence down to "abcdfd<257 for cd><256 for abcdf><256
>> for abcdf>" from the get go.
>>
>> Yet the LZ variants I know of will need to see a 200-byte sequence
>> repeated **199 times** in order to develop a single, reusable,
>> 200-byte long subsequence in the compression table.
>>
>> So, a Bitcoin-specific compressor can perhaps do significantly better,
>> but is it a good idea? Let's argue both sides.
>>
>> Cons:
>>
>> On the one hand, Bitcoin-specific compressors will be closely tied to
>> the contents of messages, which might make it difficult to change the
>> wire format later on -- changes to the wire format may need
>> corresponding changes to the compressor.  If the compressor cannot be
>> implemented cleanly, then the protocol-agnostic, off-the-shelf
>> compressors have a maintainability edge, which comes at the expense of
>> the compression ratio.
>>
>> Another argument is that compression algorithms of any kind should be
>> tested thoroughly before inclusion, and brand new code may lack the
>> maturity required. While this argument has some merit, all outputs are
>> verified separately later on during processing, so
>> compression/decompression errors can potentially be detected. If the
>> compressor/decompressor can be structured in a way that isolates
>> bitcoind from failure (e.g. as a separate process for starters), this
>> concern can be remedied.
>>
>> Pros:
>>
>> The nature of LZ compressors leads me to believe that much higher
>> compression ratios are possible by building a custom, Bitcoin-aware
>> compressor. If I had to guess, I would venture that compression ratios
>> of 2X or more are possible in some cases. In some sense, the "O(1)
>> block propagation" idea that Gavin proposed a while ago can be seen as
>> extreme example of a Bitcoin-specific compressor, albeit one that
>> constrains the order of transactions in a block.
>>
>> Compression can buy us some additional throughput at zero cost, modulo
>> code complexity.
>> Given the amount of acrimonious debate over the block size we have all
>> had to endure, it seems
>> criminal to leave potentially free improvements on the table. Even if
>> the resulting code is
>> deemed too complex to include in the production client right now, it
>> would be good to understand
>> the potential for improvement.
>>
>> How to Do It
>>
>> If we want to compress Bitcoin, a programming challenge/contest would
>> be one of the best ways to find the best possible, Bitcoin-specific
>> compressor. This is the kind of self-contained exercise that bright
>> young hackers love to tackle. It'd bring in new programmers into the
>> ecosystem, and many of us would love to discover the limits of
>> compressibility for Bitcoin bits on a wire. And the results would be
>> interesting even if the final compression engine is not enabled by
>> default, or not even merged.
>>
>
>
>
> _______________________________________________
> bitcoin-dev mailing list
> bitcoin-dev@lists.linuxfoundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/bitcoin-dev
>