So I checked filter sizes (as a proportion of block size) for each of the sub-filters. The graph is attached.
My interpretation: the first ~120,000 blocks are so small that Golomb-Rice coding can't compress the filters very well, which is why the filter sizes are so high relative to block size. The exception is the input filter: since the coinbase input is skipped, many of them have 0 elements. After block 120,000 or so, though, the filter compression converges pretty quickly to near the optimal value.

The encouraging thing here is that if you compare the combined size of the separated filters against the size of a single filter containing all of them (currently known as the basic filter), they are pretty much the same. The mean ratio between them after block 150,000 is 99.4%. So basically, not much compression efficiency is lost by separating the basic filter into sub-filters.
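For concreteness, here's a minimal sketch (Python, not the actual measurement script) of the ratio being reported. It assumes we already have per-block byte sizes for each sub-filter and for the combined basic filter; the function and variable names are just for illustration.

    # Rough sketch of the per-block comparison, assuming filter sizes in bytes.

    def size_ratio(sub_filter_sizes, basic_filter_size):
        """Combined size of the separated sub-filters relative to the basic filter."""
        return sum(sub_filter_sizes) / basic_filter_size

    def mean_ratio_after(ratios_by_height, start_height=150_000):
        """Mean of the per-block ratios for all blocks at or above start_height."""
        selected = [r for h, r in ratios_by_height.items() if h >= start_height]
        return sum(selected) / len(selected)

    # Example: sub-filters totalling 994 bytes vs a 1000-byte basic filter
    # gives a ratio of 0.994, i.e. the ~99.4% figure quoted above.
    print(size_ratio([400, 350, 244], 1000))  # 0.994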