I think you're trying to solve something a little out of scope,
really. Most of the issues aren't really issues for other clients
using established storage mechanisms (bdb, SQLite, etc.), and
they're using them precisely because this is a problem that people
have been working on for decades and a poor candidate for
reinvention. If you really look at what you're proposing, it's
fundamentally how bdb operates, except your indexing format is
usage-domain specific and you're in charge of all the
resource-management semantics, while at the same time missing many
of the newer features that make working with, recovering, and
diagnosing issues in the storage layer easier.
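To make that comparison concrete, here is a rough sketch of what the same tx index could look like if SQLite owned the storage and indexing; the schema and function names are purely illustrative, not anything from Armory or bitcoind, and the point is only that the B-tree, page cache, and crash recovery come with the library:

    #include <sqlite3.h>
    #include <cstdint>

    // Illustrative schema: the full 32-byte tx hash is the key, the value
    // is the (file, offset, size) triple.  SQLite maintains the on-disk
    // B-tree and its page cache instead of a hand-rolled in-RAM multimap.
    bool initTxIndex(sqlite3* db)
    {
        const char* ddl =
            "CREATE TABLE IF NOT EXISTS txindex ("
            "  txhash  BLOB PRIMARY KEY,"
            "  fileidx INTEGER,"
            "  start   INTEGER,"
            "  size    INTEGER"
            ");";
        return sqlite3_exec(db, ddl, nullptr, nullptr, nullptr) == SQLITE_OK;
    }

    bool insertRef(sqlite3* db, const uint8_t txHash[32],
                   uint16_t fileIdx, uint32_t start, uint32_t size)
    {
        sqlite3_stmt* stmt = nullptr;
        const char* sql =
            "INSERT OR REPLACE INTO txindex VALUES (?1, ?2, ?3, ?4);";
        if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK)
            return false;
        sqlite3_bind_blob(stmt, 1, txHash, 32, SQLITE_STATIC);
        sqlite3_bind_int(stmt, 2, fileIdx);
        sqlite3_bind_int64(stmt, 3, start);
        sqlite3_bind_int64(stmt, 4, size);
        bool ok = (sqlite3_step(stmt) == SQLITE_DONE);
        sqlite3_finalize(stmt);
        return ok;
    }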
Devs,
I have decided to upgrade Armory's blockchain utilities, partly
out of necessity due to a poor code decision I made before I
even decided I was making a client. In an effort to avoid such
mistakes again, I want to do it "right" this time around, and I
realize that this is a good discussion for all the devs who will
have to deal with this eventually...
The part I'm having difficulty with is the idea that, a few
years from now, it just may not be feasible to hold transaction
file-pointers in RAM, because even that would overwhelm
standard RAM sizes. Without any degree of blockchain
compression, I see that the most general, scalable solution is
probably a complicated one.
On the other hand, where this fails may be the point where we have
already predicted that the network will have to split into
"super-nodes" and "lite nodes." In that case, this discussion is
still a good one, just directed more towards the super-nodes.
But there may still be a point at which super-nodes don't have
enough RAM to hold this data...
(1) As for how small you can get the data: my original idea
was that the entire blockchain is stored on disk as blkXXXX.dat
files. I store all transactions as 10-byte "file-references."
The 10 bytes would be:
-- file index X in blkXXXX.dat (2 bytes)
-- tx start byte (4 bytes)
-- tx size in bytes (4 bytes)
The file-refs would be stored in a multimap indexed by the first
6 bytes of the tx-hash. In this way, when I search the
multimap, I potentially get a list of file-refs, and I might
have to retrieve a couple of tx from disk before finding the
right one, but it would be a good trade-off compared to storing
all 32 bytes (that's assuming that multimap nodes don't have too
much overhead).
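For concreteness, here is a minimal C++ sketch of that layout, assuming a packed 10-byte file-reference struct and a std::multimap keyed on the first 6 bytes of the tx-hash; all names here are hypothetical, not Armory's actual code:

    #include <cstdint>
    #include <cstring>
    #include <map>
    #include <vector>

    // Hypothetical 10-byte reference to a tx stored in blkXXXX.dat.
    #pragma pack(push, 1)
    struct FileRef {
        uint16_t fileIndex;   // X in blkXXXX.dat       (2 bytes)
        uint32_t txStart;     // byte offset of the tx  (4 bytes)
        uint32_t txSize;      // length of the tx       (4 bytes)
    };
    #pragma pack(pop)
    static_assert(sizeof(FileRef) == 10, "FileRef must stay 10 bytes");

    // Key: first 6 bytes of the 32-byte tx-hash, packed into a uint64_t.
    inline uint64_t shortKey(const uint8_t* txHash32)
    {
        uint64_t k = 0;
        std::memcpy(&k, txHash32, 6);   // only 6 of the 8 bytes are used
        return k;
    }

    using TxIndex = std::multimap<uint64_t, FileRef>;

    // Collect every candidate whose truncated hash matches; the caller
    // still reads each candidate from disk and compares the full 32-byte
    // hash to resolve collisions on the 6-byte prefix.
    std::vector<FileRef> findCandidates(const TxIndex& idx,
                                        const uint8_t* txHash32)
    {
        std::vector<FileRef> out;
        auto range = idx.equal_range(shortKey(txHash32));
        for (auto it = range.first; it != range.second; ++it)
            out.push_back(it->second);
        return out;
    }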
But even with this, if there are 1,000,000,000 transactions in
the blockchain and each node is probably 48 bytes (16 bytes of
key plus file-ref, plus map/container overhead), then you're
talking about 48 GB to track all the data in RAM. mmap() may
help here, but I'm not sure it's the right solution.
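As one illustration of where mmap() might fit, here is a minimal POSIX sketch that maps a blkXXXX.dat file read-only and copies out one transaction at a given offset, so only the pages actually touched need to be resident; the helper name and the simplified error handling are assumptions for illustration, and this addresses paging the raw block files rather than shrinking the 48 GB index itself:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdint>
    #include <vector>

    // Hypothetical helper: map a blkXXXX.dat file and copy out one tx.
    // The kernel pages data in on demand, so the whole file never has
    // to live in RAM at once.
    std::vector<uint8_t> readTx(const char* blkPath,
                                uint32_t txStart, uint32_t txSize)
    {
        int fd = open(blkPath, O_RDONLY);
        if (fd < 0)
            return {};

        struct stat st;
        if (fstat(fd, &st) < 0 ||
            (uint64_t)txStart + txSize > (uint64_t)st.st_size) {
            close(fd);
            return {};
        }

        void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                       // mapping stays valid after close
        if (base == MAP_FAILED)
            return {};

        const uint8_t* bytes = static_cast<const uint8_t*>(base);
        std::vector<uint8_t> tx(bytes + txStart, bytes + txStart + txSize);

        munmap(base, st.st_size);
        return tx;
    }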
(2) What other ways are there, besides some kind of blockchain
compression, to maintain a multi-terabyte blockchain, assuming
that storing references to each tx would overwhelm available
RAM? Maybe that assumption isn't necessary, but I think it
prepares for the worst.
Or maybe I'm too narrow in my focus. How do other people
envision this will be handled in the future? I've heard so many
vague notions of "well, we could do this or that, or it wouldn't
be hard to do that," but I haven't heard
any serious proposals for it. And while I believe that
blockchain compression will become ubiquitous in the future, not
everyone believes that, and there will undoubtedly be users/devs
that want to maintain everything under all
circumstances.
-Alan