Natanael,
=== Metal Layers ===
One factor in chip cost other than transistor count is the number of layers required to route all the interconnects in the desired die area constraint. The need for fewer layers can result in less patent-able costs of layering technology. Fewer layers are quicker and easier to manufacture.
I'm not an expert in the field, and I can't vouch for the validity of the entirety of the paper, but this paper discusses various factors that impact chip cost design.
=== Early nonce mixing, Variable Length Input with Near Constant Work ===
To minimize asicboost like optimizations... the entirety of the input should be mixed with the nonce data ASAP. For example with Bitcoin as it is now, the 80 byte block header doesn't fully fit in one 64 byte SHA256 input block. This results in a 2nd SHA256 block input that only has 4 bytes of nonce and the rest constant that are mixed much later than the rest of the input... which allows for unexpected optimizations.
Solution: A hash algorithm that could have more linear computation time vs input size would be a 2 stage algorithm:
1. 1st stage Merkle tree hash to pre-lossy-mix-compress the variable length input stream to the size of the 2nd stage state vector. Each bit of input should have about equal influence on each of the output bits. (Minimize information loss, maximize mixed-ness).
2. Multi-round mixing of the 2nd stage, where this stage is significantly more work than the 1st stage.
This is somewhat done already in Bitcoin by the PoW doing SHA256 twice in serial. The first time is pretty much the merkle tree hash (a node with two children), and then the second time is the mult-round mixing. If the Bitcoin PoW did SHA256 three or four times or more, then asicboost like optimizations would have less of an effect.
In actual hardware, assuming a particular input length for the design can result in a significantly more optimized design than creating hardware that can handle a variable length input. So your design goal of "not linear in performance relative to input size" to me seems to be a hard one to attain... in practical, to support very large input sizes in a constant work fashion requires a trade off between memory/parallelization and die space. I think it would be better to make an assumption about the block header size, such as that it is exactly 80 bytes, or, at least something reasonable like the hardware should be able to support a block header size <= 128 bytes.
Cheers,
Praxeology Guy