Omitting nBits entirely seems reasonable, I wrote up a possible implementation here. The downside is that it is more complex because it leaks into the validation code. The extra 4 byte savings is certainly nice though.
Can you elaborate on how parallel header fetching might work? getheaders requests could probably already be pipelined, where the node requests the next 2,000 headers before processing the current batch (though would make sense to check that they are all above min difficulty first).