While there is no mention of virtualization in the side-channel article, the FLUSH+RELOAD paper [1] mentions virtualization and claims the clflush instruction works not only towards processes on the same OS, but also against processes in a separate guest OS if executed on the host OS (type 2 hypervisor) [2]. It also works if executed from within another guest OS (though that reduces the efficiency of the attack) [3].

Both the authors [4] and Vulnerability Note VU#976534 [5] claim disabling hypervisor memory page de-duplication prevents the attack. This could perhaps be a first step for bitcoin companies running their software on shared hosts; demand their allocated instances to be on hosts with this disabled. Question is how common it is for virtualization providers to offer that as an option.

TRESOR is is only applicable if running in a non-virtualized OS [6].

While TRESOR only implements AES, it seems it could work for ECDSA as well, as they use the four x86 debug registers to fit a 256 bit privkey [7] for the entire machine uptime, and then use other registers when doing the actual AES ops. They use the Intel AES-NI instruction set though, and since there is no corresponding instruction set for EC extra work would be required to manually implement EC math in assembler.

They actually do what Mike Hearn mentioned and disable preemption in Linux (their code runs in kernel space; they patched the kernel) to ensure atomicity. Not only do they manage to protect against memory attacks (and RAM/cache timing attacks) from other processes running on the same host, but even from root on the same host (from userland, the debug registers are only accessible through ptrace, which they patched, and they also disabled LKM & KMEM).

One could imagine different levels of TRESOR-like ECDSA with different tradeoffs of complexity vs security. For example, if one is fine with keeping the privkey(s) in RAM but want to avoid cache timing attacks, the signing could be implemented as a userspace program holding key(s) in RAM together with a kernel module providing a syscall for signing. Signing is then run with preemption using only x86 registers for intermediate data and then using e.g. movntps [8] to write to RAM without data being cached. The benefit of this compared with the full TRESOR approach is that it would not require a patched kernel, only a kernel module. It would also be simpler to implement compared to keeping the privkey in the debug registers for the entire machine uptime, especially if multiple privkeys are used. It would not protect against root though, since an adversary getting root could load their own kernel module and read the registers.

To handle multiple keys (maybe as one-time-use) and get full TRESOR benefits, one could perhaps (with the original TRESOR approach, i.e. with patched kernel) store a BIP 0032 starting string / seed + counter in the debug registers and have the atomic kernel code generate new keys and do the signing.

Cheers,
Gustav Simonsson

1. http://eprint.iacr.org/2013/448.pdf
2. Page 1 of [1]
3. page 5 of [1]
4. page 8 (end of conclusions section) of [1]
5. http://www.kb.cert.org/vuls/id/976534
6. page 8, "3.2 Hardware compatibility", https://www.usenix.org/legacy/event/sec11/tech/full_papers/Muller.pdf
7. page 3, "2.2 Key Management" of [6]
8. page 1041 of http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf

On Thu, Mar 6, 2014 at 9:38 AM, Mike Hearn <mike@plan99.net> wrote:

I'm wondering about whether (don't laugh) moving signing into the kernel and then using the MTRRs to disable caching entirely for a small scratch region of memory would also work. You could then disable pre-emption and prevent anything on the same core from interrupting or timing the signing operation.

However I suspect just making a hardened secp256k1 signer implementation in userspace would be of similar difficulty, in which case it would naturally be preferable.

On Wed, Mar 5, 2014 at 11:25 PM, Gregory Maxwell <gmaxwell@gmail.com> wrote:

On Wed, Mar 5, 2014 at 2:14 PM, Eric Lombrozo <elombrozo@gmail.com> wrote:
> Everything you say is true.
>
> However, branchless does reduce the attack surface considerably - if nothing else, it significantly ups the difficulty of an attack for a relatively low cost in program complexity, and that might still make it worth doing.

Absolutely. I believe these things are worth doing.

My comment on it being insufficient was only that "my signer is
branchless" doesn't make other defense measures (avoiding reuse,
multsig with multiple devices, not sharing hardware, etc.)
unimportant.

> As for uniform memory access, if we avoided any kind of heap allocation, wouldn't we avoid such issues?

No. At a minimum to hide a memory timing side-channel you must perform
no data dependent loads (e.g. no operation where an offset into memory
is calculated). A strategy for this is to always load the same values,
but then mask out the ones you didn't intend to read... even that I'd
worry about on sufficiently advanced hardware, since I would very much
not be surprised if the processor was able to determine that the load
had no effect and eliminate it! :) )

Maybe in practice if your data dependencies end up only picking around
in the same cache-line it doesn't actually matter... but it's hard to
be sure, and unclear when a future optimization in the rest of the
system might leave it exposed again.

(In particular, you can't generally write timing sign-channel immune
code in C (or other high level language) because the compiler is
freely permitted to optimize things in a way that break the property.
... It may be _unlikely_ for it to do this, but its permitted— and
will actually do so in some cases—, so you cannot be completely sure
unless you check and freeze the toolchain)

> Anyhow, without having gone into the full details of this particular attack, it seems the main attack point is differences in how squaring and multiplication (in the case of field exponentiation) or doubling and point addition (in the case of ECDSA) are performed. I believe using a branchless implementation where each phase of the operation executes the exact same code and accesses the exact same stack frames would not be vulnerable to FLUSH+RELOAD.

I wouldn't be surprised.

------------------------------------------------------------------------------
Subversion Kills Productivity. Get off Subversion & Make the Move to Perforce.
With Perforce, you get hassle-free workflows. Merge that actually works.
Faster operations. Version large binaries. Built-in WAN optimization and the
freedom to use Git, Perforce or both. Make the move to Perforce.
http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
_______________________________________________
Bitcoin-development mailing list
Bitcoin-development@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bitcoin-development

------------------------------------------------------------------------------
Subversion Kills Productivity. Get off Subversion & Make the Move to Perforce.
With Perforce, you get hassle-free workflows. Merge that actually works.
Faster operations. Version large binaries. Built-in WAN optimization and the
freedom to use Git, Perforce or both. Make the move to Perforce.
http://pubads.g.doubleclick.net/gampad/clk?id=122218951&iu=/4140/ostg.clktrk
_______________________________________________
Bitcoin-development mailing list
Bitcoin-development@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bitcoin-development