Bignum multiplication improvement for ARM

For Gnuk, speeding up the RSA routine is worthwhile.

Last week, I improved it a bit. Digital signing with Gnuk took 1.78 seconds in version 0.12; with the change, it takes 1.72 seconds. (Measured with the time command on gpg --clearsign, so the figure includes calculation time on the host and communication time.)

Then I improved it more. With that change, it takes 1.63 seconds.

Further, I improved it again. With today's change, it takes 1.54 seconds.

And I improved it once more. With a Gnuk-specific version, it takes just 1.48 seconds.

To summarize the changes:

  • Use UMULL (32-bit x 32-bit => 64-bit) instead of UMLAL (multiply and accumulate)
  • Load and store more registers at a time using LDM and STM
  • Use GCC inline assembly constraints for registers, condition codes, and memory
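
As a rough illustration of the inner loop being tuned, here is a minimal sketch in portable C. It is not Gnuk's actual code, and the function name mul_add_row is hypothetical. It adds s[0..n-1] * b into d[0..n-1] and returns the carry-out word. The 32-bit x 32-bit => 64-bit product in the loop is the operation that UMULL provides on ARM. A hand-written assembly version would also batch the loads and stores with LDM and STM, which plain C cannot express directly.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper (not from Gnuk or PolarSSL):
   d[i] += s[i] * b for i in [0, n); returns the final carry word. */
uint32_t mul_add_row(uint32_t *d, const uint32_t *s, size_t n, uint32_t b)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* A 32x32 -> 64 product plus two 32-bit addends always fits
           in 64 bits: (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1. */
        uint64_t t = (uint64_t)s[i] * b + d[i] + carry;
        d[i] = (uint32_t)t;           /* low word goes back to memory */
        carry = (uint32_t)(t >> 32);  /* high word feeds the next limb */
    }
    return carry;
}
```

The caller is expected to add the returned carry into the next limb of the destination.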

Note that this is a 2048-bit RSA computation, so because of CRT it comes down to 1024-bit by 1024-bit multiplication. For operands this short, Karatsuba (or any divide-and-conquer strategy) doesn't make sense, but tuning in assembly language is important.
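
To put rough numbers on this, assuming 32-bit limbs: a 1024-bit operand is 32 limbs, so schoolbook multiplication needs 32 x 32 = 1024 word multiplications. One level of Karatsuba would need roughly 3 x 16 x 16 = 768 word multiplications, plus extra additions, subtractions, and carry handling. The sketch below is a plain schoolbook multiplication that counts its word multiplications. The names mul_schoolbook and LIMBS are hypothetical, and this is not the tuned routine.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define LIMBS 32  /* 1024 bits / 32 bits per limb (assumed limb size) */

/* Hypothetical sketch: r[0..2n-1] = a[0..n-1] * b[0..n-1], schoolbook method.
   Returns the number of 32x32 -> 64 word multiplications performed. */
unsigned long mul_schoolbook(uint32_t *r, const uint32_t *a,
                             const uint32_t *b, size_t n)
{
    unsigned long word_muls = 0;
    memset(r, 0, 2 * n * sizeof(uint32_t));
    for (size_t j = 0; j < n; j++) {
        uint32_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = (uint32_t)(t >> 32);
            word_muls++;
        }
        r[j + n] = carry;  /* no earlier row has written this limb */
    }
    return word_muls;
}
```

Called with n = LIMBS, the function returns 1024, which matches the count above.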

Here is my ticket: http://polarssl.org/trac/ticket/26