On the edge of Curve25519 (safegcd256 for 32-bit machine)

19 x 31 = 589 >= 587

I implemented safegcd256 for Gnuk. It uses signed 31-bit integers, and it does 19 iterations for p = 2**255 - 19.

Modular inversion

In the computation of X25519, we need to compute the modular inversion at the last step to get the value of x in the affine coordinate. It approximately takes about 10% of cycles, when it's done by computing (p-2)-th power.

Current state of the art is the method called "safegcd". It's faster than computing (p-2)-th power.

Here are links:

Fast constant-time gcd and modular inversion: https://gcd.cr.yp.to/

Bounds on divsteps iterations in safegcd: https://github.com/sipa/safegcd-bounds#readme

The safegcd implementation in libsecp256k1 explained: https://github.com/bitcoin-core/secp256k1/blob/master/doc/safegcd_implementation.md

590 iterations for any 256-bit, but 587 iterations is enough for 2**255 - 19

The 32-bit safegcd implementation in libsecp256k1, it does 20 iterations with 30-bit, so that more than 590 iterations as a whole can be computed (20 times 30 equals to 600 >= 590).

For Curve25519, where it's p is 2**255 - 19, 587 iterations is enough, actually.

If we can compute with 31-bit at a time, it can be 19 iterations, since 19 times 31 equals to 589 >= 587.

19 iterations with 31-bit to achieve more than or equals to 587

Tricky part is the computation step with 31-bit.

Instead of the variable u, I use u_ which holds the value of u - 1. While u may be 2**31 which cannot be represented by int32_t, u_ is safe with int32_t.

Likewise, instead of the variable md, I use md_ which holds the value of md - 1. While md may be 2**31 which cannot be represented by int32_t, md_ is safe with int32_t.

I modified the expressions with u by u_, in a way that no overflow may occur. Likewise, I modified the expressions with md by md_.

19 is good for 2**255 - 19

About 5% (1 iterations of 20 iterations) computation cycles can be reduced in safegcd for 32-bit machine. At maximum, we can expect something like 0.5% improvement for X25519.

Well, it's not much. However, 19 iterations is good for 2**255 - 19.