A few days ago, a pull re­quest to sta­bil­ise one of my fa­vour­ite lib­rary func­tions, [T]::align_to has landed. Since the func­tion, to the best of my know­ledge, has no coun­ter­parts in other lan­guages1, I really wanted to de­tail what it took to make this func­tion pos­sible. [T]::align_to de­scribes it­self as such:

pub unsafe fn align_to<U>(&self) -> (&[T], &[U], &[T])

Trans­mute the slice to a slice of an­other type, en­sur­ing align­ment of the types is main­tained.

This de­scrip­tion is fairly terse and cryptic, but is also the best I could come up for what the func­tion does. The doc­u­ment­a­tion con­tin­ues:

This method splits the slice into three dis­tinct slices: pre­fix, cor­rectly aligned middle slice of a new type, and the suf­fix slice. The method does a best ef­fort to make the middle slice the greatest length pos­sible for a given type and in­put slice, but only your al­gorith­m’s per­form­ance should de­pend on that, not its cor­rect­ness.

The last sen­tence hints at [T]::align_to be­ing made spe­cific­ally to aid in op­tim­isa­tion. For ex­ample, al­gorithms op­er­at­ing on byte buf­fers be­come much faster if they are able to take ad­vant­age of the SIMD op­er­a­tions provided by most of the mod­ern com­puter ar­chi­tec­tures. Alas, these op­er­a­tions also tend to im­pose strict align­ment re­quire­ments for the data they op­er­ate on, mak­ing these op­er­a­tions im­possible to ap­ply to byte buf­fers without first align­ing them as re­quired. With [T]::align_to it be­comes very easy to do:

fn my_algorithm(bytes: &[u8]) {
    unsafe {
        let (prefix, simd, suffix) = bytes.align_to::<__m128>();
        less_fast_algorithm_for_bytes(prefix);
        more_fast_simd_algorithm(simd);
        less_fast_algorithm_for_bytes(suffix);
    }
}

Align­ing a byte buf­fer like this is the com­mon case and this spe­cific case is fairly easy to im­ple­ment by hand. The dif­fi­culties (and dif­fer­ences) rear their head when T stored in the slice is any­thing but a byte. [T]::align_to::<U> is cap­able of align­ing the buf­fer for any T and U and does it about as ef­fi­ciently as pos­sible. Alas, the im­ple­ment­a­tion ended up be­ing hard to un­der­stand and re­view. Hope­fully, this post will be suf­fi­cient in help­ing oth­ers to un­der­stand how the im­ple­ment­a­tion works while also be­ing use­ful to a curi­ous passerby.

The im­ple­ment­a­tion of [T]::align_to

At its core, the al­gorithm splits the in­put slice into three dis­joint slices. The length of the first slice, and off­set of the 2nd slice, is cal­cu­lated by an­other, cur­rently un­stable, func­tion pointer::align_offset. This func­tion solves the fol­low­ing lin­ear con­gru­ence equa­tion for the vari­able oo:

p+so0modap + so \equiv 0 \mod a

Here pp is the ad­dress of the first ele­ment in the in­put slice, aa is the align­ment of the out­put type U, ss is the stride of the type T. oo is there­fore the “off­set”, in num­ber of ele­ments of type T, to an aligned ele­ment of type U. Ex­pressed in words, this equa­tion says that the ad­dress pp, ad­vanced by soso bytes must be aligned to aa. The for­mula to solve this equa­tion for oo looks like this:

o=a(pmoda)gcd(s,a)×((sgcd(s,a))1moda) o = \frac{a - \left(p \mod a\right)}{\operatorname{gcd}(s, a)} \times \left(\left(\frac{s}{\operatorname{gcd}(s, a)}\right)^{-1} \mod a\right)

oo ob­tained by solv­ing this equa­tion still needs ex­tra pro­cessing as it is not guar­an­teed to be pos­it­ive nor smal­lest pos­it­ive solu­tion for the equa­tions. These prop­er­ties are re­quired to en­sure that the middle slice ends up con­tain­ing the greatest num­ber of ele­ments pos­sible. To ob­tain a num­ber of T ele­ments sat­is­fy­ing these prop­er­ties, it is ne­ces­sary to com­pute r=omodagcd(s,a)r = o \mod \frac{a}{\operatorname{gcd}(s, a)}. The first slice re­turned by this func­tion is then input_slice[..r].

Loc­a­tion of the second split is slightly less dif­fi­cult to ar­rive at. lcm(ts,us)\operatorname{lcm}\left(t_s, u_s\right), where tst_s and usu_s are size_of::<T>() and size_of::<U>() re­spect­ively, tells us the ne­ces­sary pointer align­ment for the 3rd slice. This align­ment en­sures that the 2nd slice con­tains a whole num­ber of ele­ments and that 3rd slice be­gins with a proper ele­ment of type T. To en­sure that the length of the middle slice is max­im­ised, last slice should con­tain at most usgcd(ts,us)\frac{u_s}{\operatorname{gcd}\left(t_s, u_s\right)} ele­ments. With this in­form­a­tion in hand, ob­tain­ing the off­set and length of the second slice be­comes trivi­al.

Now, it is pretty clear that a large num­ber of fairly ex­pens­ive op­er­a­tions are in­volved in split­ting the in­put slice into parts:

  • Mod­ulo and di­vi­sion are well known as be­ing a fairly ex­pens­ive in­struc­tions to ex­ecute tak­ing at least a few dozen of cycles to re­tire;
  • Al­gorithms for cal­cu­lat­ing the greatest com­mon di­visor and least com­mon mul­tiple are non-trivi­al;
  • Al­gorithm to cal­cu­late a mod­ulo in­verse (x1modyx^{-1} \mod y) is even more com­plic­ated…

No way a na­ive im­ple­ment­a­tion is ap­pro­pri­ate for a func­tion in­ten­ded to be used in op­tim­isa­tion. In most cases it would take longer to ex­ecute [T]::align_to than to ex­ecute some al­gorithm for Ts in the first place! Luck­ily, cer­tain know­ledge, such as align­ments be­ing a power of two, al­low us to em­ploy some tricks to im­ple­ment these op­er­a­tions in a more ef­fi­cient man­ner.

Mod­ulo and di­vi­sion with a power of two right hand side

amod2ka \mod 2^k and a2k\frac{a}{2^k}, can be re­writ­ten as a(2k1)a \land (2^k - 1) and aka \gg k re­spect­ively. This ex­ploits the bin­ary rep­res­ent­a­tion of un­signed in­tegers and will com­plete in one or two cycles, com­pared to a few dozen for the ori­ginal op­er­a­tion.

For the sake of ex­ample con­sider a 4-bit in­teger with bits b3b_3, b2b_2, b1b_1 and b0b_0 di­vided by 2k2^k:

b3×23+b2×22+b1×21+b0×202k=b3×23k+b2×22k+b1×21k+b0×20k\frac{b_3 \times 2^3 + b_2 \times 2 ^ 2 + b_1 \times 2 ^ 1 + b_0 \times 2 ^ 0}{2^k} = b_3 \times 2^{3-k} + b_2 \times 2 ^ {2 - k} + b_1 \times 2 ^ {1 - k} + b_0 \times 2 ^ {0 - k}

Know­ing that a 4-bit in­tegers can­not rep­res­ent any di­gits other than b3b_3 to b0b_0 and that all the other di­gits are dropped (or “trun­cated”), it be­comes fairly easy to see why di­vi­sion by 2k2^k and right shift by kk po­s­i­tions are equi­val­ent. This trivi­ally ex­tends to un­signed in­tegers of any width.

In or­der to see why a num­ber mod­ulo 2k2^k is equi­val­ent to a num­ber masked by (2k1)(2^k - 1), first con­sider that 2k12^k - 1 will have k low­est bits set to 11. For ex­ample:

231=7=11122^3 - 1 = 7 = 111_2 281=255=1111111122^8 - 1 = 255 = 1111\ 1111_2 2161=65535=111111111111111122^16 - 1 = 65535 = 1111\ 1111\ 1111\ 1111_2

An un­signed in­teger, masked with a mask like this will be able to rep­res­ent num­bers from 0 to 2k12^k-1, in­clus­ive. This hap­pens to ex­actly match the range of rep­res­ent­able in­tegers mod­ulo 2k2 ^ k. This, much like the di­vi­sion op­tim­isa­tion, comes out from how the in­tegers are rep­res­en­ted in bin­ary.

The com­piler will usu­ally re­write these op­er­a­tions to a more ef­fi­cient ver­sion by it­self, how­ever it can only do so when it has the know­ledge about val­ues be­ing spe­cial. In our case, the com­piler would not ne­ces­sar­ily have this in­form­a­tion at its dis­pos­al, so it was ne­ces­sary to ap­ply this trans­form­a­tion manu­ally in many cases.

gcd\operatorname{gcd} and lcd\operatorname{lcd} when one of the op­er­ands is a power of two

gcd(a,b)\operatorname{gcd}(a, b) where either one of aa or bb is a power of two will it­self be 2k2^k for some kk. The num­ber of 22 mul­tiples in an in­teger can be found by count­ing the num­ber of con­sec­ut­ive un­set bits start­ing from the least sig­ni­fic­ant bit. This use­ful prop­erty comes from the bin­ary rep­res­ent­a­tion of un­signed in­tegers. Fur­ther­more lcd(a,b)=abgcd(a,b)\operatorname{lcd}(a, b) = \frac{ab}{\operatorname{gcd}\left(a, b\right)}. To cal­cu­late both gcd(a,b)\operatorname{gcd}(a, b) and lcd(a,b)\operatorname{lcd}(a, b):

let k = a.trailing_zeros().min(b.trailing_zeros());
let gcd = 1 << k;
let lcd = (a * b) >> k;

Un­like the na­ive al­gorithms, this can get the job done in just a few cycles. Even some un­com­mon tar­get ma­chine with no nat­ive in­struc­tion for trailing_zeros will man­age to ex­ecute this in 40 cycles or so. Still some­what faster than a single in­teger di­vi­sion op­er­a­tion, which would be ne­ces­sary to im­ple­ment other al­gorithms.

Mod­u­lar mul­ti­plic­at­ive in­verse when the mod­ulo is a power of two

The high­light: a trick to ef­fi­ciently com­pute a1mod2ka^{-1} \mod 2^k. Nor­mally, mod­u­lar mul­ti­plic­at­ive in­verse re­quires an it­er­at­ive al­gorithm such as the ex­ten­ded Eu­c­lidean al­gorithm. Such al­gorithms in­volve at least a di­vi­sion and a mod­ulo op­er­a­tion in their in­ner loop and re­quire com­par­at­ively many it­er­a­tions to com­plete. As a spe­cial case, a1mod2ka^{-1} \mod 2^k can be com­puted in O(log2log2k)\operatorname{O}(\log_2\log_2k) by us­ing a fairly un­known tech­nique de­scribed in this sci.crypt post. To sum­mar­ize the fi­nal al­gorithm:

  1. Start at a small mod­ulo such as 2i=242^i = 2^4 for which it is cheap to either store the table of in­verses or com­pute them;
  2. Look up in a table (or com­pute) the in­verse g=a1mod2ig = a^{-1} \mod 2^i;
  3. Com­pute g=a1mod2i2=g(2ag)mod2i2g' = a^{-1} \mod 2^{i^2} = g \left(2 - ag\right) \mod 2^{i^2};
  4. If i2>ki^2 > k, a1mod2k=gmod2ka^{-1} \mod 2^k = g' \mod 2^k, oth­er­wise re­peat step 3 with ggg \leftarrow g' and ii2i \leftarrow i^2.

Start­ing with i=4i = 4, com­pu­ta­tion of a1mod2256a ^ {-1} \mod 2^256 needs 2 it­er­a­tions only. Since mag­nitude of align­ments is usu­ally low, for the pur­poses of [T]::align_to at most one it­er­a­tion will usu­ally be ne­ces­sary. Without this al­gorithm [T]::align_to would most likely ended up be­ing en­tirely use­less!

Everything to­gether

These tricks com­bined with com­piler’s own op­tim­isa­tions work to­gether to make [T]::align_to a terse list of cheap in­struc­tions, some­what close to what one would get if writ­ten by hand, in highly op­tim­ised as­sembly. Un­like the tra­di­tional ap­proaches to the prob­lem which in­teg­rate such “split­ting” into the hot loop, com­put­ing the slices up front may en­able mis­cel­laneous op­tim­isa­tions as well (e.g. it­er­a­tion with point­ers is easier to op­tim­ise). The com­piler is able to move com­pu­ta­tion of sep­ar­ate slices around – to where it makes more sense and even re­move the com­pu­ta­tions ne­ces­sary to pro­duce slices which ul­ti­mately end up not be­ing used!

I con­sider the im­ple­ment­a­tion of [T]::align_to a great suc­cess, al­beit not a per­sonal one: the core for­mu­lae and tricks of this al­gorithm were figured out by Chris Mc­Don­ald. I merely im­ple­men­ted them. I’m very proud of this func­tion nev­er­the­less.


  1. C++’s std::align is a close re­l­at­ive to [T]::align_to; [T]::align_to sup­ports a strict su­per­set of func­tion­al­ity.↩︎