Saturday, August 17, 2013

SIMD implementation of dot-product.
Benchmarks

Dot product can be auto-vectorized only with fast-math or similar compiler option. fast-math allows compiler to use associative floating point transformations. Other math functions like exponent can be damaged consequently.
Possible solutions:
  1. Compile fast_math code from other program separately and then link it. This is easy solution. However this is a step back to C.
  2. To introduce a @fast_math attribute. This is hard to realize. But I hope this will be done for future compilers.
  3. Do vectorization yourself. In that case you need to realize SIMD accessory functions like unaligned load.

So I wrote implementations of dot product for real and complex numbers.

Code

Dot product for real numbers:
F dotProduct(F)(in F[] a, in F[] b) @trusted
if(is(F == double) || is(F == float))
in
{
assert(a.length == b.length);
}
out(result)
{
assert(!result.isNaN);
}
body
{
enum N = MaxVectorSizeof / F.sizeof;
alias VP = const(Vector!(F[N]))*;
auto ap = a.ptr, bp = b.ptr;
const n = a.length;
const end = ap + n;
const ste = ap + (n & -N);
const sbe = ap + (n & -N*4);
Vector!(F[N]) s0 = 0, s1 = 0, s2 = 0, s3 = 0;
for(; ap < sbe; ap+=4*N, bp+=4*N)
{
s0 += load(cast(VP)ap+0) * load(cast(VP)bp+0);
s1 += load(cast(VP)ap+1) * load(cast(VP)bp+1);
s2 += load(cast(VP)ap+2) * load(cast(VP)bp+2);
s3 += load(cast(VP)ap+3) * load(cast(VP)bp+3);
}
s0 = (s0+s1)+(s2+s3);
for(; ap < ste; ap+=N, bp+=N)
s0 += load(cast(VP)ap+0) * load(cast(VP)bp+0);
//reduce to scalar
F s = toScalarSum(s0);
for(; ap < end; ap++, bp++)
s += ap[0] * bp[0];
return s;
}

Complex version is similar. The main loop:
for(; ap < sbe; ap+=4*M, bp+=4*M)
{
auto a0 = load(cast(VP)ap+0);
auto a1 = load(cast(VP)ap+1);
auto a2 = load(cast(VP)ap+2);
auto a3 = load(cast(VP)ap+3);
auto b0 = load(cast(VP)bp+0);
auto b1 = load(cast(VP)bp+1);
auto b2 = load(cast(VP)bp+2);
auto b3 = load(cast(VP)bp+3);
s0 += a0 * b0;
s1 += a1 * b1;
s2 += a2 * b2;
s3 += a3 * b3;
//swap real and imaginary parts
a0 = swapReIm(a0);
a1 = swapReIm(a1);
a2 = swapReIm(a2);
a3 = swapReIm(a3);
r0 += a0 * b0;
r1 += a1 * b1;
r2 += a2 * b2;
r3 += a3 * b3;
}

There are gdc and ldc implementations, but ldc implementation was not tested. Source code is available at GitHub.

Benchmarks

Processor Intel i5-4570, Haswell
Instruction set AVX2
System Ubuntu 13.10
Compiler GDC-4.8.1
Compiler flags -march=native -fno-bounds-check -frename-registers -frelease -O3