Dot product can be auto-vectorized only with
Possible solutions:
So I wrote implementations of dot product for real and complex numbers.
Complex version is similar. The main loop:
There are gdc and ldc implementations, but ldc implementation was not tested. Source code is available at GitHub.
fast-math
or similar compiler option. fast-math
allows compiler to use associative floating point transformations. Other math functions like exponent can be damaged consequently.
Possible solutions:
- Compile
fast_math
code from other program separately and then link it. This is easy solution. However this is a step back to C. - To introduce a
@fast_math
attribute. This is hard to realize. But I hope this will be done for future compilers. - Do vectorization yourself. In that case you need to realize SIMD accessory functions like unaligned load.
So I wrote implementations of dot product for real and complex numbers.
Code
Dot product for real numbers:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
F dotProduct(F)(in F[] a, in F[] b) @trusted | |
if(is(F == double) || is(F == float)) | |
in | |
{ | |
assert(a.length == b.length); | |
} | |
out(result) | |
{ | |
assert(!result.isNaN); | |
} | |
body | |
{ | |
enum N = MaxVectorSizeof / F.sizeof; | |
alias VP = const(Vector!(F[N]))*; | |
auto ap = a.ptr, bp = b.ptr; | |
const n = a.length; | |
const end = ap + n; | |
const ste = ap + (n & -N); | |
const sbe = ap + (n & -N*4); | |
Vector!(F[N]) s0 = 0, s1 = 0, s2 = 0, s3 = 0; | |
for(; ap < sbe; ap+=4*N, bp+=4*N) | |
{ | |
s0 += load(cast(VP)ap+0) * load(cast(VP)bp+0); | |
s1 += load(cast(VP)ap+1) * load(cast(VP)bp+1); | |
s2 += load(cast(VP)ap+2) * load(cast(VP)bp+2); | |
s3 += load(cast(VP)ap+3) * load(cast(VP)bp+3); | |
} | |
s0 = (s0+s1)+(s2+s3); | |
for(; ap < ste; ap+=N, bp+=N) | |
s0 += load(cast(VP)ap+0) * load(cast(VP)bp+0); | |
//reduce to scalar | |
F s = toScalarSum(s0); | |
for(; ap < end; ap++, bp++) | |
s += ap[0] * bp[0]; | |
return s; | |
} |
Complex version is similar. The main loop:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
for(; ap < sbe; ap+=4*M, bp+=4*M) | |
{ | |
auto a0 = load(cast(VP)ap+0); | |
auto a1 = load(cast(VP)ap+1); | |
auto a2 = load(cast(VP)ap+2); | |
auto a3 = load(cast(VP)ap+3); | |
auto b0 = load(cast(VP)bp+0); | |
auto b1 = load(cast(VP)bp+1); | |
auto b2 = load(cast(VP)bp+2); | |
auto b3 = load(cast(VP)bp+3); | |
s0 += a0 * b0; | |
s1 += a1 * b1; | |
s2 += a2 * b2; | |
s3 += a3 * b3; | |
//swap real and imaginary parts | |
a0 = swapReIm(a0); | |
a1 = swapReIm(a1); | |
a2 = swapReIm(a2); | |
a3 = swapReIm(a3); | |
r0 += a0 * b0; | |
r1 += a1 * b1; | |
r2 += a2 * b2; | |
r3 += a3 * b3; | |
} |
There are gdc and ldc implementations, but ldc implementation was not tested. Source code is available at GitHub.
Benchmarks
Processor | Intel i5-4570, Haswell |
Instruction set | AVX2 |
System | Ubuntu 13.10 |
Compiler | GDC-4.8.1 |
Compiler flags | -march=native -fno-bounds-check -frename-registers -frelease -O3 |