Demonstration numeric kernel for the runtime dispatch template.
convolve1d() computes the same full convolution as a simple nested-loop R
definition: for every pair of 1-based positions (i, j) it adds a[i] * b[j]
to out[i + j - 1], so out has length(a) + length(b) - 1 elements. SIMD
backends vectorize the inner multiply-add over b and the shifted output
window.
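
The accumulation described above can be sketched as a scalar reference in C++. This is only an illustration under assumptions: the name convolve1d_reference, the use of std::vector<double>, and the empty-input behaviour are hypothetical and are not taken from the template's actual backends. Indices here are 0-based, so the R target out[i + j - 1] becomes out[i + j].

```cpp
#include <cstddef>
#include <vector>

// Hypothetical scalar reference for the full convolution; a sketch only,
// not the template's real dispatch target.
std::vector<double> convolve1d_reference(const std::vector<double>& a,
                                         const std::vector<double>& b) {
    // Assumption: an empty input yields an empty result.
    if (a.empty() || b.empty()) return {};

    // Full convolution has length(a) + length(b) - 1 elements.
    std::vector<double> out(a.size() + b.size() - 1, 0.0);

    for (std::size_t i = 0; i < a.size(); ++i) {
        const double ai = a[i];
        // Output window shifted by i; it spans b.size() contiguous elements.
        double* window = out.data() + i;
        // Inner multiply-add over b: the loop a SIMD backend would vectorize.
        for (std::size_t j = 0; j < b.size(); ++j) {
            window[j] += ai * b[j];  // 0-based out[i + j]
        }
    }
    return out;
}
```

For example, with a = {1, 2, 3} and b = {0, 1, 0.5} this sketch returns {0, 1, 2.5, 4, 1.5}. Keeping a[i] fixed in the inner loop means both b and the output window are read contiguously, which is what makes the inner multiply-add a natural unit for vectorization.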