I implemented a simple 2d separable convolution code:
void SeparableConvolution2D(const float * restrict mI, float * restrict mO, float * restrict mB,
                            int numRows, int numCols,
                            const float * restrict vKRow, int rowKernelLen,
                            const float * restrict vKCol, int colKernelLen)
{
    // mI (numRows x numCols)   - The input.
    // mO (numRowsO x numColsO) - The output.
    // mB (numColsO x numRows)  - The buffer (intermediate, transposed version of the row-filtered image).
    // Notes:
    // - Parallelization overhead is too big.
    // - GCC native SIMD is better than OpenMP.
    int rowKernelRad = rowKernelLen / 2; // Radius
    int colKernelRad = colKernelLen / 2; // Radius

    // Output dimensions for "Valid" convolution
    int numRowsO = numRows - (colKernelLen - 1);
    int numColsO = numCols - (rowKernelLen - 1);

    // Apply 1D filter along rows
    for (int r = 0; r < numRows; r++)
    {
        #pragma omp simd
        for (int c = 0; c < numColsO; c++)
        {
            float sum = 0.0f;
            for (int k = 0; k < rowKernelLen; k++)
            {
                sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
            }
            mB[c * numRows + r] = sum; // Store row result
        }
    }

    // Apply 1D filter along columns
    // The data in `mB` is transposed, hence again working along the rows
    for (int r = 0; r < numColsO; r++)
    {
        #pragma omp simd
        for (int c = 0; c < numRowsO; c++)
        {
            float sum = 0.0f;
            for (int k = 0; k < colKernelLen; k++)
            {
                sum += mB[(r * numRows) + (c - k + colKernelLen - 1)] * vKCol[k];
            }
            mO[c * numColsO + r] = sum; // Store final result
        }
    }
}
I want it to be SIMD efficient, yet portable.
Hence I don't want to use any SIMD intrinsics, and I don't want to add any dependency to the code.
I tried using OpenMP SIMD, though not with much success.
The objective is to vectorize the loop over c. Yet it seems the compiler only vectorizes the calculation of the sum variable.
Is there a way to achieve vectorization of the c loop? Any way to hint the compiler? Any OpenMP SIMD magic?
The target compiler is GCC.
Optional Transformation
For instance, one transformation I thought of is explicit unrolling. The first filtering loop would then become:
for (int r = 0; r < numRows; r++)
{
    const float * vI = mI + (r * numCols); // Pointer to the current row
    // #pragma omp simd
    #pragma GCC ivdep
    #pragma vector always
    for (int c = 0; c + 4 <= numColsO; c += 4) // Remainder columns need a scalar tail loop
    {
        // Load 4 pixels from the input image
        const float * vI0 = &vI[c + (rowKernelLen - 1)];
        const float * vI1 = &vI[c + (rowKernelLen - 1) + 1];
        const float * vI2 = &vI[c + (rowKernelLen - 1) + 2];
        const float * vI3 = &vI[c + (rowKernelLen - 1) + 3];
        // Apply the kernel to each pixel
        float sum0 = 0.0f;
        float sum1 = 0.0f;
        float sum2 = 0.0f;
        float sum3 = 0.0f;
        for (int k = 0; k < rowKernelLen; k++)
        {
            sum0 += vI0[-k] * vKRow[k];
            sum1 += vI1[-k] * vKRow[k];
            sum2 += vI2[-k] * vKRow[k];
            sum3 += vI3[-k] * vKRow[k];
        }
        mB[ c      * numRows + r] = sum0; // Store row result
        mB[(c + 1) * numRows + r] = sum1; // Store row result
        mB[(c + 2) * numRows + r] = sum2; // Store row result
        mB[(c + 3) * numRows + r] = sum3; // Store row result
    }
}
The motivation is to "hint" the compiler that it should broadcast the kernel value vKRow[k] and multiply it against multiple image pixels at once. Will that achieve this?
Update
I accepted @JérômeRichard's answer, though the exact code I use is the one in my own answer. Still, @JérômeRichard's answer is what pointed me to ISPC, which is the tool that forces the correct SIMD in this case.
"The objective is to vectorize the loop over c."
This is the standard way of doing that in OpenMP:
// Apply 1D filter along rows
for (int r = 0; r < numRows; r++)
{
    #pragma omp simd
    for (int c = 0; c < numColsO; c++)
    {
        float sum = 0.0f;
        for (int k = 0; k < rowKernelLen; k++)
            sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
        mB[c * numRows + r] = sum;
    }
}
In practice though, most compilers cannot vectorise this loop because it is currently too complicated for them. More specifically, this is because of the inner loop with a runtime-defined number of iterations, as pointed out by Homer512 in the comments. For example, ICC 2021.10 succeeds in vectorising the code correctly (i.e. as requested); ICX 2024.0 also vectorises it, though the warning it raises indicates the vectorisation is not quite right; GCC 14.2 and Clang 20.1 fail to (efficiently) vectorise the code in this case. See on Godbolt.
"Yet it seems the compiler only vectorizes the calculation of the sum variable."
This is because, AFAIK, GCC internally relies on auto-vectorisation and does not really/fully implement (most of) the SIMD directives of OpenMP. Put shortly: it does not really care about your OpenMP hints.
One trick that sometimes works is to use omp declare simd and call the SIMD-friendly function, so as to help the compiler figure out how to vectorise the loop. But this does not appear to work here (see my try on Godbolt).
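For reference, a minimal sketch of that declare simd idea, applied to a hypothetical per-pixel helper (the function and parameter names are mine, not from the original code, and there is no guarantee GCC actually vectorises it):
// Sketch of the `omp declare simd` trick (hypothetical helper names).
// The hope is that the compiler generates a vectorised clone of RowTap()
// and uses it when the c loop is annotated with `#pragma omp simd`.
#pragma omp declare simd uniform(mI, rowOffset, vKRow, rowKernelLen) linear(c : 1)
static float RowTap(const float *mI, int rowOffset, int c,
                    const float *vKRow, int rowKernelLen)
{
    float sum = 0.0f;
    for (int k = 0; k < rowKernelLen; k++)
        sum += mI[rowOffset + (c - k + rowKernelLen - 1)] * vKRow[k];
    return sum;
}

// Caller side:
// for (int r = 0; r < numRows; r++)
// {
//     #pragma omp simd
//     for (int c = 0; c < numColsO; c++)
//         mB[c * numRows + r] = RowTap(mI, r * numCols, c, vKRow, rowKernelLen);
// }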
Even if it did care, I would not expect the GCC optimiser to perform this kind of loop transformation (as opposed to ICC, which is known to do it pretty well, sometimes even automatically).
Overall, this is the problem with (OpenMP) compiler hints: in many cases, not implementing anything is completely compliant with the standard, which does not require the implementation to actually generate vectorised code.
If you want stronger guarantees on vectorisation while keeping the code portable, then you should pick one of the following options:
- use SIMD-friendly languages like ISPC (or OpenCL/SYCL, which mainly target accelerators);
- use a Domain Specific Language (DSL) designed for generating fast code for such a task, like Halide;
- use an external C or C++ SIMD library like Google's Highway, XSIMD, etc.;
- use the experimental C++ SIMD parallelism V2 TS (draft available here). This TS is supposed to be integrated into a future C++ standard (probably not C++26 but possibly C++29). C++ TS milestones are available here.
Adapting the code to help the optimiser generate SIMD code, for example by splitting the loop into tiles with a fixed size known at compile time (as suggested by Homer512), increases the probability that the compiler vectorises the code, but this is still far from guaranteed and still brittle (it may fail on another version of the same compiler, and I have often seen that happen). Using OpenMP directives tends to improve the situation, but it is still rather brittle in non-trivial cases. In Clang, this solution can sometimes make the code significantly less efficient due to bad choices made by the SLP vectoriser. Moreover, this solution tends to make the code significantly more complicated in practice, because of the additional loops and temporary buffers for tiling, and because of the duplicated code needed for the last items not fitting in SIMD registers (similar to what we do with intrinsics on SIMD instruction sets that do not support masking, like SSE/AVX/Neon, as opposed to AVX-512/SVE).
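To make the idea concrete, here is a minimal sketch of such a fixed-size tile for the row-filtering loop (my own illustration, not code from this answer; the remainder columns still need the original scalar loop):
// Fixed-size tiling of the c loop; the tile width is known at compile time.
#define TILE 8

for (int r = 0; r < numRows; r++)
{
    int cTiled = numColsO - (numColsO % TILE);
    for (int c0 = 0; c0 < cTiled; c0 += TILE)
    {
        float sum[TILE] = {0.0f};
        for (int k = 0; k < rowKernelLen; k++)
        {
            // The inner loop now has a fixed trip count, which compilers vectorise more easily.
            #pragma omp simd
            for (int j = 0; j < TILE; j++)
                sum[j] += mI[(r * numCols) + (c0 + j - k + rowKernelLen - 1)] * vKRow[k];
        }
        for (int j = 0; j < TILE; j++)
            mB[(c0 + j) * numRows + r] = sum[j];
    }
    // ... scalar tail loop for c in [cTiled, numColsO) ...
}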
There is another thing to consider: the store mO[c * numRowsO + r] = sum; cannot yet be vectorised efficiently on mainstream x86/ARM CPUs when the c-based loop is vectorised. Indeed, the memory stores are strided, not contiguous. x86 has had scatter instructions for such memory access patterns since AVX-512, but they are very inefficient on all CPUs so far (often even slower than scalar stores). AFAIK, on ARM, SVE also supports scatters, but most ARM CPUs do not support SVE yet (some ARM server CPUs do, like the latest Graviton CPUs). Even when they are implemented rather efficiently, such scatter stores are expensive because they operate on many cache lines simultaneously. It is certainly more efficient to write the data contiguously into a temporary buffer and only then transpose it.
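For illustration, a minimal sketch of that contiguous-store approach in plain C (my own; it assumes a second scratch buffer mT of numColsO x numRows floats is available, which is not part of the original interface):
// The row filter stores contiguously (row-major) into mB, then a separate pass
// transposes mB into mT so the column filter can again run along rows.
for (int r = 0; r < numRows; r++)
{
    #pragma omp simd
    for (int c = 0; c < numColsO; c++)
    {
        float sum = 0.0f;
        for (int k = 0; k < rowKernelLen; k++)
            sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
        mB[r * numColsO + c] = sum; // Contiguous store
    }
}

// Separate transpose pass (could be blocked for cache friendliness): mT[c][r] = mB[r][c]
for (int r = 0; r < numRows; r++)
    for (int c = 0; c < numColsO; c++)
        mT[c * numRows + r] = mB[r * numColsO + c];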
Example using ISPC
For CPU code, I would personally pick ISPC, which is often able to do this kind of non-trivial vectorisation pretty well. If you are curious, here is a naive ISPC implementation I quickly wrote (very close to your C code):
void SeparableConvolution2D(const uniform float mI[], uniform float mO[], uniform float mB[],
                            uniform int numRows, uniform int numCols,
                            const uniform float vKRow[], uniform int rowKernelLen,
                            const uniform float vKCol[], uniform int colKernelLen)
{
    uniform int rowKernelRad = rowKernelLen / 2;
    uniform int colKernelRad = colKernelLen / 2;
    uniform int numRowsO = numRows - (colKernelLen - 1);
    uniform int numColsO = numCols - (rowKernelLen - 1);

    // Apply 1D filter along rows
    for (uniform int r = 0; r < numRows; r++)
    {
        foreach (c = 0...numColsO)
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < rowKernelLen; k++)
                sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
            mB[c * numRows + r] = sum;
        }
    }

    // Apply 1D filter along columns
    for (uniform int r = 0; r < numColsO; r++)
    {
        foreach (c = 0...numRowsO)
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < colKernelLen; k++)
                sum += mB[(r * numRows) + (c - k + colKernelLen - 1)] * vKCol[k];
            mO[c * numColsO + r] = sum;
        }
    }
}
This implementation seems to generate the expected SIMD assembly code (see on Godbolt):
; Compute 8 items per iteration using AVX2
.LBB0_7:
movsxd rbx, ebx
vbroadcastss ymm6, dword ptr [r9 + 4*rax]
vfmadd231ps ymm5, ymm6, ymmword ptr [rdi + rbx]
inc rax
add ebx, -4
cmp r12, rax
jne .LBB0_7
Note that a uniform variable can be seen as a scalar shared amongst all SIMD lanes, while varying variables are local to each lane. Variables are varying by default (which can sometimes be harmful for performance). I advise you to read the ISPC manual for more information about this. The programming model is rather close to the one of OpenCL and CUDA.
By the way, ISPC also warns us about the inefficient scatter stores (mentioned above). The code should be optimised to avoid these horrible scatter stores. That being said, doing an efficient portable 2D transposition in ISPC can be quite challenging (as with most alternatives, actually). You can still try a naive transposition like in C (the same code but with uniform added before all variables) and see whether it is fast enough for you.
Note that the above assembly code is clearly sub-optimal because of the latency of the vfmadd231ps instruction (typically 3-4 cycles) and the dependency on the previous iteration. That being said, ISPC does exactly what we asked for: a loop adding numbers in a SIMD-friendly way. Fortunately, ISPC supports a feature called double-pumping, designed to mitigate such latency issues and better use the SIMD units (see on Godbolt):
; Compute 16 items per iteration using AVX2
.LBB0_26:
movsxd r13, r13d
vbroadcastss ymm8, dword ptr [r9 + 4*r8]
vfmadd231ps ymm6, ymm8, ymmword ptr [rdi + r13 + 32]
vfmadd231ps ymm7, ymm8, ymmword ptr [rdi + r13]
inc r8
add r13d, -4
cmp r12, r8
jne .LBB0_26
This is still sub-optimal but twice as good. On CPUs supporting AVX-512, quad-pumping is available, resulting in pretty efficient SIMD assembly code without even changing the source! It should be significantly faster than a scalar implementation, or even than the GCC one (except for huge kernel filters). The code is also easy to read and maintain.
Update:
After discussing with @chtz in the comments, we can operate on a transposed mB (assuming this variable is only meant to be used as a temporary buffer) and change the vectorisation axis in the second filtering step, while also swapping the loops for the sake of performance. The code should look like this:
void SeparableConvolution2D(const uniform float mI[], uniform float mO[], uniform float mB[],
                            uniform int numRows, uniform int numCols,
                            const uniform float vKRow[], uniform int rowKernelLen,
                            const uniform float vKCol[], uniform int colKernelLen)
{
    uniform int numRowsO = numRows - (colKernelLen - 1);
    uniform int numColsO = numCols - (rowKernelLen - 1);

    // Apply 1D filter along rows
    for (uniform int r = 0; r < numRows; r++)
    {
        foreach (c = 0...numColsO)
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < rowKernelLen; k++)
                sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
            mB[r * numCols + c] = sum;
        }
    }

    // Apply 1D filter along columns
    for (uniform int c = 0; c < numRowsO; c++)
    {
        foreach (r = 0...numColsO)
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < colKernelLen; k++)
                sum += mB[((c - k + colKernelLen - 1) * numCols) + r] * vKCol[k];
            mO[c * numColsO + r] = sum;
        }
    }
}
I don't know how to do that portably. Still, when I recently needed an efficient separable convolution, I used the following tactics. Maybe you'll be able to implement the following algorithm in a way that modern compilers can auto-vectorize.
The answer assumes your data is a 2D array stored in row-major memory layout, and the SIMD width is 8 (I had FP32 data, and targeted AVX SIMD).
Before starting the convolution, allocate a buffer for each OS thread. The size of the buffer is input width × 8 floats; the buffer should be aligned to at least the SIMD width in floats, ideally to 64 bytes (a cache line). The hope is that these buffers fit in the fast in-core L2 caches (e.g. my Zen4 CPU has 1MB of L2 cache per core), which deliver very high throughput.
Split the height of the array into panels of height equal to the SIMD width. Dispatch these panels to different OS threads; an easy way to do that is parallel for in OpenMP.
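Here is a minimal sketch of that dispatch (my own illustration, not the answerer's code): the per-panel work is left to a hypothetical ProcessPanel() function implementing the two-step process described next, and mT stands for the transposed temporary output array.
#include <stdlib.h>

#define PANEL_H 8 // SIMD width in FP32 lanes for AVX

// Hypothetical per-panel worker; it also zero-pads the last, partial panel.
void ProcessPanel(const float *mI, float *mT, float *panel, int r0,
                  int numRows, int numCols, const float *vKRow, int rowKernelLen);

void ConvolveRowsPanels(const float *mI, float *mT, int numRows, int numCols,
                        const float *vKRow, int rowKernelLen)
{
    #pragma omp parallel
    {
        // One scratch buffer per thread: input width x 8 floats, 64-byte aligned.
        // aligned_alloc() wants the size rounded up to a multiple of the alignment.
        size_t bytes = ((size_t)numCols * PANEL_H * sizeof(float) + 63) & ~(size_t)63;
        float *panel = aligned_alloc(64, bytes);

        #pragma omp for
        for (int r0 = 0; r0 < numRows; r0 += PANEL_H)
            ProcessPanel(mI, mT, panel, r0, numRows, numCols, vKRow, rowKernelLen);

        free(panel);
    }
}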
The processing of each panel is a two-step process.
1. The first loop loads up to 8 rows from the source 2D array and stores them into the temporary buffer in column-major memory layout. If the panel is the last one of the input array, zero-pad each column in the buffer. An efficient way to transpose with AVX is to load data 4 floats at a time and transpose two 4x4 blocks stored in 4 vectors. But note that the performance of this part is not too important: the majority of the time and bandwidth is spent in the second part because it does many more operations per element, so you might be fine without vectorizing this part at all.
2. The main loop is over the columns of the input; each iteration computes an 8-tall column of the output. Use _mm256_broadcast_ss to load numbers from the convolution kernel; assuming your kernel is relatively small, these numbers stay in the L1D cache. Use _mm256_load_ps to load source numbers from the temporary buffer (aligned full-vector loads) and _mm256_fmadd_ps to multiply and accumulate. After you have processed the complete kernel and obtained the vector with 8 output numbers, store it into the output 2D array in transposed order, which is a single _mm256_storeu_ps because the vector holds a column.
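A minimal sketch of this inner loop with AVX intrinsics (my own reading of the description above; the function and buffer names are mine, and it assumes the panel buffer is 32-byte aligned and the code is compiled with AVX2 and FMA enabled):
#include <immintrin.h>

// `panel` holds 8 input rows in column-major order, i.e. panel[c * 8 .. c * 8 + 7]
// is the 8-tall column c of the panel. `outT` points at the transposed output array,
// already offset to this panel's first row; `outStride` is the row length of the
// transposed output (= numRows of the original image).
static void ConvolvePanel(const float *panel, float *outT, int outStride,
                          int numCols, const float *kernel, int kernelLen)
{
    int numColsO = numCols - (kernelLen - 1); // "Valid" convolution output width
    for (int c = 0; c < numColsO; c++)
    {
        __m256 acc = _mm256_setzero_ps();
        for (int k = 0; k < kernelLen; k++)
        {
            __m256 w = _mm256_broadcast_ss(&kernel[k]);                     // broadcast one kernel tap
            __m256 x = _mm256_load_ps(&panel[(c - k + kernelLen - 1) * 8]); // aligned 8-tall column load
            acc = _mm256_fmadd_ps(x, w, acc);                               // multiply-accumulate
        }
        _mm256_storeu_ps(&outT[c * outStride], acc); // one contiguous (transposed) store
    }
}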
Because the output is transposed, the second pass of the convolution is a call to the same function which implements the first pass. And because the two transposes cancel each other, the output memory layout matches the input.
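In other words, the whole separable convolution reduces to two calls of one hypothetical pass function (a sketch under the assumptions above; the name and signature are mine):
// Hypothetical pass: filters along the rows of `src` (srcRows x srcCols) with `kernel`
// and writes the result transposed into `dstT`, as described above.
void ConvolveRowsTransposed(const float *src, float *dstT, int srcRows, int srcCols,
                            const float *kernel, int kernelLen);

// Two passes; the two transposes cancel, so mO has the same layout as mI.
int numColsO = numCols - (rowKernelLen - 1);
ConvolveRowsTransposed(mI, mT, numRows, numCols, vKRow, rowKernelLen);  // mT: numColsO x numRows
ConvolveRowsTransposed(mT, mO, numColsO, numRows, vKCol, colKernelLen); // mO: numRowsO x numColsO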
A couple more things.
If you are doing a Gaussian blur or similar, start the inner loop of the second step from the outer elements of the kernel and move towards the center of the kernel. The reason is numerical precision: it's beneficial to accumulate smaller numbers first.
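For instance, a minimal sketch of that accumulation order (my own illustration, shown for a scalar inner loop and a symmetric kernel):
// Accumulate the outer (smallest) kernel taps first, moving toward the centre,
// to reduce floating-point rounding error. `src` points at the start of the window.
static float ConvolveOutsideIn(const float *src, const float *kernel, int kernelLen)
{
    float sum = 0.0f;
    for (int k = 0; k < kernelLen / 2; k++)
    {
        sum += src[k] * kernel[k];                                 // left outer tap
        sum += src[kernelLen - 1 - k] * kernel[kernelLen - 1 - k]; // right outer tap
    }
    if (kernelLen & 1)
        sum += src[kernelLen / 2] * kernel[kernelLen / 2];         // centre tap last
    return sum;
}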
You're going to need a large temporary 2D array for the output of the first convolution pass, which is also the input of the second convolution pass. If your input 2D array is W×H, the temporary array needs to be H×W.
If you don't have that many elements in the input, don't bother with multithreading. In that case you just need a single temporary buffer and a normal for loop.
If you need this for 3D arrays, you can still reuse the same function for all 3 passes of the convolution if you transpose cyclically, i.e. the X pass loads source data in the conventional XYZ layout and stores YZX, the Y pass loads YZX and stores ZXY, and the final Z pass loads ZXY and stores XYZ, which matches the input.
As suggested by @JérômeRichard, the Intel Implicit SPMD Program Compiler (ISPC) is the tool to force this kind of vectorization.
I will analyze 2 variations.
In these variations, for simplicity, I will assume each SIMD worker is 4 elements wide.
The original code intention can be expressed in ISPC as follows:
export void SeparableConvolution2D(const uniform float mI[], uniform float mO[], uniform float mB[],
                                   uniform int numRows, uniform int numCols,
                                   const uniform float vKRow[], uniform int rowKernelLen,
                                   const uniform float vKCol[], uniform int colKernelLen)
{
    uniform int rowKernelRad = rowKernelLen / 2;
    uniform int colKernelRad = colKernelLen / 2;
    uniform int numRowsO = numRows - (colKernelLen - 1);
    uniform int numColsO = numCols - (rowKernelLen - 1);

    // Apply 1D filter along rows
    for (uniform int r = 0; r < numRows; r++)
    {
        foreach (c = 0...numColsO)
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < rowKernelLen; k++)
                sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
            mB[c * numRows + r] = sum;
        }
    }

    // Apply 1D filter along columns
    for (uniform int r = 0; r < numColsO; r++)
    {
        foreach (c = 0...numRowsO)
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < colKernelLen; k++)
                sum += mB[(r * numRows) + (c - k + colKernelLen - 1)] * vKCol[k];
            mO[c * numColsO + r] = sum;
        }
    }
}
Geometrically, the first loop, working along each row, splits the work among the SIMD workers.
Each SIMD worker loads data from the array with a contiguous load (though not necessarily an aligned one).
Each SIMD worker broadcasts the multiplication by vKRow; once it has accumulated all the terms of the convolution, it uses a scattered (transposed) store into mB.
Assuming the length of vKRow is 3, each worker performs 3 such loads.
The 2nd loop does the same, just working on mB as its input and mO as its output.
A 2nd variation of the ISPC code is given by:
export void SeparableConvolution2D(const uniform float mI[], uniform float mO[], uniform float mB[],
                                   uniform int numRows, uniform int numCols,
                                   const uniform float vKRow[], uniform int rowKernelLen,
                                   const uniform float vKCol[], uniform int colKernelLen)
{
    uniform int numRowsO = numRows - (colKernelLen - 1);
    uniform int numColsO = numCols - (rowKernelLen - 1);

    // Apply 1D filter along rows
    for (uniform int r = 0; r < numRows; r++)
    {
        foreach (c = 0...numColsO)
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < rowKernelLen; k++)
                sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
            mB[r * numColsO + c] = sum;
        }
    }

    // Apply 1D filter along columns
    foreach (c = 0...numColsO)
    {
        for (uniform int r = 0; r < numRowsO; r++)
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < colKernelLen; k++)
                sum += mB[(r - k + colKernelLen - 1) * numColsO + c] * vKCol[k];
            mO[r * numColsO + c] = sum;
        }
    }
}
Its first loop, along each row, is the same as above, with the only difference that the data is stored contiguously, with no transposition, into mB.
The 2nd loop is a bit different. As it applies the convolution along each column, which is not contiguous in memory, the SIMD works differently: each worker performs a contiguous (aligned) load from each row and multiplies it by the corresponding value of vKCol.
This formulation avoids the scattered store.
I don't know how to analyze the output code, yet in my tests the 2nd variation is a bit faster.
Comments on the question:
- Cris Luengo: -march=native will tell the compiler to use all the functionality of your chip. If you want to target a broader set of chips, you need to specify the specific functionality you want to allow. If you just give -O3 then GCC will likely not use any SIMD instructions, as not all chips support them.
- chtz: ... mB. Besides the point why you think that is necessary, are you even sure your code is correct? Your second loop accesses mO[(numRowsO-1)*numRowsO + numColsO-1] in its last iteration, which does not look correct.
- chtz: ... numRowsO is much larger than numColsO (two times as large should be sufficient). Or maybe you just made some simple copy+paste errors when translating from Matlab to C. You can switch the order of the r- and c-loops of the second pass, to match the data ordering. Generally, I'd say it is much more important that storage goes to contiguous data, which is neither the case for mB[c * numRows + r] nor for mO[c * numRowsO + r] if c is the inner loop counter.