I implemented a simple 2d separable convolution code:

void SeparableConvolution2D(const float * restrict mI, float * restrict mO, float * restrict mB, int numRows, int numCols, const float * restrict vKRow, int rowKernelLen, const float * restrict vKCol, int colKernelLen) 
{
    // mI (numRows x numCols) - The input.
    // mO (numRowsO x numColsO) - The output.
    // mB (numColsO x numRows) - The buffer.
    // Intermediate buffer (transposed version of the row filtered image)
    // - Parallelization overhead is too big.
    // - GCC native SIMD is better than OpenMP.

    int rowKernelRad = rowKernelLen / 2; // Radius
    int colKernelRad = colKernelLen / 2; // Radius
    
    // Output dimensions for "Valid" convolution
    int numRowsO = numRows - (colKernelLen - 1);
    int numColsO = numCols - (rowKernelLen - 1);
    
    // Apply 1D filter along rows
    for (int r = 0; r < numRows; r++) 
    {
        #pragma omp simd
        for (int c = 0; c < numColsO; c++)
        {
            float sum = 0.0f;
            for (int k = 0; k < rowKernelLen; k++) 
            {
                sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
            }
            mB[c * numRows + r] = sum; // Store row result
        }
    }
    
    // Apply 1D filter along columns
    // The data in `mB` is transposed, hence again working along the rows
    for (int r = 0; r < numColsO; r++) 
    {
        #pragma omp simd
        for (int c = 0; c < numRowsO; c++) 
        {
            float sum = 0.0f;
            for (int k = 0; k < colKernelLen; k++) 
            {
                sum += mB[(r * numRows) + (c - k + colKernelLen - 1)] * vKCol[k];
            }
            mO[c * numColsO + r] = sum; // Store final result
        }
    }

}

I want it to be SIMD efficient, yet portable.
Hence I don't want to use any SIMD intrinsics, nor add any dependency to the code.
I tried using OpenMP SIMD, though not with much success.

The objective is to vectorize the loop over c.
Yet it seems the compiler only vectorizes the calculation of the sum variable.

Is there a way to achieve vectorization on the c loop?
Any way to hint the compiler? Any OpenMP SIMD magic?

The target compiler is GCC.

Optional Transformation

For instance, one transformation I thought of is explicit unrolling.
The first loop would become:

// Assumed for this sketch: FLOAT_TYPE is float and
// rowKernelLen1 equals rowKernelLen - 1 (the highest kernel offset).
for (int r = 0; r < numRows; r++) 
    {
        const FLOAT_TYPE * vI = mI + (r * numCols); // Pointer to the current row
        // #pragma omp simd
        #pragma GCC ivdep
        #pragma vector always // ICC-style hint; GCC ignores this pragma
        for (int c = 0; c + 4 < numColsO; c += 4) 
        {
            // Load 4 pixels from the input image
            const FLOAT_TYPE * vI0 = &vI[c + rowKernelLen1];
            const FLOAT_TYPE * vI1 = &vI[c + rowKernelLen1 + 1];
            const FLOAT_TYPE * vI2 = &vI[c + rowKernelLen1 + 2];
            const FLOAT_TYPE * vI3 = &vI[c + rowKernelLen1 + 3];

            // Apply the kernel to each pixel
            FLOAT_TYPE sum0 = 0.0f;
            FLOAT_TYPE sum1 = 0.0f;
            FLOAT_TYPE sum2 = 0.0f;
            FLOAT_TYPE sum3 = 0.0f;

            for (int k = 0; k < rowKernelLen; k++) 
            {
                sum0 += vI0[- k] * vKRow[k];
                sum1 += vI1[- k] * vKRow[k];
                sum2 += vI2[- k] * vKRow[k];
                sum3 += vI3[- k] * vKRow[k];
            }
            
            mB[c * numRows + r]       = sum0; // Store row result
            mB[(c + 1) * numRows + r] = sum1; // Store row result
            mB[(c + 2) * numRows + r] = sum2; // Store row result
            mB[(c + 3) * numRows + r] = sum3; // Store row result
        }
    }

The motivation is to "hint" the compiler that it should broadcast the multiplication by the kernel value, vKRow[k], over multiple image pixels. Will that achieve this?

Update

I marked @JérômeRichard's answer as accepted, though the exact code I use is the one in my answer. Still, @JérômeRichard's answer is what pointed me to ISPC, which is the tool to force the correct SIMD in this case.

Asked Mar 30 at 6:59 by Royi (edited)
  • 2 You can't do this with a variable loop counter for the inner loop. If you make a specialized version with a fixed loop count and place the simd pragma on the outer loop, you may have more luck. For convolutions I'd advise to make specialized versions via templates or macro tricks for typical small window sizes like 3x3 and 5x5; then use a dispatch function to choose the implementation at run time; with a dynamic fallback for large windows – Homer512 Commented Mar 30 at 11:36
  • @Homer512, I am not after the very last bit of performance. Just to optimize the SIMD calculation for the case above. Could you show a simple example of what you mean as an answer? Thank You. – Royi Commented Mar 30 at 11:51
  • What compiler flags are you using? -march=native will tell the compiler to use all the functionality of your chip. If you want to target a broader set of chips, you need to specify specific functionality you want to allow. If you just give -O3 then GCC will likely not use any SIMD instructions, as not all chips support them. – Cris Luengo Commented Mar 30 at 13:24
  • 1 I already made a comment in Jérôme's answer about transposing mB. Besides the point why you think that is necessary, are you even sure your code is correct? Your second loop accesses mO[(numRowsO-1)*numRowsO + numColsO-1] in its last iteration, which does not look correct. – chtz Commented Mar 31 at 12:27
  • 1 @Royi can you test this again with an image where numRowsO is much larger than numColsO (two times as large should be sufficient). Or maybe you just made some simple copy+paste errors when translating from Matlab to C. You can switch the order of the r and c-loop of the second pass, to match the data ordering. Generally, I'd say it is much more important that storage goes to contiguous data, which is neither the case for mB[c * numRows + r] nor for mO[c * numRowsO + r] if c is the inner loop counter. – chtz Commented Mar 31 at 13:35

3 Answers


The objective is to vectorize the loop over c.

This is the standard way of doing that in OpenMP:

    // Apply 1D filter along rows
    for (int r = 0; r < numRows; r++) 
    {
        #pragma omp simd
        for (int c = 0; c < numColsO; c++)
        {
            float sum = 0.0f;
            for (int k = 0; k < rowKernelLen; k++) 
                sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
            mB[c * numRows + r] = sum;
        }
    }

In practice though, most compilers cannot vectorise this loop because it is currently too complicated for them. More specifically, this is because of the inner loop with a runtime-defined number of iterations, as pointed out by Homer512 in the comments. For example, ICC 2021.10 succeeds in vectorising the code correctly (i.e. as requested); ICX 2024.0 manages to vectorise it, but apparently not correctly, as the warning it raises indicates; GCC 14.2 and Clang 20.1 fail to (efficiently) vectorise the code in this case. See on Godbolt.

Yet it seems the compiler only vectorize the calculation of the sum variable.

This is because, AFAIK, GCC internally relies on auto-vectorisation and does not really/fully implement (most of) the OpenMP SIMD directives. Put shortly: it does not really care about your OpenMP hints.

One cheat is sometimes to use omp declare simd and call the SIMD-friendly function so as to help the compiler know how to vectorise the loop. But this does not appear to work here (see my attempt on Godbolt).
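
For reference, here is a minimal sketch of what such an attempt can look like (the helper name RowConvOnePixel and the exact clauses are my own illustration, not the code from the Godbolt link):

// Hypothetical helper: computes one output pixel of the row filter.
// The clauses describe how each argument varies across SIMD lanes.
#pragma omp declare simd uniform(pRow, vKRow, rowKernelLen) linear(c : 1) notinbranch
static inline float RowConvOnePixel(const float *pRow, int c,
                                    const float *vKRow, int rowKernelLen)
{
    float sum = 0.0f;
    for (int k = 0; k < rowKernelLen; k++)
        sum += pRow[c - k + rowKernelLen - 1] * vKRow[k];
    return sum;
}

// Call site: the hope is that the compiler uses the SIMD clone of the helper here.
// for (int r = 0; r < numRows; r++)
// {
//     #pragma omp simd
//     for (int c = 0; c < numColsO; c++)
//         mB[c * numRows + r] = RowConvOnePixel(mI + r * numCols, c, vKRow, rowKernelLen);
// }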

Even if it did care, I would not expect the GCC optimiser to do this kind of loop transformation (as opposed to ICC, which is known to do it pretty well, sometimes even automatically).

Overall, this is the problem with (OpenMP) compiler hints: in many cases, not implementing anything is completely compliant with the standard, which does not require the implementation to actually generate vectorised code.

If you want stronger guarantees on vectorisation while keeping portable code, then you should pick one of the following options:

  • use SIMD-friendly languages like ISPC (or OpenCL/SYCL, mainly targeting accelerators);
  • use a Domain Specific Language (DSL) designed for generating fast code for such a task, like Halide;
  • use an external C or C++ SIMD library like Google's Highway, XSIMD, etc.;
  • use the experimental C++ SIMD parallelism V2 TS (draft available here). This TS is supposed to be integrated into a future C++ standard (probably not C++26 but possibly C++29). C++ TS milestones are available here.

Adapting the code to help the optimiser generate SIMD code, e.g. splitting the loop into tiles with a fixed size known at compile time (as suggested by Homer512), increases the probability that the compiler vectorises the code, but this is still far from guaranteed and still brittle (it may fail on another version of the same compiler, and I have often seen that). Using OpenMP directives tends to improve the situation, but it is still rather brittle in non-trivial cases. In Clang, this solution can sometimes make the code significantly less efficient due to bad choices made by the SLP vectoriser. Moreover, it tends to make the code significantly more complicated in practice, because of the additional loops and temporary buffers needed for tiling, and also because of the duplicated code handling the last items not fitting in SIMD registers (similar to what we do with intrinsics on SIMD instruction sets that do not support masking, like SSE/AVX/Neon, as opposed to AVX-512/SVE).
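
For illustration, here is a minimal sketch of that tiling idea applied to the row-filter loop (the tile width of 8 and the scalar remainder loop are my own choices, not code from the question or from Homer512):

enum { TILE = 8 }; // Fixed tile width known at compile time

// Row filter with a fixed-width inner tile, hoping the compiler maps the
// t-loop to one SIMD register and broadcasts vKRow[k] over it.
for (int r = 0; r < numRows; r++)
{
    const float * restrict vI = mI + r * numCols;
    int c = 0;
    for (; c + TILE <= numColsO; c += TILE)
    {
        float sum[TILE] = {0.0f};
        for (int k = 0; k < rowKernelLen; k++)
        {
            const float w = vKRow[k];      // Broadcast candidate
            #pragma omp simd
            for (int t = 0; t < TILE; t++) // Fixed trip count: easy to vectorise
                sum[t] += vI[c + t - k + rowKernelLen - 1] * w;
        }
        for (int t = 0; t < TILE; t++)
            mB[(c + t) * numRows + r] = sum[t]; // Still a strided (scatter-like) store
    }
    // Scalar remainder for the last columns not filling a whole tile
    for (; c < numColsO; c++)
    {
        float s = 0.0f;
        for (int k = 0; k < rowKernelLen; k++)
            s += vI[c - k + rowKernelLen - 1] * vKRow[k];
        mB[c * numRows + r] = s;
    }
}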

There is another thing to consider: mO[c * numRowsO + r] = sum; cannot yet be vectorised efficiently on all mainstream x86/ARM CPUs when the c-based loop is vectorised. Indeed, the memory stores are strided, not contiguous. x86 has had a scatter instruction for such a memory access pattern since AVX-512, but it is very inefficient on all CPUs so far (often even slower than scalar stores). AFAIK, on ARM, SVE also supports this, but most ARM CPUs do not support SVE yet (some ARM server CPUs do, like the latest Graviton CPUs). Even when it is implemented rather efficiently, such scatter stores are expensive because they operate on a lot of cache lines simultaneously. It is certainly more efficient to write the data contiguously into a temporary buffer and only then transpose it.
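
To make that last point concrete, here is a minimal sketch (my own illustration, not code from the question or the linked pages) of a simple cache-blocked transpose that could be applied after writing the row-filter output contiguously:

#include <stddef.h>

enum { BLK = 32 }; // Transpose block size (tunable)

// Cache-blocked transpose: dst (cols x rows) <- src (rows x cols), both row-major.
// src would be the contiguously-written row-filter output, dst the transposed buffer.
static void TransposeBlocked(const float * restrict src, float * restrict dst,
                             int rows, int cols)
{
    for (int rb = 0; rb < rows; rb += BLK)
        for (int cb = 0; cb < cols; cb += BLK)
        {
            const int rEnd = (rb + BLK < rows) ? rb + BLK : rows;
            const int cEnd = (cb + BLK < cols) ? cb + BLK : cols;
            for (int r = rb; r < rEnd; r++)
                for (int c = cb; c < cEnd; c++)
                    dst[(size_t)c * rows + r] = src[(size_t)r * cols + c];
        }
}

This way the filter loops perform only contiguous vector stores, and the strided accesses are confined to the transpose, which touches each cache line of a block only once.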


Example using ISPC

For CPU code, I would personally pick ISPC, which is often able to do this kind of non-trivial vectorisation pretty well. If you are curious, here is a naive ISPC implementation I quickly wrote (very close to your C code):

void SeparableConvolution2D(const uniform float mI[], uniform float mO[], uniform float mB[], 
                            uniform int numRows, uniform int numCols, 
                            const uniform float vKRow[], uniform int rowKernelLen, 
                            const uniform float vKCol[], uniform int colKernelLen) 
{
    uniform int rowKernelRad = rowKernelLen / 2;
    uniform int colKernelRad = colKernelLen / 2;

    uniform int numRowsO = numRows - (colKernelLen - 1);
    uniform int numColsO = numCols - (rowKernelLen - 1);
    
    // Apply 1D filter along rows
    for (uniform int r = 0; r < numRows; r++) 
    {
        foreach (c = 0...numColsO)
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < rowKernelLen; k++) 
                sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
            mB[c * numRows + r] = sum;
        }
    }
    
    // Apply 1D filter along columns
    for (uniform int r = 0; r < numColsO; r++) 
    {
        foreach (c = 0...numRowsO)
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < colKernelLen; k++) 
                sum += mB[(r * numRows) + (c - k + colKernelLen - 1)] * vKCol[k];
            mO[c * numColsO + r] = sum;
        }
    }
}

This implementation seems to generate the expected SIMD assembly code (see on Godbolt):

; Compute 8 items per iteration using AVX2
.LBB0_7:
        movsxd  rbx, ebx
        vbroadcastss    ymm6, dword ptr [r9 + 4*rax]
        vfmadd231ps     ymm5, ymm6, ymmword ptr [rdi + rbx]
        inc     rax
        add     ebx, -4
        cmp     r12, rax
        jne     .LBB0_7

Note that uniform variables can be seen as scalars shared amongst all SIMD lanes, while varying variables are local to each lane. Variables are varying by default (which can sometimes be harmful for performance). I advise you to read the ISPC manual for more information about this. The programming model is rather close to the one of OpenCL and CUDA.

By the way, ISPC also warns us about the inefficient scatter stores (mentioned above). The code should be optimised to avoid these horrible scatter stores. That being said, doing an efficient portable 2D transposition in ISPC can be quite challenging (like with most alternatives, actually). You can still try a naive transposition like in C (same code but with uniform added before all variables) and see if it is fast enough for you.

Note the above assembly code is clearly sub-optimal because of the latency of the vfmadd231ps instruction (typically 3-4 cycles) and the dependency on the previous iteration. That being said, ISPC does exactly what we asked for: a loop adding numbers in a SIMD-friendly way. Fortunately, ISPC supports a feature called double-pumping, designed to mitigate such latency issues and make better use of the SIMD units (see on Godbolt):

; Compute 16 items per iteration using AVX2
.LBB0_26:
        movsxd  r13, r13d
        vbroadcastss    ymm8, dword ptr [r9 + 4*r8]
        vfmadd231ps     ymm6, ymm8, ymmword ptr [rdi + r13 + 32]
        vfmadd231ps     ymm7, ymm8, ymmword ptr [rdi + r13]
        inc     r8
        add     r13d, -4
        cmp     r12, r8
        jne     .LBB0_26

This is still sub-optimal, but twice as good. On CPUs supporting AVX-512, quad-pumping is available, resulting in pretty efficient SIMD assembly without even changing the code! It should be significantly faster than a scalar implementation, or even the one produced by GCC (except for huge filter kernels). The code is also easy to read and maintain.


Update:

After discussing with @chtz in the comments, we can operate on a transposed mB (assuming this variable is only meant to be used as a temporary buffer) and change the vectorisation axis in the second filtering step, while also swapping the loops for the sake of performance. The code should look like this:

void SeparableConvolution2D(const uniform float mI[], uniform float mO[], uniform float mB[], 
                            uniform int numRows, uniform int numCols, 
                            const uniform float vKRow[], uniform int rowKernelLen, 
                            const uniform float vKCol[], uniform int colKernelLen) 
{
    uniform int numRowsO = numRows - (colKernelLen - 1);
    uniform int numColsO = numCols - (rowKernelLen - 1);
    
    // Apply 1D filter along rows
    for (uniform int r = 0; r < numRows; r++) 
    {
        foreach (c = 0...numColsO)
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < rowKernelLen; k++) 
                sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
            mB[r * numCols + c] = sum;
        }
    }
    
    // Apply 1D filter along columns
    for (uniform int c = 0; c < numRowsO; c++) 
    {
        foreach (r = 0...numColsO)
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < colKernelLen; k++) 
                sum += mB[((c - k + colKernelLen - 1) * numCols) + r] * vKCol[k];
            mO[c * numColsO + r] = sum;
        }
    }
}

I don't know how to do that portably. Still, when I recently needed an efficient separable convolution, I used the following tactic. Maybe you'll be able to implement the following algorithm in a way modern compilers can auto-vectorize.

The answer assumes your data is a 2D array stored in row-major memory layout, and the SIMD width is 8 (I had FP32 data, and targeted AVX SIMD).

Before starting the convolution, allocate a buffer for each OS thread. The size of the buffer = input width × 8; the buffer should be aligned by at least SIMD width floats, ideally by 64 bytes (a cache line). The hope is that these buffers fit in the fast in-core L2 caches (e.g. my Zen4 CPU has 1MB of L2 cache per core), which deliver very high throughput.

Split the height of the array into panels whose height equals the SIMD width. Dispatch these panels to different OS threads; an easy way to do that is a parallel for in OpenMP (see the sketch below).
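
A minimal sketch of that outer structure, under my own assumptions (the 64-byte alignment, aligned_alloc, and the ProcessPanel helper name are illustrative, not part of the answer):

#include <stdlib.h>

enum { SIMD_WIDTH = 8 }; // FP32 lanes in an AVX vector

// Hypothetical per-panel worker: transposes up to SIMD_WIDTH rows into buf
// (column-major), then runs the row filter and stores the transposed result.
void ProcessPanel(const float *src, float *dstT, float *buf,
                  int width, int height, int panelRow, int panelRows,
                  const float *kernel, int kernelLen);

void FilterRowsTransposed(const float *src, float *dstT, int width, int height,
                          const float *kernel, int kernelLen)
{
    #pragma omp parallel
    {
        // One scratch buffer per thread: width x SIMD_WIDTH floats, 64-byte aligned.
        size_t bytes = (size_t)width * SIMD_WIDTH * sizeof(float);
        bytes = (bytes + 63) & ~(size_t)63; // Round the size up to a multiple of the alignment
        float *buf = aligned_alloc(64, bytes);

        #pragma omp for
        for (int r = 0; r < height; r += SIMD_WIDTH)
        {
            int panelRows = (r + SIMD_WIDTH <= height) ? SIMD_WIDTH : (height - r);
            ProcessPanel(src, dstT, buf, width, height, r, panelRows, kernel, kernelLen);
        }

        free(buf);
    }
}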

Processing each panel is a two-step process.

1. The first loop loads up to 8 rows from the source 2D array and stores them into the temporary buffer with a column-major memory layout. If the panel is the last one of the input array, zero-pad each column in the output buffer. An efficient way to transpose with AVX is to load the data 4 floats at a time and transpose two 4x4 blocks stored in 4 vectors. But note the performance of this part is not too important: the majority of the time and bandwidth is spent in the second part, because there are many more operations per element, so you might be fine without vectorizing this part at all.

2. The main loop is over the columns of the input; each iteration computes an 8-tall column of the output. Use _mm256_broadcast_ss to load numbers from the convolution kernel; assuming your kernel is relatively small, those numbers should stay in the L1D cache. Use _mm256_load_ps to load source numbers from the temporary buffer (aligned full-vector loads) and _mm256_fmadd_ps to multiply and accumulate. After you have processed the complete kernel and obtained the vector with 8 output numbers, store it into the output 2D array in transposed order, which is a single _mm256_storeu_ps because the vector holds a column (see the sketch after this list).
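
A minimal sketch of that main loop, under the stated assumptions (buf holds the panel in column-major order, so source column c starts at buf + 8 * c; the function name, the "valid" indexing, and the dstT layout are my own illustration):

#include <immintrin.h>
#include <stddef.h>

// Second step for one panel: for each output column, accumulate the 1D kernel
// over 8 rows at once and store the resulting column into the transposed output.
// buf:  column-major panel buffer, 32-byte aligned, 8 floats per source column
// dstT: transposed output (widthO x height, row-major); this panel fills
//       columns panelRow .. panelRow+7 of every dstT row
static void PanelMainLoop(const float *buf, float *dstT,
                          int width, int height, int panelRow,
                          const float *kernel, int kernelLen)
{
    const int widthO = width - (kernelLen - 1); // "Valid" output width
    for (int c = 0; c < widthO; c++)
    {
        __m256 acc = _mm256_setzero_ps();
        for (int k = 0; k < kernelLen; k++)
        {
            __m256 w   = _mm256_broadcast_ss(&kernel[k]);                   // Broadcast one kernel tap
            __m256 src = _mm256_load_ps(buf + 8 * (c + kernelLen - 1 - k)); // Aligned column load
            acc = _mm256_fmadd_ps(src, w, acc);                             // acc += src * w
        }
        // The 8 results form one output column of this panel, which is one
        // contiguous run of the transposed output: a single unaligned store.
        _mm256_storeu_ps(dstT + (size_t)c * height + panelRow, acc);
    }
}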

Because the output is transposed, the second pass of the convolution is a call to the same function that implements the first pass. And because the two transposes cancel each other, the output memory layout will match the input.
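
If it helps, here is a hedged sketch of what that two-call driver could look like (ConvolveRowsTransposed is a hypothetical routine implementing the single pass described above; the parameter names mirror the question):

// Hypothetical single-pass routine: row-filters src (rows x cols, row-major)
// with kernel and writes the "valid" result transposed into dstT (colsO x rows).
void ConvolveRowsTransposed(const float *src, float *dstT,
                            int rows, int cols,
                            const float *kernel, int kernelLen);

void SeparableConvolution2D(const float *mI, float *mO, float *mB,
                            int numRows, int numCols,
                            const float *vKRow, int rowKernelLen,
                            const float *vKCol, int colKernelLen)
{
    int numColsO = numCols - (rowKernelLen - 1);

    // Pass 1: filter along the rows of mI, result lands transposed in mB (numColsO x numRows).
    ConvolveRowsTransposed(mI, mB, numRows, numCols, vKRow, rowKernelLen);

    // Pass 2: same routine on mB; the second transpose cancels the first,
    // so mO (numRowsO x numColsO) ends up in the same row-major layout as mI.
    ConvolveRowsTransposed(mB, mO, numColsO, numRows, vKCol, colKernelLen);
}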

Couple more things.

If you are doing Gaussian blur or similar, start the inner loop in the second step from the outer elements of the kernel, and move towards the center of the kernel. The reason is numerical precision: it’s beneficial to accumulate smaller numbers first.

You're going to need a large temporary 2D array for the output of the first convolution pass, which is the same array serving as input of the second convolution pass. If your input 2D array is W×H, the temporary array needs to be H×W.

If you don’t have that many elements in the input, don’t bother with multithreading. In this case you just need a single temporary buffer, and a normal for loop.

If you need that for 3D arrays, you can still reuse the same function for all 3 passes of the convolution if you transpose cyclically i.e. the X pass of the convolution loads source data in the conventional XYZ layout and stores YZX, Y pass loads YZX and stores ZXY, the final Z pass loads ZXY and stores XYZ which matches the input.

As suggested by @JérômeRichard, the Intel Implicit SPMD Program Compiler (ISPC) is the tool to force this kind of SIMD code generation.

I will analyze 2 variations.
In these variations, for simplicity, I will assume each SIMD worker is 4 elements wide.

The original code's intention can be expressed in ISPC as follows:

export void SeparableConvolution2D(const uniform float mI[], uniform float mO[], uniform float mB[], 
                            uniform int numRows, uniform int numCols, 
                            const uniform float vKRow[], uniform int rowKernelLen, 
                            const uniform float vKCol[], uniform int colKernelLen) 
{
    uniform int rowKernelRad = rowKernelLen / 2;
    uniform int colKernelRad = colKernelLen / 2;
    
    uniform int numRowsO = numRows - (colKernelLen - 1);
    uniform int numColsO = numCols - (rowKernelLen - 1);
    
    // Apply 1D filter along rows
    for (uniform int r = 0; r < numRows; r++) 
    {
        foreach (c = 0...numColsO)
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < rowKernelLen; k++)
                sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
            mB[c * numRows + r] = sum;
        }
    }
    
    // Apply 1D filter along columns
    for (uniform int r = 0; r < numColsO; r++) 
    {
        foreach (c = 0...numRowsO)
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < colKernelLen; k++) 
                sum += mB[(r * numRows) + (c - k + colKernelLen - 1)] * vKCol[k];
            mO[c * numColsO + r] = sum;
        }
    }

}

Geometrically, the first loop works along each row and splits the work among the SIMD workers.
Each SIMD worker loads data from the array with a contiguous load (though not necessarily an aligned one).
Each SIMD worker multiplies by a broadcast value of vKRow, and once all the summations of the convolution are applied, it uses a scattered (transposed) store into mB.
Assuming the length of vKRow is 3, each worker loads 3 times.

The 2nd loop does the same, just working on mB as its input and mO as the output.

A 2nd variation of the ISPC code is given by:

export void SeparableConvolution2D(const uniform float mI[], uniform float mO[], uniform float mB[], 
                            uniform int numRows, uniform int numCols, 
                            const uniform float vKRow[], uniform int rowKernelLen, 
                            const uniform float vKCol[], uniform int colKernelLen) 
{
    uniform int numRowsO = numRows - (colKernelLen - 1);
    uniform int numColsO = numCols - (rowKernelLen - 1);
    
    // Apply 1D filter along rows
    for (uniform int r = 0; r < numRows; r++) 
    {
        foreach (c = 0...numColsO)
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < rowKernelLen; k++) 
                sum += mI[(r * numCols) + (c - k + rowKernelLen - 1)] * vKRow[k];
            mB[r * numColsO + c] = sum;
        }
    }
    
    // Apply 1D filter along columns
    foreach (c = 0...numColsO)
    {
        for (uniform int r = 0; r < numRowsO; r++) 
        {
            float sum = 0.0f;
            #pragma nounroll
            for (uniform int k = 0; k < colKernelLen; k++) 
                sum += mB[(r - k + colKernelLen - 1) * numColsO + c] * vKCol[k];
            mO[r * numColsO + c] = sum;
        }
    }
}

Its first loop, along each row, is the same as above, with the only difference that the data is stored contiguously in mB, with no transposition.

The 2nd loop is a bit different. Since it applies the convolution along each column, which is not contiguous, the SIMD works a bit differently.
In this case each worker performs an aligned, contiguous load from each row and multiplies it by the corresponding value of vKCol.

This formulation avoids scattered stores.

I don't know how to analyze the output code, yet in my tests the 2nd variation is a bit faster.
