OpenMP performance optimization with a large array (matrix)

I am fairly new to OpenMP, sorry if the question seems redundant. Here is my sequential code, which applies do_something() to every element in a row and saves the result to the next row:

    for (int i = column; i < row * column; i++)
        A[i] = do_something(A[i - column]);

I tried to parallelize it with a simple parallel for:

    for (int i = 1; i < row; i++)
    {
        // I put the column iteration in the inner loop
        // because of cache misses, and I tried to schedule the
        // same thread on the same cache line to avoid false
        // sharing
        #pragma omp parallel for schedule(static, cacheline_size)
        for (int j = 0; j < column; j++)
        {
            A[i * column + j] = do_something(A[(i - 1) * column + j]);
        }
    }

However, I am only getting around a 1.5x speedup per thread, and the speedup caps at about 2.5x with 4 threads (I have enough cores).

I attempted to put the column loop outside, but that is even slower than the sequential code. I suspect the issue is the thread creation in each iteration of the loop. Is there a way to improve this?

Edit: I am using g++ on Ubuntu 20.04

Answer

Parallelizing the inner loop over j is problematic. Starting the thread team and synchronizing at the end of the parallel region is quite costly, especially for a large number of threads, and here that cost is paid once per row.
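If the loops are kept in this order, one way to reduce that overhead is to create the thread team once, outside the row loop, and keep only the worksharing construct inside. A rough sketch, using the same A, row, column and do_something as in the question:

    // Sketch: the team is created once; the implicit barrier at the end of
    // the "omp for" still keeps the rows processed in order.
    #pragma omp parallel
    for (int i = 1; i < row; i++)
    {
        #pragma omp for schedule(static)
        for (int j = 0; j < column; j++)
            A[i * column + j] = do_something(A[(i - 1) * column + j]);
    }

This removes the repeated team startup, but every row still ends with a barrier, so it does not address the locality problem discussed next.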

Moving the parallelized loop outside breaks locality of reference. The classic solution to this problem is tiling: split the problematic loop in two, one stepping by cacheline_size over the whole range and one stepping by 1 within each tile of cacheline_size elements. The outer (tiled) loop is the one to parallelize, so it is moved outside, while the inner loop exploits locality of reference and stays inside.

    #pragma omp parallel for
    for (int j0 = 0; j0 < column; j0 += cacheline_size)
    {
        // std::min from <algorithm>; clamp the tile end to the last column
        int j1 = std::min(j0 + cacheline_size, column);
        for (int i = 1; i < row; i++)
        {
            for (int j = j0; j < j1; ++j)
                A[i * column + j] = do_something(A[(i - 1) * column + j]);
        }
    }

Selecting the best tile size (the step of the j0 loop) usually requires some experimentation.
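As a starting point, cacheline_size is often derived from the hardware cache line (commonly 64 bytes on x86) and then tuned from there; for example, assuming A holds float:

    // Hypothetical starting value; benchmark multiples (1x, 2x, 4x, ...) too,
    // since larger tiles amortize loop overhead at the cost of cache footprint.
    const int cacheline_size = 64 / sizeof(float);  // 16 elements per 64-byte line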