Why does OpenMP nested parallelism produce different output than linear execution?

I’m trying to compare the execution time of edge detection on an image when it runs linearly and when it runs in parallel. The linear version works fine, but the image written by the parallel version has too many white pixels in part of the image. The image below shows what I mean:

Linear execution (left) vs. parallel execution (right)

The left image is the output of the code executed linearly, and the right one is the output of the parallel version. You can see the edges of the buildings in both images, and the bottom part of the right image, close to its border, doesn’t have the same issue as the rest of it.

I extracted the “critical” part of the code that does this task, in the hope that someone knows what may be causing this.

omp_set_nested(1);
#pragma omp parallel
while(col<cols-1) {
    line = 1;
    #pragma omp parallel
    while(line<lines-1) {
        gradient_x = 0;
        gradient_y = 0;
        for(int m = 0; m < mask_size; m++) {
            for(int n = 0; n < mask_size; n++) {
                int np_x = line + (m - 1);
                int np_y = col + (n - 1);

                float v = img(np_y,np_x);

                int mask_index = (m*3) + n;

                gradient_x = gradient_x + (x_mask[mask_index] * v);
                gradient_y = gradient_y + (y_mask[mask_index] * v);
            }
        }
        float gradient_sum = sqrt((gradient_x * gradient_x) + (gradient_y * gradient_y));
        if(gradient_sum >= 255) {
            gradient_sum = 255;
        } else if(gradient_sum <= 0) {
            gradient_sum = 0;
        }
        output(line, col) = gradient_sum;
        #pragma omp critical
        line++;
    }
    #pragma omp critical
    col++;
}

I marked the updates to the line and col variables as critical because they are the ones used for both reading and writing data, and I believe everything else is working properly.

Answer

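Without more context, it is hard to tell. Nonetheless, those two nested parallel regions do not make sense, because you are not distributing work among threads; instead, every thread executes the same code, with possible race conditions on the updates of gradient_x and gradient_y, among other variables. Note also that the critical sections protect only the increments: the reads of line and col in the loop conditions and in the loop body remain unsynchronized (assuming, as the snippet suggests, that both are declared outside the parallel regions and are therefore shared). Start with the following simpler parallel code: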

omp_set_nested(0);
while(col<cols-1) {
    line = 1;
    while(line<lines-1) {
        gradient_x = 0;
        gradient_y = 0;
        #pragma omp parallel for reduction(+:gradient_x,gradient_y)
        for(int m = 0; m < mask_size; m++) {
            for(int n = 0; n < mask_size; n++) {
                int np_x = line + (m - 1);
                int np_y = col + (n - 1);
                float v = img(np_y,np_x);
                int mask_index = (m*3) + n;

                gradient_x = gradient_x + (x_mask[mask_index] * v);
                gradient_y = gradient_y + (y_mask[mask_index] * v);
            }
        }
        float gradient_sum = sqrt((gradient_x * gradient_x) + (gradient_y * gradient_y));
        if(gradient_sum >= 255) {
            gradient_sum = 255;
        } else if(gradient_sum <= 0) {
            gradient_sum = 0;
        }
        output(line, col) = gradient_sum;
        line++;
    }
    col++;
}
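To illustrate the difference between a plain parallel region and a worksharing loop, here is a minimal sketch. The function names runs_in_parallel_region and runs_in_parallel_for are made up for this example and do not come from the question. The first function counts how often the body of a bare "#pragma omp parallel" block runs, the second does the same for "#pragma omp parallel for".

```cpp
#include <cassert>

// Hypothetical helper: the body of a plain parallel region is executed
// by every thread of the team, so it runs once per thread.
int runs_in_parallel_region() {
    int runs = 0;
    #pragma omp parallel
    {
        #pragma omp atomic
        runs++;
    }
    return runs;  // team size (1 if compiled without OpenMP support)
}

// Hypothetical helper: a worksharing loop divides the n iterations among
// the threads, so the body runs exactly n times in total.
int runs_in_parallel_for(int n) {
    assert(n >= 0);
    int runs = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        runs++;
    }
    return runs;  // n, regardless of the number of threads
}
```

With any number of threads, runs_in_parallel_for(100) returns 100, while runs_in_parallel_region() returns the size of the thread team. The while loops in the question behave like the first function: every thread walks the same shared counters instead of getting its own share of the pixels.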

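You can then try parallelizing the two outer loops instead, so that each iteration works on its own local gradient_x and gradient_y: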

#pragma omp parallel for collapse(2)
for(int col = 1; col<cols-1; col++) { // start at 1, like line, so that col - 1 stays inside the image
    for(int line = 1; line<lines-1; line++) {
        float gradient_x = 0;
        float gradient_y = 0;
        for(int m = 0; m < mask_size; m++) {
            for(int n = 0; n < mask_size; n++) {
                int np_x = line + (m - 1);
                int np_y = col + (n - 1);
                float v = img(np_y,np_x);
                int mask_index = (m*3) + n;
                gradient_x = gradient_x + (x_mask[mask_index] * v);
                gradient_y = gradient_y + (y_mask[mask_index] * v);
            }
        }
        float gradient_sum = sqrt((gradient_x * gradient_x) + 
                                  (gradient_y * gradient_y));
        if(gradient_sum >= 255) {
            gradient_sum = 255;
        } else if(gradient_sum <= 0) {
            gradient_sum = 0;
        }
        output(line, col) = gradient_sum;
    }
}

Of course, you still need to check for race conditions in the code that you cropped out of the question.
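One way to check the result is to run the same filter with and without threads and compare the two outputs. The following is only a sketch under several assumptions: the Image struct, sobel and make_step_image are hypothetical stand-ins (the question does not show its image type), pixels are indexed as (line, col) in row-major order, and the masks are the standard 3x3 Sobel kernels, which may differ from the x_mask and y_mask used in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the question's image type: row-major floats.
struct Image {
    int lines, cols;
    std::vector<float> data;
    Image(int l, int c)
        : lines(l), cols(c), data(static_cast<std::size_t>(l) * c, 0.0f) {}
    float& operator()(int line, int col) {
        return data[static_cast<std::size_t>(line) * cols + col];
    }
    float operator()(int line, int col) const {
        return data[static_cast<std::size_t>(line) * cols + col];
    }
};

// Edge detection over the interior pixels. When 'parallel' is false the
// if clause makes the region run on a single thread.
Image sobel(const Image& img, bool parallel) {
    // Assumed masks: the standard Sobel kernels.
    static const float x_mask[9] = {-1, 0, 1, -2, 0, 2, -1, 0, 1};
    static const float y_mask[9] = {-1, -2, -1, 0, 0, 0, 1, 2, 1};
    assert(img.lines >= 3 && img.cols >= 3);
    const int last_line = img.lines - 1;
    const int last_col = img.cols - 1;
    Image out(img.lines, img.cols);

    #pragma omp parallel for collapse(2) if(parallel)
    for (int line = 1; line < last_line; line++) {
        for (int col = 1; col < last_col; col++) {
            // Declared inside the loop body, so each iteration has its own copy.
            float gx = 0.0f, gy = 0.0f;
            for (int m = 0; m < 3; m++) {
                for (int n = 0; n < 3; n++) {
                    float v = img(line + m - 1, col + n - 1);
                    gx += x_mask[m * 3 + n] * v;
                    gy += y_mask[m * 3 + n] * v;
                }
            }
            // sqrt is never negative, so only the upper bound needs clamping.
            float g = std::sqrt(gx * gx + gy * gy);
            out(line, col) = std::min(g, 255.0f);  // each pixel is written once
        }
    }
    return out;
}

// Test image with a vertical step edge: left half 0, right half 255.
Image make_step_image(int lines, int cols) {
    Image img(lines, cols);
    for (int line = 0; line < lines; line++)
        for (int col = cols / 2; col < cols; col++)
            img(line, col) = 255.0f;
    return img;
}
```

Comparing sobel(img, false).data with sobel(img, true).data on the same input should give identical vectors, because every output pixel is computed independently from read-only input. With GCC or Clang, OpenMP is enabled with the -fopenmp flag; without it the pragmas are ignored and both calls run serially.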