Calling parallel C++ code in Python using Pybind11

I have C++ code that runs in parallel with OpenMP and performs some long calculations. This part works great.

Now, I’m using Python to make a GUI around this code, so I’d like to call my C++ code from inside my Python program. For that, I use Pybind11 (but I guess I could use something else if needed).

The problem is that when called from Python, my C++ code runs in serial with only one thread/CPU.

I tried (in two ways) to apply what the pybind11 documentation says about releasing the GIL, but it does not seem to work at all.

My binding looks like this:

#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include "../cpp/include/myHeader.hpp"
namespace py = pybind11;

PYBIND11_MODULE(my_module, m) {
    m.def("testFunction", &testFunction, py::call_guard<py::gil_scoped_release>());

    m.def("testFunction2", [](inputType input) -> outputType {
        /* Release GIL before calling into (potentially long-running) C++ code */
        py::gil_scoped_release release;
        outputType output =  testFunction(input);
        py::gil_scoped_acquire acquire;

        return output;
    });
}

Problem: This still does not work; only one thread is used (I verified this by printing omp_get_num_threads() inside an omp parallel region).

Question: What am I doing wrong? What do I need to do to be able to use parallel C++ code inside Python?

Disclaimer: I must admit I don’t really understand the GIL thing, particularly in my case where I do not use Python inside my C++ code, which is really “independent” in theory. I just want to be able to use it in another (Python) code.

Have a great day.

EDIT: I have solved my problem thanks to pptaszni’s answer. The GIL handling is not needed at all; I had misunderstood the documentation. pptaszni’s code worked, and the actual problem was in my CMake file. Thank you.
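Since the cause turned out to be the build configuration rather than the GIL, one quick way to tell the two apart is to ask the compiled code whether OpenMP was enabled at all. The following is only a minimal sketch: the helper names openmpVersion, countOpenMpThreads and reportOpenMp are hypothetical and do not come from the code above. It relies on the standard _OPENMP macro, which the compiler defines when OpenMP is switched on (for example with -fopenmp on GCC).

```cpp
#include <iostream>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: returns the value of the _OPENMP macro (a yyyymm date),
// or 0 if this file was compiled without OpenMP support.
int openmpVersion()
{
#ifdef _OPENMP
  return _OPENMP;
#else
  return 0;
#endif
}

// Hypothetical helper: number of threads actually used by a parallel region.
// Without OpenMP support the function simply reports 1.
int countOpenMpThreads()
{
  int nthreads = 1;
#ifdef _OPENMP
  #pragma omp parallel
  {
    // omp_get_num_threads() is only meaningful inside a parallel region;
    // outside of one it always returns 1.
    #pragma omp single
    nthreads = omp_get_num_threads();
  }
#endif
  return nthreads;
}

// Hypothetical helper: prints both values for a quick manual check.
void reportOpenMp()
{
  std::cout << "_OPENMP: " << openmpVersion()
            << ", threads: " << countOpenMpThreads() << std::endl;
}
```

If a function like openmpVersion() were exposed through m.def in the same way as testFunction and returned 0 when called from Python, that would suggest the extension was built without OpenMP, which points at the compiler and linker flags in the build files rather than at the binding code.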

Answer

It’s not really a good answer (too long for a comment, though), because I could not reproduce your problem, but maybe you can isolate the issue in your code by trying this example, which works for me:

C++ code:

#include "OpenMpExample.hpp"

#include <algorithm>
#include <iostream>
#include <numeric>  // std::accumulate
#include <random>
#include <vector>

#include <omp.h>

constexpr int DATA_SIZE = 10000000;

std::vector<int> testFunction()
{
  int nthreads = 0, tid = 0;
  std::vector<std::vector<int> > data;
  std::vector<int> results;
  std::random_device rnd_device;

  // First region: let thread 0 record the team size and size the containers.
  #pragma omp parallel private(tid)
  {
    tid = omp_get_thread_num();
    if (tid == 0)
    {
      nthreads = omp_get_num_threads();
      std::cout << "Num threads: " << nthreads << std::endl;
      data.resize(nthreads);
      results.resize(nthreads);
    }
  }

  // std::mt19937 is not thread-safe, so draw one seed per thread up front
  // and let every thread build its own engine and distribution.
  std::vector<unsigned int> seeds(nthreads);
  for (auto& s : seeds) s = rnd_device();

  #pragma omp parallel private(tid) shared(data, seeds)
  {
    tid = omp_get_thread_num();
    std::mt19937 engine {seeds[tid]};
    std::uniform_int_distribution<int> dist {-10, 10};
    data[tid].resize(DATA_SIZE);
    std::generate(data[tid].begin(), data[tid].end(),
                  [&dist, &engine]() { return dist(engine); });
  }
  #pragma omp parallel private(tid) shared(data, results)
  {
    tid = omp_get_thread_num();
    results[tid] = std::accumulate(data[tid].begin(), data[tid].end(), 0);
  }
  for (auto r : results)
  {
    std::cout << r << ", ";
  }
  std::cout << std::endl;
  return results;
}

I tried to keep the code short while still forcing the machine to do some real computation in parallel. Each thread generates 10^7 random integers and then sums them up. The Python binding then does not even require gil_scoped_release:

#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include "OpenMpExample.hpp"
namespace py = pybind11;

// both versions work for me
// PYBIND11_MODULE(mylib, m) {
//     m.def("testFunction", &testFunction, py::call_guard<py::gil_scoped_release>());
// }

PYBIND11_MODULE(mylib, m) {
    m.def("testFunction", &testFunction);
}

Example output from Python:

Python 3.6.8 (default, Jun 29 2020, 16:38:14) 
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mylib
>>> x = mylib.testFunction()
Num threads: 12
-10975, -22101, -11333, -28603, -471, -15505, -18141, 2887, -6813, -5328, -13975, -4321, 

My environment: Ubuntu 18.04.3 LTS, gcc 8.4.0, OpenMP 201511, Python 3.6.8.
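The number 201511 in the environment line is the value of the _OPENMP macro, which encodes the release date of the supported OpenMP specification (November 2015, that is OpenMP 4.5). As a small sketch, a lookup like the following can translate the macro into a version string. The function name openmpSpecName is hypothetical, and the table only covers the releases listed in the switch.

```cpp
#include <string>

// Hypothetical helper: maps an _OPENMP macro value (yyyymm) to the
// corresponding OpenMP specification version.
std::string openmpSpecName(long macroValue)
{
  switch (macroValue)
  {
    case 200505: return "2.5";
    case 200805: return "3.0";
    case 201107: return "3.1";
    case 201307: return "4.0";
    case 201511: return "4.5";
    case 201811: return "5.0";
    case 202011: return "5.1";
    case 202111: return "5.2";
    default:     return "unknown";  // not compiled with OpenMP, or a newer release
  }
}
```

Calling openmpSpecName(_OPENMP) in a file compiled with OpenMP enabled gives the specification version the compiler implements; the macro is not defined at all when OpenMP is switched off.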