PyTorch SSLError in DataLoader when num_workers is greater than 1

I have created a Dataset object that loads some data from an API when loading an item:

import torch
import requests
from torch.utils.data import Dataset

session = requests.Session()

class MyDataset(Dataset):

    def __init__(self, obj_ids=None):
        super().__init__()
        self.obj_ids = obj_ids if obj_ids is not None else []

    def __len__(self):
        return len(self.obj_ids)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        # Fetch the item from the API (placeholder URL)
        result = session.get('/api/url/{}'.format(idx))

        ## Post processing work...

Then I pass it to my DataLoader:

data_loader = torch.utils.data.DataLoader(
              dataset, batch_size=2, shuffle=True, num_workers=1,
              collate_fn=utils.collate_fn)

Everything works fine when training with num_workers=1, but when I increase it to 2 or more I get an error in my training loop.

On this line:

train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)

SSLError: Caught SSLError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1373, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 319, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 280, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
ssl.SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2570)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.7/dist-packages/urllib3/connectionpool.py", line 638, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/local/lib/python3.7/dist-packages/urllib3/util/retry.py", line 399, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='mydomain.com', port=443): Max retries exceeded with url: 'url_with_error_is_here' (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2570)')))

If I remove the request, I stop getting the SSL error, so the problem must be something with the requests library or maybe urllib.

I changed the domain and URL in the error to dummy values, but both the URLs and domains work when using just 1 worker.

I’m running this in a Google Colab environment with GPU enabled, but I also tried it on my local machine and get the same problem.

Can anyone help me solve this issue?

Answer

After debugging a bit and reading more about multiprocessing and requests.Session, it seems the problem is that I cannot use a requests.Session inside the Dataset, because PyTorch uses multiprocessing in the training loop once num_workers is greater than 1: the workers fork from the parent process and end up sharing the Session's underlying SSL connection, which corrupts the TLS stream.

More about it in this question: How to assign python requests sessions for single processes in multiprocessing pool?
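
Building on that idea, if you do want to keep a Session (for connection pooling, auth headers, etc.), a minimal sketch is to give each worker process its own Session. The _session attribute and the pid guard below are my own illustration, not part of the linked answer:

import os
import requests
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):

    def __init__(self, obj_ids=None):
        super().__init__()
        self.obj_ids = obj_ids if obj_ids is not None else []
        self._session = None       # created lazily, once per process
        self._session_pid = None

    @property
    def session(self):
        # Build a fresh Session the first time this process needs one.
        # The pid check avoids reusing a Session inherited from the
        # parent after the DataLoader forks its worker processes.
        if self._session is None or self._session_pid != os.getpid():
            self._session = requests.Session()
            self._session_pid = os.getpid()
        return self._session

    def __len__(self):
        return len(self.obj_ids)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        result = self.session.get('/api/url/{}'.format(idx))

        ## Post processing work...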

In the end I fixed the issue by changing every session.get or session.post call to requests.get or requests.post: without a Session, each request opens its own connection in the worker that makes it, so no SSL connection is shared between processes and the SSLError goes away.
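
For reference, a minimal sketch of the corrected __getitem__ (the URL is still the question's placeholder):

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        # requests.get opens a fresh connection inside whichever worker
        # process runs this call, so no SSL connection is shared
        result = requests.get('/api/url/{}'.format(idx))

        ## Post processing work...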