Rogue Task in 3rd party library hanging the main thread and causing strange behaviour. How can I detect the hang and retry safely? C#

I am working in a C# console application with a 3rd party library and awaiting a Task that the 3rd party library is running. I’ve used async/await “all the way down”. The full details of the issue I’m having are here, but that’s quite a long post, so I’d like to simplify the problem and abstract it from the details.

Essentially (very simplified) the code is the following:

    public static async Task<byte[]> CaptureRawData()
    {
        await Camera.Capture();
        return Camera.CurrentStream.ToArray();
    }

This captures images on my Raspberry Pi and returns each image as a byte array in memory. It captures an image roughly every 10 seconds. After running for several hours, the Camera.Capture() call randomly hangs indefinitely.

I’ve read that this might be caused by intermittent power brownouts, but regardless, I just want to be able to detect the hang and try again. It only happens once every few hours, so I don’t really mind if I miss one image. I just want to be able to carry on and retry without freezing the main thread indefinitely.

I was inspired by this other SO question to try to run the task with a timeout so that I can just try again if it times out.

I adapted one of the answers provided to give the following:

    public static async Task<bool> CancelAfterAsync(Task startTask, TimeSpan timeout)
    {
        using (var timeoutCancellation = new CancellationTokenSource())
        {
            var delayTask = Task.Delay(timeout, timeoutCancellation.Token);
            Serilog.Log.Logger.Debug("await Task.WhenAny");
            var completedTask = await Task.WhenAny(startTask, delayTask);
            Serilog.Log.Logger.Debug("Finished await Task.WhenAny");
            // Cancel timeout to stop either task:
            // - Either the original task completed, so we need to cancel the delay task.
            // - Or the timeout expired, so we need to cancel the original task.
            // Canceling will not affect a task, that is already completed.
            timeoutCancellation.Cancel();
            if (completedTask == startTask)
            {
                // original task completed
                Serilog.Log.Logger.Debug(" await startTask;");
                await startTask;
                return true;
            }
            else
            {
                Serilog.Log.Logger.Debug("Timed out");
                // timeout
                return false;
            }
        }
    }

    public static async Task<byte[]> CaptureRawData()
    {
        Serilog.Log.Logger.Debug("Running task");
        if (await TaskUtils.CancelAfterAsync(Camera.Capture(), TimeSpan.FromSeconds(100)))
        {
            Serilog.Log.Logger.Debug("Got response, returning data");
            return Camera.CurrentStream.ToArray();
        }
        else
        {
            Serilog.Log.Logger.Warning("Camera timed out, return null");
            return null;
        }
    }

    public static async Task<byte[]> CaptureImage()
    {
        byte[] data = null;

        for (var i = 0; i < 10; i++)
        {
            data = await CaptureRawData();
            if (data != null)
            {
                break;
            }
            else
            {
                Serilog.Log.Logger.Warning("Camera timed out, retrying");
            }
        }

        if (data == null || data.Length == 0)
        {
            //Todo: Better exception message
            throw new Exception("Image capture failed");
        }

        return data;
    }

Now, upon hanging, it should detect the hang and retry up to 10 times. But instead I get the following logging output:

    [13:54:54 DBG] Running Task
    [13:54:54 DBG] await Task.WhenAny
    [13:56:34 DBG] Finished await Task.WhenAny
    [13:56:34 DBG] Timed out
    [13:56:34 WRN] Camera timed out, return null

It then hangs indefinitely at the “return null” line. It should log “Camera timed out, retrying” straight afterwards, but it never does.

This makes no sense, because the CancelAfterAsync method has clearly detected the hang and returned false, but it’s the parent method which then hangs.

How can I just detect the hang and retry safely?

As explained before, it only happens rarely, once every few hours after calling this method hundreds of times, so I just want to be able to detect that it’s happened and retry without locking everything up.

EDIT: As suggested in the comments, I tried running the rogue task inside a Task.Run, and removed all async from my program, like the following:

    public static class MemoryCapture
    {
        private static volatile bool _camProcessing = false;

        public static byte[] CaptureRawData()
        {
            MMALCamera cam = MMALCamera.Instance;
            MMALCameraConfig.Debug = true;
            MMALCameraConfig.StillEncoding = MMALEncoding.BGR24;
            MMALCameraConfig.StillSubFormat = MMALEncoding.BGR24;
            using (var imgCaptureHandler = new MemoryStreamCaptureHandler())
            using (var renderer = new MMALNullSinkComponent())
            {
                cam.ConfigureCameraSettings(imgCaptureHandler);
                cam.Camera.PreviewPort.ConnectTo(renderer);

                // Camera warm up time
                Thread.Sleep(2000);

                if (WaitForCam(cam))
                {
                    var result = imgCaptureHandler.CurrentStream.ToArray();
                    return result;
                }
                else
                {
                    Serilog.Log.Logger.Warning($"Reached timeout, returning null...");
                    return null;
                }
            }
        }

        private static bool WaitForCam(MMALCamera cam)
        {
            _camProcessing = true;
            Serilog.Log.Logger.Debug("Running cam process task");
            Task.Run(() =>
            {
                Serilog.Log.Logger.Debug($"cam.ProcessAsync");
                cam.ProcessAsync(cam.Camera.StillPort).ConfigureAwait(false).GetAwaiter().GetResult();
                Serilog.Log.Logger.Debug($"cam.ProcessAsync finished");
                _camProcessing = false;
            });
            for (var i = 0; i < 1000; i++)
            {
                Thread.Sleep(100);
                if (!_camProcessing)
                {
                    Serilog.Log.Logger.Debug($"cam processing finished");
                    return true;
                }
            }
            Serilog.Log.Logger.Warning($"Reached timeout, camera might have locked up");
            return false;
        }

        public static byte[] CaptureImageHelper()
        {
            byte[] data = null;
            for (var i = 0; i < 10; i++)
            {
                data = CaptureRawData();
                if (data != null)
                {
                    break;
                }
                Serilog.Log.Logger.Warning($"Retrying...");
            }

            if (data == null)
            {
                throw new Exception("Image capture failed");
            }

            return data;
        }

    }

The log output from that code is the following:

    [23:02:29 DBG] Running cam process task
    [23:02:29 WRN] cam.ProcessAsync
    [23:04:09 WRN] Reached timeout, camera might have locked up
    [23:04:09 WRN] Reached timeout, returning null…

It then hangs forever.

It’s hanging when returning from CaptureRawData, so it’s possible that it’s hanging while disposing one of the usings.

Answer

> How can I detect the hang and retry safely?

There’s only one answer to this: run that code in a separate process. You can redirect stdin/stdout to act as a “command/response” channel, and if it ever takes too long to respond, kill the entire process and restart it. This is fairly heavyweight, but it’s the only way to properly cancel uncancelable code.
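As a rough illustration of that command/response idea, here is a minimal supervisor sketch. The class name `WorkerSupervisor`, the one-line-in, one-line-out protocol, and the worker executable itself are all assumptions made for the example; none of them come from the camera library. It relies only on `System.Diagnostics.Process` with redirected standard streams.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical supervisor for a child process that owns the camera.
// The worker executable and its line-based protocol are assumptions.
public sealed class WorkerSupervisor : IDisposable
{
    private readonly string _fileName;
    private readonly string _arguments;
    private Process _worker;

    public WorkerSupervisor(string fileName, string arguments)
    {
        _fileName = fileName;
        _arguments = arguments;
    }

    // Starts the worker if it is not running (first call, or after a kill/crash).
    private void EnsureStarted()
    {
        if (_worker != null && !_worker.HasExited) return;
        _worker?.Dispose();
        _worker = Process.Start(new ProcessStartInfo(_fileName, _arguments)
        {
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            UseShellExecute = false
        });
    }

    // Sends one command line and waits for one response line.
    // Returns null if the worker does not answer in time (it is then killed)
    // or if it exited and closed its output.
    public async Task<string> RequestAsync(string command, TimeSpan timeout)
    {
        EnsureStarted();
        await _worker.StandardInput.WriteLineAsync(command);
        await _worker.StandardInput.FlushAsync();

        Task<string> read = _worker.StandardOutput.ReadLineAsync();
        if (await Task.WhenAny(read, Task.Delay(timeout)) == read)
            return await read;

        // No response: kill the whole process so the OS reclaims its handles.
        KillWorker();
        return null;
    }

    private void KillWorker()
    {
        try { _worker.Kill(); }
        catch (InvalidOperationException) { /* process had already exited */ }
        _worker.WaitForExit(5000); // bounded wait so the supervisor cannot hang here
        _worker.Dispose();
        _worker = null;
    }

    public void Dispose()
    {
        if (_worker != null) KillWorker();
    }
}
```

A caller might create it as `new WorkerSupervisor("dotnet", "CameraWorker.dll")` (a hypothetical worker build), call `RequestAsync("capture", TimeSpan.FromSeconds(100))` in the retry loop, and treat a `null` result as "timed out, worker restarted on the next request". Abandoning the pending read is acceptable here only because the process on the other end of the pipe is killed.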

A separate process is necessary because your code needs to clean up everything that API was doing, and in the case of a hung API, the only way to do that is to have the OS step in and do the cleanup. This is particularly the case where an API likely has some kind of exclusive lock on a hardware resource.

The problem with CancelAfterAsync is that it only cancels the waiting of the task. The Camera.Capture method is still in progress, potentially holding any hardware resources open, presumably indefinitely. So even if you get that working, there’s no guarantee that starting a second Camera.Capture will work at all. It’s much cleaner to have Camera.Capture in a separate process, kill that process (having the OS come in and clean up everything including handles to hardware resources), and then restart it.
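The point that a timeout only stops the waiting can be shown in a few lines. This is a small demonstration, not the question's code: `NeverCompletes` is a stand-in for a hung `Camera.Capture()`, and both method names are made up for the example.

```csharp
using System;
using System.Threading.Tasks;

public static class TimeoutDemo
{
    // Stand-in for a hung capture call: a task that never completes.
    public static Task NeverCompletes() => new TaskCompletionSource<bool>().Task;

    // Returns true if 'work' finished within 'timeout', false otherwise.
    // Nothing in here is able to stop 'work' itself; it only stops waiting.
    public static async Task<bool> FinishedWithin(Task work, TimeSpan timeout)
    {
        Task winner = await Task.WhenAny(work, Task.Delay(timeout));
        return winner == work;
    }
}
```

Calling `FinishedWithin(hung, TimeSpan.FromMilliseconds(200))` on a task from `NeverCompletes()` returns `false`, and `hung.IsCompleted` is still `false` afterwards. The operation behind the task is exactly where it was, along with whatever it was holding.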

> It’s hanging when returning from CaptureRawData, so it’s possible that it’s hanging while disposing one of the usings.

That is very likely. Since there is already a Camera.Capture running (and hung), it’s possible that the disposal is waiting to access the hardware resource that is indefinitely in use. Again, this should be fixed by using a separate process, since the OS will step in and force those handles closed when the process is killed.
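For completeness, the child-process side of a stdin/stdout command/response channel might look something like the sketch below. The `capture` command, the `OK`/`ERROR` reply format, and Base64 encoding of the image are assumptions chosen for the example; the `capture` delegate stands in for the real camera call.

```csharp
using System;
using System.IO;

public static class CaptureWorker
{
    // Reads one command per line from 'input' and writes exactly one
    // response line per command to 'output', until 'input' is closed.
    public static void Run(TextReader input, TextWriter output, Func<byte[]> capture)
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            if (line == "capture")
            {
                try
                {
                    // Base64 keeps the binary image on a single text line.
                    output.WriteLine("OK " + Convert.ToBase64String(capture()));
                }
                catch (Exception ex)
                {
                    output.WriteLine("ERROR " + ex.GetType().Name);
                }
            }
            else
            {
                output.WriteLine("ERROR unknown command");
            }
            output.Flush();
        }
    }
}
```

The worker's `Main` would then call `CaptureWorker.Run(Console.In, Console.Out, ...)` with the camera capture as the delegate. If the delegate hangs, the loop simply never replies, which is what the parent's timeout detects. Anything else written to stdout (for example library debug output) would corrupt a protocol like this, so such output would need to go to stderr or a log file.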
