Why do I need to include <stdio.h> to use CUDA’s printf()?

I want to printf() something in my CUDA kernel. The Programming Guide suggests I do that like so:

#include <stdio.h>

__global__ void helloCUDA(float f)
{
    printf("Hello thread %d, f=%f\n", threadIdx.x, f);
}

But this is simply including the standard C library’s stdio.h. Why would that be necessary? CUDA’s printf() doesn’t behave the same way as stdio’s printf(), and I certainly don’t need most of what else is in that header.

Answer

It’s an implementation detail you don’t need to know about. It stems from a limitation of the CUDA language: basically, it is illegal to define different __device__ and __host__ versions of the same function.

The standard library prototype is used as a proxy in device code during compilation, and when compiling for a supported architecture, some sneaky template overloading is used to insert the device implementation into the device code.
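
To see the shared prototype in use, here is a minimal sketch of a complete program built around the kernel from the question. The main() function, the launch configuration (one block of four threads) and the argument value are my own illustrative choices, not part of the question. It assumes compilation with nvcc for a device that supports in-kernel printf() (compute capability 2.0 or later). The single #include <stdio.h> provides the declaration used by both the host call and the device call.

```cuda
#include <stdio.h>          // one prototype, used by both the host and the device call
#include <cuda_runtime.h>   // implicitly included by nvcc; listed here for clarity

__global__ void helloCUDA(float f)
{
    // Device-side printf(): output goes into a device buffer, not straight to stdout.
    printf("Hello thread %d, f=%f\n", threadIdx.x, f);
}

int main(void)
{
    // Illustrative launch: 1 block of 4 threads (an assumption, not from the question).
    helloCUDA<<<1, 4>>>(1.2345f);

    // Device printf() output is flushed at synchronization points such as this one.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Host-side printf(): the ordinary C library implementation.
    printf("Hello from the host\n");
    return 0;
}
```

Without the cudaDeviceSynchronize() call the program may exit before the device buffer is flushed, in which case the kernel’s output never appears.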