
CUDA: very expensive cudaLaunch calls






I'm profiling a slow application, and I'm seeing that every kernel launch's cudaLaunch call takes around 150-200 µs. This makes the smaller kernel launches much slower than they should be. Launching kernels is relatively expensive, but this sounds like an order of magnitude slower than it should be. The numbers I've heard are on the order of 5 µs plus about 0.5 µs per texture/surface. I have about 30 textures, which would account for about 15 µs, which doesn't take me anywhere near the 150 µs I'm seeing. (There are also around 20-30 surfaces, but they're not used by the small kernels that are being called a lot.) The main loop runs around 8 kernels, and all but one of them are very simple. The timing numbers I'm getting are from the MSVC CUDA profiler. I'm on Windows 7, with a GeForce 750 Ti on 347.88 drivers. What else might cause kernel launches to be this slow?

The reply: you can leave out the cudaDeviceSynchronize() after the kernel if you want. On a Linux system, with a modern CUDA version, using a null kernel, you would find that each launch takes about 5 µs, and about 20 µs with the cudaDeviceSynchronize() added back in. If you then make the kernel launch more complex by adding bound textures, and time the increment due to each additional texture, I think you will find pretty much the timing stated in the blog article. If you now repeat the experiment on Windows, with the WDDM driver model (which is what you are stuck with when using a consumer GPU), you will see something like this: already with a null kernel, timing is all over the place, from 10 µs to 80 µs for the launch alone, that is, without a call to cudaDeviceSynchronize().
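
The null-kernel experiment described in the reply is straightforward to reproduce. Below is a minimal sketch (not from the thread; the kernel name, iteration count, and launch configuration are arbitrary choices) that times the launch call on its own and then together with a cudaDeviceSynchronize(), averaged over many iterations with host-side timers. On a WDDM system, expect the launch-only numbers to vary widely from run to run, as described above.

#include <cstdio>
#include <chrono>
#include <ratio>
#include <cuda_runtime.h>

// Empty ("null") kernel: any measured cost is launch overhead, not work.
__global__ void null_kernel() {}

int main() {
    const int iters = 10000;

    // Warm-up: create the context and absorb the first, expensive launch.
    null_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    // 1) Launch cost only: no synchronization inside the loop.
    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i)
        null_kernel<<<1, 1>>>();
    auto t1 = std::chrono::high_resolution_clock::now();
    cudaDeviceSynchronize();   // drain the queue before the next measurement

    // 2) Launch plus cudaDeviceSynchronize() after every kernel.
    auto t2 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) {
        null_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();
    }
    auto t3 = std::chrono::high_resolution_clock::now();

    double launch_only_us =
        std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    double launch_sync_us =
        std::chrono::duration<double, std::micro>(t3 - t2).count() / iters;
    printf("launch only : %.2f us per launch\n", launch_only_us);
    printf("launch+sync : %.2f us per launch\n", launch_sync_us);
    return 0;
}

Compile it with nvcc as a .cu file and compare the two averages against the 5 µs / 20 µs Linux figures quoted above; the gap gives a rough sense of how much of the 150-200 µs is driver-model overhead rather than per-texture setup.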






