Fun fact: Long before the dawn of the GPU deep learning hype (and even before CUDA was a thing), a bunch of CS nerds from Korea managed to train a neural network on an ATI (now AMD) Radeon 9700 Pro using nothing but shaders [1]. They saw an even bigger performance improvement than Hinton and his group did for AlexNet 8 years later using CUDA.
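For anyone curious how you do linear algebra with nothing but shaders: the core trick in that era of GPGPU work was to store matrices as textures and let the rasterizer run a per-pixel program that computes one output element each. Here's a rough pure-Python emulation of the idea (illustrative names only, not code from the 2004 paper):

```python
# Emulation of texture-based matrix multiply, the workhorse of
# shader-era GPGPU: matrices live in "textures" (2D arrays), and a
# "fragment shader" runs once per output pixel, each invocation
# computing a single dot product. Names here are illustrative.

def fragment_shader(texture_a, texture_b, row, col):
    """Per-pixel program: computes one element of C = A @ B."""
    k = len(texture_b)  # inner (shared) dimension
    return sum(texture_a[row][i] * texture_b[i][col] for i in range(k))

def render_pass(texture_a, texture_b):
    """The rasterizer invokes the shader once per pixel of the render target."""
    rows, cols = len(texture_a), len(texture_b[0])
    return [[fragment_shader(texture_a, texture_b, r, c) for c in range(cols)]
            for r in range(rows)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(render_pass(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

On real 2004-era hardware each "shader invocation" ran in parallel across the GPU's fragment units, which is where the speedup over CPU matmul came from; forward and backward passes were both chains of such render passes.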
Cool, I had not heard about this. Adding this paper to my machine learning teaching bibliography.
Even though the start of the deep learning renaissance is typically dated to 2012 with AlexNet, things were in motion well before that. As you point out, GPU training was validated at least eight years earlier. Concurrently, some very prescient researchers like Fei-Fei Li were working hard to build large-scale datasets like ImageNet (CVPR 2009, https://www.image-net.org/static_files/papers/imagenet_cvpr0...). And in 2012 it all came together.
Since shaders were designed for, well, shading, this early experiment was more of an academic playground exercise than useful research. But AlexNet still wasn't the first deep neural network trained using CUDA. It had already been done three years earlier: https://dl.acm.org/doi/10.1145/1553374.1553486
The ImageNet competition had also been around since 2010. So the ingredients were actually all there before.
[1] https://ui.adsabs.harvard.edu/abs/2004PatRe..37.1311O/abstra...