More ALUs don't help: modern CPUs already have several that can execute in the same cycle, yet we still often get IPCs < 1.0. The bottleneck is getting the operands and the operation into the ALU so it can run. That's what the cache is for. As you pointed out, it's horribly inefficient in some ways (in % of silicon area, though that's an odd metric, isn't it?), but that's the price of generality.

If you want something faster, you'll need to alter the upper-level systems to make feeding the ALU(s) easier. (And you'll still have Amdahl's law to contend with, which is a major factor even at the single-core scale.) AFAICT, you're asking for a system as high-level as Go or MATLAB that compiles itself into some sequence of FPGA programs and the I/O on them. Altering those upper-level systems (to the point of a rewrite, with a substantial amount of research work in the middle) is a lot of work, and efficiency at the silicon level isn't enough justification for it. Crude work partitioning by the programmer, plus more cores/machines, often works well for those who actually need the additional performance.
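To put a rough number on the Amdahl's law point, here's a minimal sketch. The function name `amdahl_speedup` and the 90% parallel fraction are my own assumptions for illustration, not measurements of any real workload.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on overall speedup under Amdahl's law.

    parallel_fraction: share of the runtime that can be spread across workers.
    workers: number of execution units (ALUs, cores, machines, ...).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time no matter how many workers there are;
    # only the parallel part shrinks as workers are added.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Assumed workload: 90% parallelizable (hypothetical figure).
    for workers in (2, 8, 64, 1024):
        print(f"{workers:5d} units -> {amdahl_speedup(0.90, workers):.2f}x")
```

With a 10% serial part, the speedup can never reach 10x however many execution units you add, which is why piling on ALUs alone doesn't buy much.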

But there's something reasonable on regular silicon that you may like. Haskell's got (semi-)automatic parallelization features, as well as the ability to compile code to run on CUDA. http://chimera.labs.oreilly.com/books/1230000000929 is a book (free to read online) covering these features. You can get the background here: http://book.realworldhaskell.org/