Cool, thanks again for your great post. Big fan of your articles.
> The functional graph becomes more complicated
Indeed. I once tried to come up with a "higher-order function" that takes a feedforward network and produces a separate function computing the backward pass (like Theano's autodiff, but with the abstraction at the layer level rather than at the op level).
Here's a diagram for a simple forward MLP (left to right), with the backward-pass network below it (right to left). I found this hard to work with, because the computational graph explodes in size once you try to decouple the optimization from the function. I notice something similar when trying to unroll an RNN visually across time.
http://imgur.com/ATNwknh
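In case it helps make the idea concrete, here's a minimal sketch of what I mean (all names hypothetical, and the network is just a toy list of dense layers): the "higher-order function" `make_backward` takes the forward network and returns a function that runs the reversed, backward-pass network.

```python
import numpy as np

class Dense:
    """A fully connected layer with a tanh nonlinearity."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.standard_normal((n_in, n_out)) * 0.1
        self.b = np.zeros(n_out)

    def forward(self, x):
        self.x = x                      # cache input for the backward pass
        self.z = x @ self.W + self.b
        return np.tanh(self.z)

    def backward(self, grad_out):
        grad_z = grad_out * (1 - np.tanh(self.z) ** 2)  # derivative of tanh
        self.dW = self.x.T @ grad_z     # parameter gradients, stored on the layer
        self.db = grad_z.sum(axis=0)
        return grad_z @ self.W.T        # gradient w.r.t. this layer's input

def make_backward(layers):
    """Higher-order function: given the forward network (a list of layers),
    return a function that runs the backward network (same layers, reversed)."""
    def backward_pass(grad_loss):
        g = grad_loss
        for layer in reversed(layers):
            g = layer.backward(g)
        return g
    return backward_pass

rng = np.random.default_rng(0)
net = [Dense(3, 4, rng), Dense(4, 2, rng)]

x = rng.standard_normal((5, 3))
y = x
for layer in net:                       # forward pass, left to right
    y = layer.forward(y)

backward = make_backward(net)           # the derived backward-pass network
grad_x = backward(np.ones_like(y))      # backward pass, right to left
```

The diagram above is basically this `reversed(layers)` chain drawn out as its own graph, which is where the size blow-up comes from.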
Let me know if this is way off base from what you were talking about in your post.