Abstract: In this paper, we show that the Kolmogorov two hidden layer neural network model with a continuous, discontinuous bounded or unbounded activation function in the second hidden layer can precisely represent continuous, discontinuous bounded and all unbounded multivariate functions, respectively.
During the forward pass you sample a discrete outcome given your NN weights to get an error for backprop. During the backward pass you propagate directly through the weights.
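A minimal sketch of that idea (my own toy setup, not any specific paper's method): the forward pass samples a hard binary outcome from a sigmoid probability, and the backward pass differentiates through the underlying continuous probability as if the sampling step were the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = 0.3                      # a single continuous weight
x = 2.0                      # input
target = 1.0

p = sigmoid(w * x)           # continuous "soft" value
b = float(rng.random() < p)  # discrete sample used in the forward pass

loss = (b - target) ** 2

# Straight-through-style backward pass: treat db/dp as 1, so the
# gradient flows through p = sigmoid(w * x) instead of the sample:
# dloss/dw ~= 2*(b - target) * p*(1 - p) * x
grad_w = 2 * (b - target) * p * (1 - p) * x
```

The forward pass only ever sees the discrete value `b`, but the weight still gets a usable (if biased) gradient signal.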
This GradTree paper[1] does a good job covering how to do discrete gradient-based optimization (i.e. NNs w/ discrete representations).
Another option is to use a GFlowNet[2]. Then you have a NN policy that takes discrete actions, as if you're playing an RL game. You're not back-propagating through something that isn't continuous, but you're using a NN to make informed decisions about a problem with a discrete representation.
Well, the approach currently doesn't apply to architectures being used in practice, and it's not clear, at least to me, if it can scale to large datasets.
Fair points. But given the potential impact this would have in practice, I'm still surprised there isn't more interest in exploring whether it's possible to scale the approach. Transforming neural network training into convex formulations would drastically simplify many current difficulties.
The paper only gives an existence result for the general case, and may require an arbitrarily complex activation function to represent arbitrarily complex multivariate functions, so it's unlikely to be useful for machine-learning applications.
I can't remember now and never gave measure theory a proper study, but isn't there a happy result that you can quotient out by a space of functions with zero integral to get an almost-everywhere differentiable function or something? Maybe that result relates L2 to continuous, though, not continuous to differentiable?
Even a result like f(x) = nice(x) + evil(x) with |evil| < epsilon should be "happy enough" right?
Edit: I may have been misremembering the Lebesgue decomposition theorem which is not quite so nice, as the singular part doesn't just go away.
For a "happy enough" approximation to the Weierstrass function, you can cut off the series once a^n/(1 - a) < epsilon and get a function that's differentiable everywhere. But it'll have an extremely large number of local optima, which is still bad for gradient descent.
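To make the cutoff concrete, here's a sketch of that truncation for the standard form W(x) = sum over n of a^n cos(b^n * pi * x): since |cos| <= 1, the tail after N terms is bounded by a^N / (1 - a), so stopping once that bound drops below epsilon gives a finite (hence everywhere-differentiable) partial sum within epsilon of W.

```python
import math

def weierstrass_truncated(x, a=0.5, b=7, eps=1e-6):
    # Partial sum of W(x) = sum_{n>=0} a^n * cos(b^n * pi * x).
    # The tail after N terms is bounded by a^N / (1 - a), so we stop
    # once that bound is below eps.
    total, n = 0.0, 0
    while a**n / (1 - a) >= eps:
        total += a**n * math.cos(b**n * math.pi * x)
        n += 1
    return total, n
```

With a = 0.5 this needs only ~21 terms, but the summand already oscillates at frequency b^20, which is the "extremely large number of local optima" problem in miniature.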
The paper only gives a representation result, so no method is known yet for constructing the component functions `g` for the discontinuous case. See the last paragraph of the Conclusion.
It may turn out that the correct representation can be constructed using an alternate method than backpropagation. But this is still an open question.
This isn't my field at all, so this might be silly.
I was thinking about some weird activation function like say
0 for x < 0.5
0.5 for x in [0.5, 1]
0.1x + 1 for x > 1
While it is piece-wise differentiable, similar to ReLU, I'm guessing regular backpropagation would struggle with it.
My probably silly thought was: what if you used a smoothed version for computing the derivatives in the backpropagation step, while keeping the discontinuous function for the actual evaluation? That would make backprop sensitive to the step changes in the function, while still allowing a discontinuous activation.
Of course this example function is just something random without any further thought, so probably not a useful one in actual usage.
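For what it's worth, the idea sketches out fine in code. Here the discontinuous activation above is kept for the forward pass, and a hypothetical smooth surrogate (each jump replaced by a steep sigmoid with temperature `t`, my choice, not from any paper) supplies the gradients; this is essentially the "surrogate gradient" trick used for spiking networks.

```python
import numpy as np

def step_act(x):
    # The discontinuous activation from the comment above:
    # 0 for x < 0.5, 0.5 on [0.5, 1], 0.1x + 1 for x > 1.
    return np.where(x < 0.5, 0.0, np.where(x <= 1.0, 0.5, 0.1 * x + 1.0))

def smooth_act(x, t=20.0):
    # Hypothetical smooth surrogate: each jump becomes a steep sigmoid.
    # Used only when computing gradients, never for evaluation.
    s = lambda z: 1.0 / (1.0 + np.exp(-t * z))
    return 0.5 * s(x - 0.5) + (0.1 * x + 0.5) * s(x - 1.0)

def surrogate_grad(x, h=1e-4, t=20.0):
    # Central-difference gradient of the surrogate, which a backprop
    # step would use while the forward pass keeps step_act.
    return (smooth_act(x + h, t) - smooth_act(x - h, t)) / (2 * h)
```

Away from the jumps the surrogate tracks the real function closely, and near the jumps the gradient becomes large, so backprop does "see" the steps.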
That being said, if it's only discontinuous in a few places you could extend the derivative everywhere by taking either the left or the right derivative, and then you'd end up with a gradient that's defined everywhere but not continuous. But does gradient descent still work if the gradient isn't continuous?
One of the common - if not the most common - activation functions is the 'rectified linear unit' (ReLU), which is a fancy name for y = max(x, 0). It has a discontinuous gradient (1 if x > 0, 0 if x < 0), and that works mostly fine.
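You can see that discontinuity numerically with a quick finite-difference check (a toy sketch, nothing more):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def num_deriv(f, x, h=1e-6):
    # Central-difference estimate of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)
```

`num_deriv(relu, 1.0)` comes out as 1 and `num_deriv(relu, -1.0)` as 0; at exactly x = 0 the central difference returns 0.5, an artifact of straddling the kink, which is roughly why frameworks just pick a convention for the subgradient there.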
Haven't downvoted, but the "continuously" in "continuously differentiable" refers to the derivative, not to the function itself. (Also, differentiable implies continuous, but not the other way around: differentiability is sufficient for continuity, not necessary.)
That’s not a requirement for universal approximation. It might make your life harder when you have to train the network, but that’s just an implementation detail anyway /s
It is a pretty cool result. Basically the lay summary (from my understanding) is that a multivariate function with n arguments can be perfectly represented by a network with 2 hidden layers (one of width n and another of width 2n) and two activation functions.
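Here's a shape-only sketch of that architecture as I understand it (the widths follow the comment above, and the tanh activations are stand-ins: the theorem guarantees that *some*, possibly very complicated, activations make the representation exact, not that these do):

```python
import numpy as np

rng = np.random.default_rng(0)

def kolmogorov_shape_net(x, sigma_inner=np.tanh, sigma_outer=np.tanh):
    # Two hidden layers: width n, then width 2n, then a scalar output.
    n = x.shape[0]
    W1 = rng.standard_normal((n, n))
    W2 = rng.standard_normal((2 * n, n))
    w3 = rng.standard_normal(2 * n)
    h1 = sigma_inner(W1 @ x)    # first hidden layer, width n
    h2 = sigma_outer(W2 @ h1)   # second hidden layer, width 2n
    return w3 @ h2              # scalar output
```

The striking part is that the *depth and widths are fixed* as a function of n; all the representational burden lands on the choice of activation functions.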
The representation for uncomputable functions will be uncomputable, of course. There's only a countable number of computable functions, so most functions are uncomputable, but that usually doesn't stop mathematicians from proving theorems about them.
The restriction is that it requires an activation function depending on the function it is approximating. Hence, if you want the NN to emulate an uncomputable function, the activation function probably has to be uncomputable as well.
I wonder if there's a subsubsubfield of ML research that goes like "let's use a halting-problem evaluator as an activation function for our purely theoretical neural network". Or like, "NNs with oracles"...