Abstract: In this paper, we show that the Kolmogorov two hidden layer neural network model with a continuous, discontinuous bounded or unbounded activation function in the second hidden layer can precisely represent continuous, discontinuous bounded and all unbounded multivariate functions, respectively.
During the forward pass you sample a discrete outcome given your NN weights to get an error for backprop. During the backward pass you propagate directly through the weights.
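A minimal sketch of that idea (my own toy setup, not any specific paper's method): the forward pass samples a hard binary outcome from a sigmoid probability, and the backward pass differentiates through the underlying continuous probability as if the sampling step were the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = 0.3                      # a single continuous weight
x = 2.0                      # input
target = 1.0

p = sigmoid(w * x)           # continuous "soft" value
b = float(rng.random() < p)  # discrete sample used in the forward pass

loss = (b - target) ** 2

# Straight-through-style backward pass: treat db/dp as 1, so the
# gradient flows through p = sigmoid(w * x) instead of the sample:
# dloss/dw ~= 2*(b - target) * p*(1 - p) * x
grad_w = 2 * (b - target) * p * (1 - p) * x
```

The forward pass only ever sees the discrete value `b`, but the weight still gets a usable (if biased) gradient signal.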
This GradTree paper[1] does a good job covering how to do discrete gradient-based optimization (i.e. NNs w/ discrete representations).
Another option is to use a GFlowNet[2]. Then you have a NN policy that takes discrete actions, as if you're playing an RL game. You're not back-propagating through something that isn't continuous, but you're using a NN to make informed decisions about a problem with a discrete representation.
Well, the approach currently doesn't apply to architectures being used in practice, and it's not clear, at least to me, if it can scale to large datasets.
Fair points. But given the potential impact this would have in practice, I'm still surprised there isn't more interest in exploring whether it's possible to scale the approach. Transforming neural network training into convex formulations would drastically simplify many current difficulties.
The paper only gives an existence result for the general case, and may require an arbitrarily complex activation function to represent arbitrarily complex multivariate functions, so it's unlikely to be useful for machine-learning applications.
I can't remember now and never gave measure theory a proper study, but isn't there a happy result that you can quotient out by a space of functions with zero integral to get an almost-everywhere differentiable function or something? Maybe that result relates L2 to continuous, though, not continuous to differentiable?
Even a result like f(x) = nice(x) + evil(x) with |evil| < epsilon should be "happy enough" right?
Edit: I may have been misremembering the Lebesgue decomposition theorem which is not quite so nice, as the singular part doesn't just go away.
For a "happy enough" approximation to the Weierstrass function, you can cut off the series once a^n/(1 - a) < epsilon and get a function that's differentiable everywhere. But it'll have an extremely large number of local optima, which is still bad for gradient descent.
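To make the cutoff concrete, here's a sketch of that truncation for the standard form W(x) = sum over n of a^n cos(b^n * pi * x): since |cos| <= 1, the tail after N terms is bounded by a^N / (1 - a), so stopping once that bound drops below epsilon gives a finite (hence everywhere-differentiable) partial sum within epsilon of W.

```python
import math

def weierstrass_truncated(x, a=0.5, b=7, eps=1e-6):
    # Partial sum of W(x) = sum_{n>=0} a^n * cos(b^n * pi * x).
    # The tail after N terms is bounded by a^N / (1 - a), so we stop
    # once that bound is below eps.
    total, n = 0.0, 0
    while a**n / (1 - a) >= eps:
        total += a**n * math.cos(b**n * math.pi * x)
        n += 1
    return total, n
```

With a = 0.5 this needs only ~21 terms, but the summand already oscillates at frequency b^20, which is the "extremely large number of local optima" problem in miniature.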
The paper only gives a representation result, so no method is known yet for constructing the component functions `g` for the discontinuous case. See the last paragraph of the Conclusion.
It may turn out that the correct representation can be constructed using an alternate method than backpropagation. But this is still an open question.
This isn't my field at all, so this might be silly.
I was thinking about some weird activation function like say
0 for x < 0.5
0.5 for x in [0.5, 1]
0.1x + 1 for x > 1
While it is piece-wise differentiable, similar to ReLU, I'm guessing regular backpropagation would struggle with it.
My probably silly thought was: what if you used a smoothed version for computing the derivatives in the backpropagation step, while keeping the discontinuous function for the actual evaluation? That would make backprop sensitive to the step changes in the function, while still allowing a discontinuous activation.
Of course this example function is just something random without any further thought, so probably not a useful one in actual usage.
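For what it's worth, the idea sketches out fine in code. Here the discontinuous activation above is kept for the forward pass, and a hypothetical smooth surrogate (each jump replaced by a steep sigmoid with temperature `t`, my choice, not from any paper) supplies the gradients; this is essentially the "surrogate gradient" trick used for spiking networks.

```python
import numpy as np

def step_act(x):
    # The discontinuous activation from the comment above:
    # 0 for x < 0.5, 0.5 on [0.5, 1], 0.1x + 1 for x > 1.
    return np.where(x < 0.5, 0.0, np.where(x <= 1.0, 0.5, 0.1 * x + 1.0))

def smooth_act(x, t=20.0):
    # Hypothetical smooth surrogate: each jump becomes a steep sigmoid.
    # Used only when computing gradients, never for evaluation.
    s = lambda z: 1.0 / (1.0 + np.exp(-t * z))
    return 0.5 * s(x - 0.5) + (0.1 * x + 0.5) * s(x - 1.0)

def surrogate_grad(x, h=1e-4, t=20.0):
    # Central-difference gradient of the surrogate, which a backprop
    # step would use while the forward pass keeps step_act.
    return (smooth_act(x + h, t) - smooth_act(x - h, t)) / (2 * h)
```

Away from the jumps the surrogate tracks the real function closely, and near the jumps the gradient becomes large, so backprop does "see" the steps.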
That being said, if it's only discontinuous in a few places you could extend the derivative everywhere by taking either the left or the right derivative, and then you'd end up with a gradient that's defined everywhere but not continuous. But does gradient descent still work if the gradient isn't continuous?
One of the common - if not the most common - activation functions is the 'rectified linear unit' (ReLU), which is a fancy name for y = max(x, 0). It has a discontinuous gradient (1 if x > 0, 0 if x < 0), and that works mostly fine.
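You can see that discontinuity numerically with a quick finite-difference check (a toy sketch, nothing more):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def num_deriv(f, x, h=1e-6):
    # Central-difference estimate of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)
```

`num_deriv(relu, 1.0)` comes out as 1 and `num_deriv(relu, -1.0)` as 0; at exactly x = 0 the central difference returns 0.5, an artifact of straddling the kink, which is roughly why frameworks just pick a convention for the subgradient there.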
Haven't downvoted, but the "continuously" in "continuously differentiable" refers to the derivative, not to the function itself. (Also, differentiable implies continuous, but not the other way around: differentiability is sufficient for continuity, not necessary.)
That’s not a requirement for universal approximation. It might make your life harder when you have to train the network, but that’s just an implementation detail anyway /s
It is a pretty cool result. Basically the lay summary (from my understanding) is that a multivariate function with n arguments can be perfectly represented by a network with 2 hidden layers (one of width n and another of width 2n) and two activation functions.
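Here's a shape-only sketch of that architecture as I understand it (the widths follow the comment above, and the tanh activations are stand-ins: the theorem guarantees that *some*, possibly very complicated, activations make the representation exact, not that these do):

```python
import numpy as np

rng = np.random.default_rng(0)

def kolmogorov_shape_net(x, sigma_inner=np.tanh, sigma_outer=np.tanh):
    # Two hidden layers: width n, then width 2n, then a scalar output.
    n = x.shape[0]
    W1 = rng.standard_normal((n, n))
    W2 = rng.standard_normal((2 * n, n))
    w3 = rng.standard_normal(2 * n)
    h1 = sigma_inner(W1 @ x)    # first hidden layer, width n
    h2 = sigma_outer(W2 @ h1)   # second hidden layer, width 2n
    return w3 @ h2              # scalar output
```

The striking part is that the *depth and widths are fixed* as a function of n; all the representational burden lands on the choice of activation functions.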
The representation for uncomputable functions will be uncomputable, of course. There's only a countable number of computable functions, so most functions are uncomputable, but that usually doesn't stop mathematicians from proving theorems about them.
The restriction is that it requires an activation function depending on the function it is approximating. Hence, if you want the NN to emulate an uncomputable function, the activation function probably has to be uncomputable as well.
I wonder if there's a subsubsubfield of ML research that goes like "let's use a halting-problem evaluator as an activation function for our purely theoretical neural network". Or like, "NNs with oracles"...