This was a great essay, and as someone who struggles a lot with hyperawareness OCD, I cried reading it.
First on a positive note, the example about attention on sex and arousal feeding back on itself and deepening the experience is well described and easy to relate to. But I think the "deepening an experience through attention" phenomenon applies in so many other domains as well - Sustained attention on a film or video game world, deep uninterrupted creative work for many hours, etc. It's a wonderful positive feedback loop.
It is somewhat similar to how when sitting in silence outside for a long period of time you begin to become aware of more and more subtle details of the experience that weren't immediately accessible. Almost like you're turning up the sensitivity knob on things.
Unfortunately as the author describes, the attention feedback loop can become unpleasant and even torturous when it is directed on negative sensations. For me it has been various things at different stages of my life - muscle tension, breathing, eye floaters in my vision, etc. The same process plays out - Sustained fixation of attention on the sensation increases your sensitivity to it, meaning you notice it more and it bothers you more, meaning you pay more attention to it, and it gets out of control.
The difficulty I experience is that this attention is unwanted and yet I feel my mind focus on it almost automatically. Paradoxically, most of the treatment/recovery advice for this type of OCD is to allow these sensations to be there without rejecting them, which I'm still working on.
But it is helpful to see the positive flip side of the coin too - Our minds are capable of deep focus and deep attention, which can increase sensitivity and let you see increasingly subtle details of experience, making you a better appreciator of art and life, a better creator, a better listener and friend, etc.
Right, but isn't it a problem that a quine also requires the information contained in the language's compiler/interpreter to be fully meaningful? This would be "outside the universe" so to speak.
A minimal quine just prints itself out as "source" code. You can choose the source language to be whatever you like, such as a minimal Turing complete combinator. So all you need is an interpreter for the base level, which could be something as simple as Rule 110[1].
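To make the "interpreter for the base level" concrete, here is a minimal sketch of one Rule 110 update step (with wraparound boundaries as a simplifying assumption; the classic definition uses an infinite tape):

```python
# Minimal sketch of one Rule 110 update step on a cyclic tape.
# Wraparound boundaries are an assumption for brevity; the classic
# definition of the automaton runs on an infinite tape.
RULE_110 = {
    (1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
    (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0,
}

def rule110_step(cells):
    """Apply one Rule 110 update to a list of 0/1 cells."""
    n = len(cells)
    return [RULE_110[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
            for i in range(n)]
```

The lookup table is just the binary expansion of 110 (01101110) indexed by the eight possible three-cell neighborhoods, which is the entire "machine".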
It really doesn't matter what Turing-complete language you choose; they can all be implemented in terms of each other, so as soon as you have your quine in one language you could just as well do it in any other.
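As a concrete illustration (Python chosen arbitrarily), the classic two-line quine: `%r` inserts the string's own repr (quotes and escapes included) and `%%` is a literal percent sign, so the program's output is exactly its own two lines of source.

```python
s = 's = %r\nprint(s %% s)'
print(s % s)
```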
Sort of. The very first compiled binary of any new language has to first be written in a language that already has a compiler, and the very first compiler of any high-level language at all had to be written in assembler.
Ultimately, if you can't write a quine directly in logic gates (which you can't, because no microprocessor can output another microprocessor), you need something external to the "universe" of the language.
The entire US benefits from rules only California makes for itself. The entire world benefits from rules only the EU makes for itself.
That very same globalization sometimes causes influence as a pure byproduct.
Regardless, it's not an all-or-nothing situation. If there are 6000 problems, it's perfectly fine to pick just one of them and do something that only makes it 1% better. So what if there are 5999 other problems? So what if most of this one problem didn't even get better because there is some way around it? Tomorrow you just keep doing more of the same and make the current problem 2% better. Or make one of the other 5999 a little better.
It really only takes the EU and US agreeing that something is bad. Together they represent a huge chunk of the imports of nearly every sector of the global economy. In practice that is what globalization largely is - other countries producing stuff that the US and EU buy.
If you manufacture a chemical and the US and EU both decide that chemical is banned and can't be imported, your business may very well be toast, and you will have a large incentive to produce stuff that they want instead. This doesn't necessarily even take legislation as the relevant governments have a variety of ways they can apply tariffs, disincentives etc. to stuff they don't like.
It doesn’t even take both agreeing necessarily. In many cases just either one of the EU or the US legislating something is sufficient financial incentive to follow the same rule globally.
You establish agreements over a large enough market that it actually makes a difference, so that competitors doing the right thing aren't penalized, or at least aren't penalized enough to be disincentivized.
Sometimes that market is the US and you only need a local bill to do that. Sometimes you do a bilateral or multilateral treaty. But these things happen all the time even today. E.g., if not for Trump there would probably have been a new Pacific trade agreement, and Biden has managed to rally multilateral cooperation around security issues involving Russia (e.g., see the new applications to NATO as one example, as well as related economic agreements and sanctions) and climate change.
The main challenge is that Trump’s behavior of cancelling deals that were on the finish line as well as backing out of existing agreements (Paris accords + Iran) means that partners are now more wary of entering into agreements in the first place.
This has less to do with globalization and more to do with the USA's increasing instability as a reliable partner that can execute on agreed-upon commitments across political transitions. So now countries are less reliant on commitments from the USA. From one perspective, that's good: local autonomy is a powerful tool. From a different perspective, the USA frequently (not always, and maybe not even a majority of the time) led the way and set a global direction through unilateral action (i.e., even without treaties).

Again, I think that's gone less because of globalization and more because the world has grown tired of the USA's political weight-throwing and realpolitik behavior rather than sticking to common principles (democracy, rule of law, not torturing enemies no matter what they've done, human rights, raising up scientists and facts even when politically inconvenient, leaders applying pressure to move their voter base instead of playing wag the dog, having principles about who we count as our allies, etc.). One set of behaviors engenders trust while the other degrades it and leads to whataboutism politics. Agreements in low-trust environments are rarer and harder to maintain. Agreements in high-trust environments are much cheaper.
So, while I agree that "governments around the world coordinating" is harder, I disagree that it's impossible or that it was caused by globalization. Rather, multi-generational realpolitik weight-throwing and underhanded behavior killed a lot of the good will brought about by the widespread "saviors of WWII" propaganda, the "golden city" aura post-WWII and during the Cold War, and technologically and economically outpacing Russia. We've been leveraging that good will more than our finances, because the former is completely invisible and impossible to quantify and measure.
Great overview. The part that is still very unintuitive for me is the denoising process.
If the diffusion process is removing noise by predicting a final image and comparing it to the current one, why can't we just jump to the final predicted image? Or is the point that because it's an iterative process, each noise step results in a different "final image" prediction?
In the reverse diffusion process, the reason we can't directly jump from a noisy image at step t to a clean image at step 0 is that each possible noisy image at step t may be visited by potentially many real images during the forward diffusion process. Thus, our model, which inverts the diffusion process by minimizing least-squares prediction error of a clean image given a noisy image at step t, will learn to predict the mean over potentially many real images, which is not itself a real image.
To generate an image we start with a noise sample and take a step towards the _mean_ of the distribution of real images which would produce that noise sample when running the forward diffusion process. This step moves us towards the _mean_ of some distribution of real images and not towards a particular real image. But, as we take a bunch of small steps and gradually move back through the diffusion process, the effective distribution of real images over which this inverse diffusion prediction averages has lower and lower entropy, until it's effectively a specific real image, at which point we're done.
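The "mean of many real images is not a real image" point can be sketched numerically in a toy 1-D setting (all names and numbers here are hypothetical, just for illustration):

```python
import numpy as np

# Toy 1-D illustration: the "clean images" are just the two points -1 and +1.
# After a large noising step, a sample near 0 could have come from either
# mode, so the least-squares-optimal denoiser predicts the posterior mean
# (roughly 0), which is not itself a real datapoint.
rng = np.random.default_rng(0)
x0 = rng.choice([-1.0, 1.0], size=100_000)        # bimodal "clean" data
xt = x0 + rng.normal(scale=3.0, size=x0.shape)    # one big noising step
near_zero = np.abs(xt) < 0.1                      # condition on xt close to 0
mean_prediction = x0[near_zero].mean()            # lands between the two modes
```

The empirical `mean_prediction` sits near 0, between the only two real datapoints, which is exactly the failure mode of a single big denoising jump.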
> But, as we take a bunch of small steps and gradually move back through the diffusion process...
...but, the question is, why can't we take a big step and be at the end in one step.
Obviously a series of small steps gets you there, but the question was why you need to take small steps.
I feel like this is just an 'intuitive explanation' that doesn't actually do anything other than rephrase the question: "You take a series of small steps to reduce the noise in each step and end up with a picture with no noise".
The real reason is that big steps result in worse results (1); the model was specifically designed to be a series of small steps because when you take big steps, you end up with overfitting, where the model just generates a few outputs from any input.
The reason why big steps produce worse results, when using current architectures and loss functions, is precisely because the least squares prediction error and simple "predict the mean" approach used to train the inverse model does not permit sufficient representational capacity to capture the almost always multimodal conditional distribution p(clean image | noisy image at step t) that the inverse model attempts to approximate.
Essentially, current approaches rely strongly on an assumption that the conditional we want to estimate in each step of the reverse diffusion process is approximately an isotropic Gaussian distribution. This assumption breaks down as you increase the size of the steps, and models which rely on the assumption also break down.
This is not directly related to overfitting. It is a fundamental aspect of how these models are designed and trained. If the architecture and loss function for training the inverse model were changed it would be possible to make an inverse model that inverts more steps of the forward diffusion process in a single go, but then the inverse model would need to become a full generative model on its own.
> This assumption breaks down as you increase the size of the steps, and models which rely on the assumption also break down.
Hm. Why's that?
The only reason I mentioned overfitting is because that's literally what they say in the paper I linked: that the diffusion factor was selected to prevent overfitting.
...
I guess I don't really have a deep understanding of this stuff, but your explanation seems to be missing something: specifically, that noise is added to the latent each round, on a schedule (1), with less noise each round.
That's what causes it to converge on a 'final' value; you're explicitly modifying the amount of additional noise you feed in. If you don't add any noise, you get nothing more from doing 1 step than you do from 10 or 50.
Right?
"as we take a bunch of small steps and gradually move back through the diffusion process, the effective distribution of real images over which this inverse diffusion prediction averages has lower and lower entropy"
I'm really not sure about that... :/
(1) - "For binomial diffusion, the discrete state space makes gradient ascent with frozen noise impossible. We instead choose the forward diffusion schedule β1···T to erase a constant fraction 1/T of the original signal per diffusion step, yielding a diffusion rate of βt = (T − t + 1)^−1."
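That schedule can be checked directly: with βt = 1/(T − t + 1), the fraction of original signal surviving after t steps telescopes to (T − t)/T, i.e. exactly a constant 1/T is erased per step. A quick sketch:

```python
# Check that beta_t = 1/(T - t + 1) erases a constant fraction 1/T per step:
# the surviving signal fraction prod_{k<=t} (1 - beta_k) telescopes, since
# 1 - beta_k = (T - k)/(T - k + 1), leaving (T - t)/T after t steps.
def surviving_fraction(t, T):
    frac = 1.0
    for k in range(1, t + 1):
        beta_k = 1.0 / (T - k + 1)
        frac *= 1.0 - beta_k
    return frac
```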
To train the inverse diffusion model, we take a clean image x0 and generate a noisy sample xt which is from the distribution over points that x0 would visit following t steps of forward diffusion. For any value of t, any xt which is visited by x0 is also visited by some other clean images x0' when we run t steps of diffusion starting from those x0'. In general, there will be many such x0' for any xt which our initial x0 might visit after t steps of forward diffusion.
If t is small and the noise schedule for diffusion adds small noise at each step, then the inverse conditional p(x0 | xt) which we want to learn will be approximately a unimodal Gaussian. This is an intrinsic property of the forward diffusion process. When t is large, or the diffusion schedule adds a lot of noise at each step, the conditional p(x0 | xt) will be more complex and include a larger fraction of the images in the training set.
"If you don't add any noise, you get nothing more from doing 1 step than you do from 10 or 50." -- there are actually models which deterministically (approximately) integrate the reverse diffusion process SDE and don't involve any random sampling aside from the initial xT during generation.
For example, if t=T, where T is the total length of the diffusion process, then xt=xT is effectively an independent Gaussian sample and the inverse conditional p(x0 | xT) is simply p(x0) which is the distribution of the training data. In general, p(x0) is not a unimodal isotropic Gaussian. If it was, we could just model our training set by fitting the mean and (diagonal) covariance matrix of a Gaussian distribution.
"I'm really not sure about that... :/" -- the forward diffusion process initiated from x0 iteratively removes information about the starting point x0. Depending on the noise schedule, the rate at which information about x0 is removed by the addition of noise can vary. Whether we're in the continuous or discrete setting, this means the inverse conditional p(x0 | xt) will increase in entropy as t goes from 1 to T, where T is the max number of diffusion steps. So, when we generate an image by running the inverse diffusion process the conditional p(x0 | xt) will have shrinking entropy as t is now decreasing.
The quote (1) you reference is about how trying to directly optimize the noise schedule is more challenging when working with discrete inputs/latents. Whether the noise schedule is trained, as in their continuous case, or defined a priori, as in their discrete case, each step of forward diffusion removes information about the input and what I said about shrinking entropy of the conditional p(x0 | xt) as we run reverse diffusion holds. In the case of current SOTA diffusion models, I believe the noise schedules are set via hyperopt rather than optimized by SGD/ADAM/etc.
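The growing entropy of p(x0 | xt) with accumulated noise can be made concrete in a toy setting (hypothetical setup: clean data uniform on {−1, +1}, with a single Gaussian noise scale sigma standing in for t steps of diffusion):

```python
import math

# Toy: x0 is uniform on {-1, +1} and xt = x0 + N(0, sigma^2). Bayes' rule
# gives p(x0 = +1 | xt) = sigmoid(2 * xt / sigma^2), so the posterior's
# entropy grows with sigma: more accumulated noise, less certainty about x0.
def posterior_entropy_bits(xt, sigma):
    p = 1.0 / (1.0 + math.exp(-2.0 * xt / sigma ** 2))  # P(x0 = +1 | xt)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 0.0
    return -(p * math.log2(p) + q * math.log2(q))
```

For the same observation xt = 0.5, a small noise scale gives a nearly deterministic posterior (entropy near 0 bits) while a large one gives nearly a coin flip (entropy near 1 bit), mirroring the shrinking entropy as reverse diffusion runs t downward.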
> why can't we take a big step and be at the end in one step.
Because we're doing gradient descent. (No, seriously, it's turtles all the way down (or all the way up, considering we're at a higher level of abstraction here).)
We're trying to (quickly, in less than 100 steps) descend a gradient through a complex, irregular and heavily foggy 16384-dimensional landscape of smeared, distorted, and white-noise-covered images that kinda sorta look vaguely like what we want if you squint (well, if the neural network squints, anyway). If we try to take a big step, we don't descend the gradient faster; we fly off in a mostly random direction, clip through various proverbial cliffs, and probably end up somewhere higher up the gradient than we started.
The problem is that predicting a pixel requires knowing what the pixels around it look like. But if we start with lots of noise, then the neighboring pixels are all just noise and have no signal.
You could also think of this as: we start with a terrible signal-to-noise ratio, so we need to average over very large areas to get any reasonable signal. But as we increase the signal, we can average over a smaller area to get the same signal-to-noise ratio.
In the beginning, we're averaging over large areas, so all the fine detail is lost. We just get 'might be a dog? maybe??'. What the network is doing is saying "if this is a dog, there should be a head somewhere over here. So let me make it more like a head." Which improves the signal-to-noise ratio a bit.
After a few more steps, the signal is strong enough that we can get sufficient signal from smaller areas, so it starts saying 'head of a dog' in places. So the network will then start doing "Well, if this is a dog's head, there should be some eyes. Maybe two, but probably not three. And they'll be kinda somewhere around here".
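The averaging intuition above can be sketched numerically (hypothetical setup): pooling k noisy pixels of a constant signal improves SNR by roughly sqrt(k), which is why early steps can only resolve coarse, large-area structure.

```python
import numpy as np

# Sketch: averaging k noisy measurements of a constant signal boosts the
# signal-to-noise ratio by about sqrt(k). When SNR is poor (early in
# denoising), only large-area averages carry usable signal, so only coarse
# structure ("might be a dog?") is recoverable.
rng = np.random.default_rng(1)

def snr_after_averaging(noise_std, k, signal=1.0, trials=20_000):
    noisy = signal + rng.normal(scale=noise_std, size=(trials, k))
    estimates = noisy.mean(axis=1)      # average over k "pixels"
    return signal / estimates.std()     # empirical SNR of the averaged estimate
```

Averaging 16 pixels instead of 1 should yield roughly 4x the SNR at the same noise level.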
Why do it this way?
Doing it this way means the network doesn't need to learn "here are all the ways dogs can look". Instead, it can learn a factored representation: a dog has a head and a body. The network only needs to learn a very fuzzy representation at this level. Then a head has some eyes and maybe a nose. Again, it only needs to learn a very fuzzy representation and (very) rough relative locations.
So it's only when it gets right down into fine detail that it actually needs to learn a pixel-perfect representation. But this is _way_ easier, because in small areas images have surprisingly low entropy.
The 'text-to-image' bit is just a twist on the basic idea. At the start, when the network is going "dog? or it might be a horse?", we fiddle with the probabilities a bit so that the network starts out convinced there's a dog in there somewhere. At which point it starts making the most likely places look a little more like a dog.
I suppose that static plus a subliminal message would do the same thing to our own neural networks. Or clouds. I can be convinced I’m seeing almost anything in clouds…
Research is still ongoing here, but it seems like diffusion models despite being named after the noise addition/removal process don't actually work because of it.
There's a paper (which I can't remember the name of) that shows the process still works with different information removal operators, including one with a circle wipe, and one where it blends the original picture with a cat photo.
Also, this article describes CLIP being trained on text-image pairs, but Google's Imagen uses an off the shelf text model so that part doesn't seem to be needed either.
If you removed all of the noise in a corrupted image in one step, you would have a denoising autoencoder, which has been around since the mid-aughts or perhaps earlier. Denoising diffusion models remove noise a little bit at a time. Think about an image which only has a slight amount of noise added to it. It’s generally easier to train a model to remove a tiny amount of noise than a large amount of noise. At the same time, we likely introduced a small amount of change to the actual contents of the image.
Typically, in generating the training data for diffusion models, we add noise incrementally to an image until it’s essentially all noise. Going backwards from almost all noise to the original images directly in one step is a pretty dubious proposition.
I was wondering the same and this video [1] helped me better understand how the prediction is used. The original paper isn't super clear about this either.
The diffusion process predicts the total noise that was added to the image. But that prediction isn't great and applying it immediately wouldn't result in a good output. So instead, the noise is multiplied by a small epsilon and then subtracted from the noisy image. That process is iterated to get to the final result.
You can think of it like solving a differential equation numerically. The diffusion model encodes the relationships between values in sensible images (technically in the compressed representations of sensible images). You can try to jump directly to the solution but the result won't be very good compared to taking small steps.
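That numerical-integration view can be sketched as a loop of small Euler-style updates (everything here is a stub: `eps_model` is a placeholder for the trained noise-prediction network, and the step size is purely illustrative):

```python
import numpy as np

# Stub sketch of iterative denoising as small Euler-style steps: at each
# step we subtract only a small multiple of the predicted noise, rather
# than the whole prediction at once. eps_model is a placeholder for the
# trained network; a real sampler also conditions it on the timestep.
def iterative_denoise(x, eps_model, n_steps=50, step=0.02):
    for _ in range(n_steps):
        x = x - step * eps_model(x)   # small step toward less noise
    return x
```

With a toy `eps_model(x) = x`, each iteration contracts x by a factor (1 − step), so the loop converges smoothly instead of jumping; the real network's prediction is of course image-dependent, which is why repeated re-prediction helps.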
I’m pretty sure it’s a stability issue. With small steps the noise is correlated between steps; if you tried it in one big jump then you would essentially just memorize the input data. The maximum noise would act as a “key” and the model would memorize the corresponding image as the “value”. But if we do it as a bunch of little steps then the nearby steps are correlated and in the training set you’ll find lots of groups of noise that are similar which allows the model to generalize instead of memorizing.
1- Forward Diffusion (adding noise, and training the Unet to predict how much noise is added in each step)
2- Generating the image by denoising. This doesn't predict the final image, each step only predicts a small slice of noise (the removal of which leads to images similar to what the model encountered in step 1).
So it is indeed an iterative process in that way, each iteration taking one small step towards the final image.
Dark was infuriating to watch. Visually stunning, great characters and performances, but ultimately an intentionally obfuscated and meandering story that only ends up disappointing you.
It's kind of similar to Westworld season 1 or Lost -- there's this style of storytelling where you drop a bunch of interesting clues to lead you towards possible theories about what's really going on, and then all of those threads are dropped on the floor as if they never mattered. I find it abusive and disrespectful to the viewer.