Humans have far higher resolution sensors and the most advanced computer ever seen behind them. And even that combination fails at a far higher rate than we would accept from self-driving cars.
This is really not true. The phone in your pocket is just plain better: more pixels per unit area, much greater dynamic range, and the ability to sample an order of magnitude faster.
The reason your vision seems better is that your brain is amazingly good at synthesizing a picture of the world around you. But all that data (the sphere around you is something like 150 MP at eye resolution) is an illusion. You're only actually resolving something like a million pixel-equivalents at any instant.
[1] Your eyes come close only in the central half-degree or so of your fovea. Everywhere else your brain gets a blurry mess and has to extrapolate.
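Rough numbers behind the "150 MP sphere" and "million pixel-equivalents" figures, for anyone who wants to check them (the acuity and optic-nerve counts below are common textbook approximations, not measurements from this thread):

    import math

    # Assumed figures (textbook rules of thumb, not from the comment above):
    ACUITY_ARCMIN = 1.0           # ~1 arcminute of resolvable detail at the fovea
    OPTIC_NERVE_FIBERS = 1.0e6    # roughly a million ganglion-cell axons per eye

    # The full sphere around you, in square degrees.
    sphere_sq_deg = 4 * math.pi * (180 / math.pi) ** 2    # ~41,253 deg^2

    # "Pixels" per square degree if the whole sphere were at foveal acuity.
    px_per_deg = 60 / ACUITY_ARCMIN                       # 60 px per degree
    sphere_px = sphere_sq_deg * px_per_deg ** 2

    print(f"sphere at foveal acuity: {sphere_px / 1e6:.0f} MP")         # ~149 MP
    print(f"optic nerve channels:    {OPTIC_NERVE_FIBERS / 1e6:.0f} M")  # ~1 M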
I agree that, ultimately, vision and some IMUs will be all that's needed for an as-perfect-as-possible self-driving car.
Your statement on phone cameras being better than the human eye isn't true today, though.
The human eye has FAR better dynamic range than any camera built to date, at any price point. This matters a great deal on sunny, cloudless days in a city with tall buildings casting dark shadows, for example.
Just as the biological approach of "stupid sensor, smart thinking" works for organisms, the same thing is going to be needed for computer vision as well.
About 1/3 of the human brain is dedicated to vision processing in some way. Think about that for a second. One THIRD of the best, most powerful organic computer known is required for us to see what we see, and we are still fooled by the illusions that trick human eyes. It's going to take a lot of neural network training to duplicate that. Fortunately, the vision skills required to drive are a subset of overall vision capabilities.
> The human eye has FAR better dynamic range than any camera built to date, at any price point.
Not at all true. All you need to do is take your camera into a dark room for proof. It can take useful pictures in environments you can't see in. And with some manual control and safety precautions (seriously, don't actually do this) you'll note you can shoot useful photos of things that are very near bright sources like the sun, where your eyes would be completely useless (and irreversibly damaged, again, do not try this).
What you're complaining about isn't dynamic range; it's exposure control. A human brain, again, makes much better decisions about what parts of the environment are "important" when setting the aperture (iris). So the stuff you want to see is visible, whereas the camera is mostly just going to guess that the center of the frame is what you want and will routinely leave things over- or underexposed.
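To make "the camera guesses the center of the frame is what you want" concrete, here's a toy sketch of center-weighted auto-exposure (my own illustration, not any particular camera's metering algorithm):

    import numpy as np

    def center_weighted_exposure(frame, target=0.18):
        """Pick an exposure gain so the center-weighted mean lands on a
        mid-grey target. 'frame' is a 2-D array of linear luminance."""
        h, w = frame.shape
        ys, xs = np.mgrid[0:h, 0:w]
        # Gaussian weight centered on the middle of the frame.
        weight = np.exp(-(((ys - h / 2) / (h / 4)) ** 2 +
                          ((xs - w / 2) / (w / 4)) ** 2))
        metered = np.average(frame, weights=weight)
        return target / metered

    # A scene where the important thing (say, a pedestrian in shadow at the
    # edge of frame) is 1000x darker than the sunlit center.
    scene = np.full((100, 100), 1.0)     # bright, sunlit background
    scene[40:60, 0:10] = 0.001           # dark subject at the left edge

    gain = center_weighted_exposure(scene)
    print(f"exposure gain: {gain:.3f}")               # driven by the bright center
    print(f"subject ends up at: {0.001 * gain:.5f}")  # hopelessly underexposed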
Your last paragraph has the correct analysis but the wrong conclusion. It's the vision processing that makes the difference. The optical systems of a semiconductor camera absolutely are better, so a control system based on optics can absolutely be better.
Dynamic range is the ability to see things in both direct sunlight AND in dark shadows at the same time. You need to be able to see pedestrians that are both directly lit by the sun and in the darkest shadows in the same frame of video.
No long exposures, no bracketing of multiple exposures -- a single frame. I don't know of any 24-32-bit per channel sensors, and that's what you'd need.
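For reference, the arithmetic that leads to numbers like that, assuming a plain linear encoding (my sketch of the reasoning, not the poster's):

    import math

    scene_stops = 20                    # assumed sun-to-deep-shadow contrast
    contrast_ratio = 2 ** scene_stops   # about a million to one

    # A linear encoding needs roughly one bit per stop just to span the range,
    # plus a few extra bits so the darkest stops aren't crushed into a handful
    # of code values -- hence figures like "24-32 bits per channel".
    bits_to_span_range = math.ceil(math.log2(contrast_ratio))
    print(bits_to_span_range)           # 20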
We agree on vision processing. I thought I made it clear that it's a strong "backend" to vision (the brain or compute behind it) that makes it work so well, and that will be the case with good self-driving cars as well. I must have misstated something.
> I don't know of any 24-32-bit per channel sensors, and that's what you'd need.
Dynamic range doesn't require bit depth. You can have an 8-bit sensor with 20 stops of dynamic range. You'll lose color/luminance resolution, of course, but as long as the value 255 captures a light level 2^20 times stronger than the dimmest level the sensor can distinguish, that's 20 stops of dynamic range.
But more than that theoretical point, there are already sensors pushing well beyond what humans can do in terms of dynamic range. Apparently humans have a respectable 10 stops, and sensors are already in the 15 to 20 stop range and pushing beyond it.
To map it into 16-bit values you can just use a curve to distribute the bit depth unevenly across the dynamic range. Older DSLRs did that to get perfectly usable images with just 10 bits per channel.
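A minimal sketch of that kind of curve -- a pure log encoding here, assuming a 20-stop scene; real cameras use fancier tone curves, but the principle is the same:

    import numpy as np

    STOPS = 20                    # dynamic range we want to preserve
    BITS = 16                     # output bit depth
    floor = 1.0 / 2 ** STOPS      # dimmest level we care about, in linear light

    def encode(linear):
        """Map linear light in [floor, 1.0] onto 16-bit codes, spending the
        same number of codes on every stop instead of wasting almost all of
        them on the brightest stop (which is what a linear mapping does)."""
        linear = np.clip(linear, floor, 1.0)
        log_norm = np.log2(linear / floor) / STOPS          # 0..1 over 20 stops
        return np.round(log_norm * (2 ** BITS - 1)).astype(np.uint16)

    # The darkest and brightest stops each get ~3,277 distinct codes, even
    # though their linear values differ by a factor of about a million.
    print(encode(np.array([floor, 2 * floor, 0.5, 1.0])))   # 0, 3277, 62258, 65535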
You're fooling yourself if you think your eyes can do that. They can't. More to the point, your brain is fooling you, by synthesizing a perception of reality, the same way it does to make you think the stuff in your peripheral vision is as sharp as the things in the middle.
Vertebrate eyes don't have 24 bits of sensitivity; that's just insane. Rod cells are neurons: they either fire or they don't, they can fire at most at about 10 Hz, and you have about a million of them at most across your whole field of vision. Do the math. What you say isn't even physically possible.
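"Do the math" cashes out to something like this (using the figures claimed above, which are disputed later in the thread -- this just works out what they imply):

    # Figures claimed above: ~1e6 binary "pixels" firing at at most ~10 Hz.
    cells = 1_000_000
    max_rate_hz = 10

    raw_bits_per_s = cells * max_rate_hz
    print(f"claimed retina bandwidth: {raw_bits_per_s / 1e6:.0f} Mbit/s")       # 10

    # Compare with an uncompressed 1080p stream at 24 bits per channel, 30 fps.
    video_bits_per_s = 1920 * 1080 * 3 * 24 * 30
    print(f"24-bit/channel 1080p video: {video_bits_per_s / 1e9:.1f} Gbit/s")   # ~4.5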
You're correct that the brain fills in a lot that isn't there, making you think your vision is a lot better than it is. It doesn't change much, though; mostly it fills in gaps.
10Hz? Simply not true. Fighter pilots can identify aircraft shown on a screen for 1/250th of one second. Regular schmoes can see the difference between 30Hz and 60Hz video easily.
I am not arguing about pixel count of a camera vs the center of vision of human eyes.
I'm saying that... Nevermind. I've already said it several times and apparently you have a PhD in all things vision.
Being able to recognize aircraft on a screen for 1/250th of a second just means the human eye/brain has persistence, not actual useful 250Hz bandwidth.
You're conflating dynamic range and exposure control. Your examples of dark rooms and bright light sources are examples of great exposure control, not dynamic range. Dynamic range is the difference in light intensity from pure black to pure white in your sensor. Modern sensors are also extremely good at dynamic range, though, much better than film and apparently much better than our eyes as well. From a quick search, our eyes are at around 10 stops of range (a 2^10 ratio from white to black) and modern sensors have gone past 15 stops and are pushing on 20, which is just amazing.
It’s absolutely false. The eye simulates a wide dynamic range by continuously changing the pupil size.
If you take a camera with a lens that can go from f/1.4 to f/32, you have a much, much bigger usable range, especially considering that you can also vary the exposure time.
Comparing the eye, with its continuously changing pupil size, to a single image taken at a fixed aperture and a fixed exposure is totally incorrect.
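Rough numbers for that comparison (a generic lens and shutter range, my own example values):

    import math

    # Light gathered scales with the area of the aperture, i.e. with 1/N^2
    # for f-number N, so the stop range from aperture alone is:
    stops_aperture = math.log2((32 / 1.4) ** 2)           # f/1.4 down to f/32
    print(f"aperture range: {stops_aperture:.1f} stops")  # ~9.0 stops

    # Add the shutter: say 1/8000 s up to a 30 s exposure.
    stops_shutter = math.log2(30 * 8000)
    print(f"shutter range:  {stops_shutter:.1f} stops")   # ~17.9 stops

    # Stacked on top of a sensor's own ~15-stop single-exposure range, the
    # adjustable camera system covers vastly more than any single frame --
    # which is exactly the apples-to-oranges comparison being objected to.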
Have you... actually tried that? Night shots with a decent camera sensor work much better than your eyes. You're probably being fooled by the exposure control on your junky consumer phone app. Grab one with manual control over exposure/ISO and shutter speed, and you'll be shocked.
It's not even a secret. Light capture in a phototransistor[1] is a basically direct process without a whole lot of loss or wasted surface area. Those cone cells are living things and only have a little volume to dedicate to pigment chemistry.
[1] CCDs of course can do an order of magnitude better still, and those are easily cheap enough to put in a car.
How are you arguing that CCDs are better than human eyes? Take video at night and compare. Why do you think Hollywood, with its 100k+ camera rigs, still needs to add artificial light everywhere, especially in night scenes? Eyes are incredible at dynamic range, resolution, and response rate. They have to be, due to evolution.
If you want to see how consumer cameras can take night video much better than your eyes, look up Sony A7S video. Modern cameras have such clean output at high ISO that you can turn night into day with great quality.
Modern consumer cameras have far surpassed what our eyes can do in all the metrics you describe (dynamic range, resolution, response rate) and many others. It's not even close.
Conversely, computers don't get tired, or "distracted", or drunk. They're also faster - human cognition runs about 150-200 ms behind what you actually see. Some reflexes run faster than that, but overall the human platform is not being used with that supercomputer in full control while driving at high speed - our hardware isn't capable of it.
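To put that lag in driving terms (rough illustration; the 150-200 ms figure is from above, the speed is just an example):

    speed_kmh = 110                       # example highway speed
    latency_s = 0.2                       # ~200 ms of perceptual lag

    speed_m_per_s = speed_kmh * 1000 / 3600          # ~30.6 m/s
    lag_distance = speed_m_per_s * latency_s
    print(f"{lag_distance:.1f} m traveled before the scene is even perceived")
    # ~6.1 m -- more than a car length, before any braking decision starts.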
The thing we most bring to driving is navigating complex, low-speed environment changes - but we're not great at that either (see the number of toddlers run over in their driveways, for example).
The resolution outside of the fovea is pretty poor. It's a little unclear what portion of accidents we could cut out if we had human-level performance but never let attention waver and disallowed reckless driving. Judging from the NMVCCS, it'd cut out somewhere between 34% and 75% of accidents.
I'm not threeseed, but I'm pretty sure they meant what they said.
Expanded version of the argument: "Karpathy says 'people drive vision-only' and apparently intends that to convince us that vision-only is good enough. But (1) those people driving vision-only are using human brains, whose abilities we have not yet come close to duplicating, and (2) even with the astonishing abilities of the human brain, those people driving vision-only make a lot of mistakes and have a lot of accidents, and we want self-driving cars to be much safer than that. So the fact that people drive vision-only is no reason to think that self-driving cars should do likewise. They're trying to be safer than people, with less computing power; why shouldn't we make up for that by giving them extra sensors?"