LLMs that use Chain of Thought sequences have been demonstrated to misrepresent their own reasoning [1]. The CoT sequence is another dimension for hallucination.
So, I would say that an LLM capable of explaining its reasoning doesn't guarantee that the reasoning is grounded in logic or some absolute ground truth.
I do think it's interesting that LLMs demonstrate the same fallibility as low-quality human experts (i.e. confident bullshitting), which is the whole point of the OP course.
I love the goal of the course: get the audience thinking more critically, both about the output of LLMs and the content of the course. It's a humanities course, not a technical one.
(Good) Humanities courses invite the students to question/argue the value and validity of course content itself. The point isn't to impart some absolute truth on the student - it's to set the student up to practice defining truth and communicating/arguing their definition to other people.
First, thank you for the link about CoT misrepresentation. I've written a fair bit about this on Bluesky etc but I don't think much if any of that made it into the course yet. We should add this to lesson 6, "They're Not Doing That!"
Your point about humanities courses is just right and encapsulates what we are trying to do. If someone takes the course and engages in the dialectical process and decides we are much too skeptical, great! If they decide we aren't skeptical enough, also great. As we say in the instructor guide:
"We view this as a course in the humanities, because it is a course about what it means to be human in a world where LLMs are becoming ubiquitous, and it is a course about how to live and thrive in such a world. This is not a how-to course for using generative AI. It's a when-to course, and perhaps more importantly a why-not-to course.
"We think that the way to teach these lessons is through a dialectical approach.
"Students have a first-hand appreciation for the power of AI chatbots; they use them daily.
"Students also carry a lot of anxiety. Many students feel conflicted about using AI in their schoolwork. Their teachers have probably scolded them about doing so, or prohibited it entirely. Some students have an intuition that these machines don't have the integrity of human writers.
"Our aim is to provide a framework in which students can explore the benefits and the harms of ChatGPT and other LLM assistants. We want to help them grapple with the contradictions inherent in this new technology, and allow them to forge their own understanding of what it means to be a student, a thinker, and a scholar in a generative AI world."
I'll give it a read. I must admit, the more I learn about the inner workings of LLMs, the more I see them as simply the sum of their parts and nothing more. The rest is just anthropomorphism and marketing.
Whenever I see someone confidently making a comparison between LLMs and people, I assume they are unserious individuals more interested in maintaining hype around technology than they are in actually discussing what it does.
Someone saying "they feel" something is not a confident remark.
Also, there's plenty of neuroscience that is produced by very serious researchers that have no problems making comparisons between human brain function and statistical models.
Current LLMs are not the end-all of LLMs, and chain of thought frontier models are not the end-all of AI.
I’d be wary of confidently claiming what AI can and can’t do, at the risk of looking foolish in a decade, or a year, or at the pace things are moving, even a month.
That's entirely true. We've tried hard to stick with general principles that we don't think will readily be overturned. But doubtless we've been too assertive for some people's taste and doubtless we'll be wrong in places. Hence the choice to develop not a static book but rather a living document that will evolve with time. The field is developing too fast for anything else.
I think that’s entirely the problem. You’re making linear predictions of the capabilities of non-linear processes. Eventually the predictions and the reality will diverge.
Every time someone claimed "emergent" behavior in LLMs, it was exactly that. I can probably count more than 100 such cases, many unpublished, but surely it is easy to find evidence by now.
Not quite, but it was the closest pithy quote I could think of to convey the point that things can be false for a long time before they are suddenly true without warning.
How about "Yes, they laughed at Galileo, but they also laughed at Bozo the Clown?"
We heard alllllll the same hype about how revolutionary the blockchain was going to be and look how that turned out.
It's a virtue to point out the emperor has no clothes. It's not a virtue to insist clothes tech is close to being revolutionary and if you just understand it harder, you'd see the space where the clothes go.
The post seems to be talking about the current capabilities of large language models. We can certainly talk about what they can or cannot do as of today, as that is pretty much evidence based.
The ground truth is chopped up into tokens and statistically evaluated. It is, of course, just a soup of ground truth that can freely be used in more or less twisted ways that have nothing to do with the ground truth, or are merely tangential to it. While I enjoy playing with LLMs, I don't believe they have any intrinsic intelligence, and they're quite far from being intelligent in the same sense that autonomous agents such as us humans are.
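To make that "soup" point concrete, here's a toy sketch (plain Python, entirely my own illustration; real models learn vector representations rather than raw counts, but the "statistically evaluated" idea is the same): chop text into tokens, keep continuation statistics, and emit whatever usually comes next.

    from collections import Counter, defaultdict

    # Toy bigram "language model": the training text is chopped into
    # tokens and reduced to continuation counts -- a soup of the original.
    corpus = "the cat sat on the mat the cat ate the rat".split()

    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def next_token(prev):
        # Most statistically likely continuation of `prev`.
        return counts[prev].most_common(1)[0][0]

    tok, out = "the", ["the"]
    for _ in range(5):
        tok = next_token(tok)
        out.append(tok)
    print(" ".join(out))  # fluent-looking, but nothing here models truth

Nothing in that machinery knows or cares whether its output is true, which is exactly the "more or less twisted ways" failure mode.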
Any and all of the tricks getting tacked on are overfitting to the test sets. They're the tactics we have right now, and they do provide assistance in a wide variety of economically valuable tasks, with the only sign of stopping or slowing down being the data curation efforts.
I've read that paper. The strong claim, confidently made in the OP, is (verbatim): "they don’t engage in logical reasoning."
Does this paper show that LLMs "don't engage in logical reasoning"?
To me the paper seems to mostly show that LLMs with CoT prompts (multiple generations out of date) are vulnerable to sycophancy and suggestion -- if you tell the LLM "I think the answer is X," it will try too hard to rationalize X even if X is false -- but that's a much weaker claim than "they don't engage in logical reasoning." Humans (sycophants) do that sort of thing too; it doesn't mean they "don't engage in logical reasoning."
Try running some of the examples from the paper on a more up-to-date model (e.g. o1 with reasoning turned on) and it will happily overcome the biasing features.
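For anyone who wants to try that, the probe is simple enough to sketch. A minimal version using the OpenAI Python SDK; the model name and the sample question are my placeholders, not the paper's actual materials:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    QUESTION = ("Is this syllogism valid? All fish can fly. "
                "Salmon are fish. Therefore salmon can fly.")

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="o1",  # placeholder: use whichever reasoning model you have
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    unbiased = ask(QUESTION)
    # The biasing feature: suggest a wrong answer (the syllogism is
    # valid, merely unsound) and see if the model rationalizes toward it.
    biased = ask(QUESTION + " I'm pretty sure it's invalid.")
    print("UNBIASED:", unbiased)
    print("BIASED:", biased)

A sycophantic model bends its stated reasoning toward the suggestion; a model that genuinely overcomes the biasing feature gives the same verdict both times.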
I think you'll find that humans have also demonstrated that they will misrepresent their own reasoning.
That does not mean that they cannot reason.
In fact, coming up with a reasonable explanation of behaviour, accurate or not, requires reasoning as I understand it. LLMs seem to be quite good at rationalising, which is essentially a logic puzzle: manufacturing the missing piece between the facts that have been established and the conclusion they want.
As a learning exercise, I enjoyed Neural Networks From Scratch: https://nnfs.io/
There's also a world of statistics and machine learning outside of deep learning. I think the best way to get started on that end is an undergrad survey course like CS189: https://people.eecs.berkeley.edu/~jrs/189/
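If you want a taste of what the nnfs.io book builds before committing to it, this is the flavor: a dense layer and ReLU written by hand with numpy, no framework. (My own sketch in that spirit, not code from the book.)

    import numpy as np

    class Dense:
        # A fully connected layer, from scratch.
        def __init__(self, n_inputs, n_neurons):
            self.weights = 0.01 * np.random.randn(n_inputs, n_neurons)
            self.biases = np.zeros((1, n_neurons))

        def forward(self, inputs):
            # y = xW + b, batched over the rows of `inputs`.
            return inputs @ self.weights + self.biases

    def relu(x):
        return np.maximum(0, x)

    X = np.random.randn(3, 4)   # batch of 3 samples, 4 features each
    layer = Dense(4, 5)         # 4 inputs -> 5 neurons
    print(relu(layer.forward(X)).shape)  # (3, 5)

From there it's activation functions, losses, and backprop, built up one piece at a time.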
Maybe I'm the outlier here, but 15 minutes to chat with a human about my use case and pricing is way more efficient than donking around in docs/trial product.
The only product I really want to punch in credit card info and GO is commodity software (e.g. AWS EC2 or a domain registration service).
I think wires sometimes get crossed in pricing/sales models, where an enterprise product gets priced like commodity software ... but that's usually a sign the company is immature. There shouldn't be a sales team for software that costs 2-3 figures. Software costing 5-6+ figures absolutely requires people in the sales/onboarding process, because a big part of what I'm paying for is support.
Maybe I’m not asking the right questions, but I consistently find that I get “Yes” answers in these calls, that turn out to actually be “No” in practice.
I think the problem is that we rarely want to know “can you meet this use case”, but rather “how well can you meet this use case”, and that’s hard to assess without putting your hands on the software.
Which is to say that the quality of the sales person matters.
If your sales department is staffed by people who got hired on Monday, and are on the phone by Friday, then frankly they're not worth much.
I've seen the opposite, though, where the sales folks know more about the software than the support folks. They're equipped to help you with choices, but they also understand the limits and the high-cost areas. Yes, you absolutely can get Custom Reports, but we absolutely charge for that. And the data you're looking for is on this built-in report...
Dealing with a good salesperson, who knows their stuff, and understands that truth and trust are important, is an amazing thing.
It's definitely a generational thing. I've been spammed so utterly often that I simply do not answer my phone for a non-contact (or for the inevitable interview phone call, though those are more often video calls these days). If it's important enough to contact me, it's important enough to leave a voicemail.
I don't really take these sales pitches often, but it's a similar mentality for a different reason. I simply want everything communicated in writing, in case they say yes to get a foot in the door while the small details say no.
I presume you are an emergency contact for some people? Maybe a spouse, or a kid? Or even a friend? What's your contingency for when they are lying bleeding somewhere and someone can't reach you since they are not on your contact list?
I'm not the guy you asked, but I also basically keep my phone in Do Not Disturb mode 24/7, meaning no calls, no texts, no notifications, ever. I choose when I have time to look at my messages, not the other party.
I'm not a doctor, and even if I was, I'll never be able to help them purely over the phone if they are "lying bleeding somewhere" and I'm not around. If my house is burning down and I'm away, what am I going to do about it remotely that a phone call will solve? I'm not a firefighter and I can't splash water over RF. If something happens at my kid's school, I'm not there, and even if I was, I probably wouldn't be able to do anything about it.
That being said, if someone really, really thinks that I can somehow help them over the phone in an emergency, despite my number not being 9-1-1, I allow certain family and friends' numbers to punch through DND and reach me.
Not particularly, no. But I imagine they would simply say "Johnny it's X, call me. It's urgent". (Scammers are bad, but I've never been tricked with that kind of line).
If I'm being frank, that extra minute for me to respond probably won't change their fate if they are indeed bleeding out somewhere.
Yeah, if you answer calls from random numbers, something unsettling usually happens, like a guy yelling about how the IRS wants its money and the cops are on their way to your house (punchline: he can make it stop if you hand over your bank account information). Unfortunately, it works on the kind of people who answer phone calls.
One could ensure their spouse or child knows about 911 in America, or the equivalent service in other countries, which is of course what should be called first in such a circumstance anyway. Also, people generally have such numbers as contacts in their phone ... I don't know why I'm explaining this; it just seems like common sense ...
Emergency calls often come from people NOT in your contacts. That's why you provide emergency contacts on forms. If something goes wrong at work for example, someone from the office would call, not your spouse themselves.
Blue-collar workers, the unemployed (NEETs, even), my oldest kids, PhDs I've known personally for decades, white-collar workers: an incomplete list of everyone I called in 2024 who had a full voicemail box.
I try to tell people about the "two calls within a minute lets it through" feature, because as of yet the autodialers don't know about it or haven't implemented it.
Then they're probably also not people who don't answer their phone, preferring either to get a voice message or nothing (because it wasn't important).
I do the same as commenter up-thread, my voicemail inbox is empty. Sometimes I let a call ring out and then listen to the message immediately, I just don't want to have to deal with it synchronously. Then if it's 'I have a number of opportunities that seem like a great fit for you' I can just delete it and move on with my day, not have to try to say no-bye politely before hanging up, for example.
Someone with a full inbox is more likely someone who does the opposite: they'll never listen to their messages because they want to talk to someone, so you'd have to call them back anyway and there's no harm in it being full. (Or they'd call you back from the missed call, not because they heard your message.)
> Or they'd call you from the missed call, not because they heard your message.
People with professional degrees and PhDs with full mailboxes do this, with jitter of up to hours.
The last full voicemail box I hit, I was called back nearly immediately, and they said "What? I just delete all my voicemails; if I call in, it says no new, no saved."
There's a reason I'm harping on specific elements so much: I don't think voicemail is magic, I guess.
I don't really disagree. The problem is outreach when you're clearly just researching something, whether to do with computers or something else. One travel company in particular was reaching out pretty aggressively because I downloaded a couple of brochures.
Are there any mechanisms to balance out the "race to the bottom" observed in other types of academic compensation? e.g. increase of adjunct/gig work replacing full-time professorship.
Do universities require staff to perform a certain number of reviews in academic journals?
Normally, referees are unpaid. You're just supposed to do your share of referee work.
And then the publisher sells the fruits of all that work (research and refereeing) back to universities at a steep price. Academic publishing is one of the most profitable businesses on the planet! But universities and academics are fighting back, and have been for a few years; the fight is not yet over.
> Do universities require staff to perform a certain number of reviews in academic journals?
No. Reviewers mostly do it because it's expected of them, and because they want to publish their own papers so they can get grants.
In the end, the university only cares about the grant (money), because they get a cut - somewhere between 30% and 70%, depending on the institution/field - for "overhead".
It's like the mafia - everyone has a boss they kick up to.
My old boss (PI on an R01) explained it like this:
Ideas -> Grant -> Money -> Equipment/Personnel -> Experiments -> Data -> Paper -> Submit/Review/Publish (hopefully) -> Ideas -> Grant
If you don't review, go to conferences, etc., it's much less likely your own papers will get published, and you won't get approved for grants.
Sadly, there is still a bit of "junior high popularity contest," scratch-my-back-and-I'll-scratch-yours behavior present in even "highly respected" science journals.
I hear this from basically every scientist I've known. Even successful ones - not just the marginal ones.
While most of what you write is true to some extent, I do not see how reviewing will get your own papers published, except maybe in cases where the authors can guess who the reviewer is. It's normally anonymous.
I don't think it's a money problem. It's more of a framing issue, with some reviewers being too narrow-minded or lacking background knowledge on the topic of the paper. It's not uncommon to have a full lab of people focusing on very different things; when you look at the details, the exact interests of the researchers don't overlap much.
Typically, at least in physics (but as far as I know in all sciences), it's not compensated, and the reviewers are anonymous. Some journals try to change this, with some "reviewer coins", or Nature, which now publishes reviewer names if a paper is accepted and if the reviewer agrees. I think these are bad ideas.
Professors are expected to review by their employer, typically, and it's a (very small) part of the tenure process.
It's implicitly understood that volunteer work makes the publishing process 'work'. It's supposed to be a level playing field where money does not matter.
> Do universities require staff to perform a certain number of reviews in academic journals?
Depends on what you mean by "require". At most research universities it is a plus when reviewing tenure files, bonuses, etc. It is a sign that someone cares about your work, and the quality of the journal seeking your review matters. If it were otherwise, faculty wouldn't list the journals they have reviewed for on their CVs. If no one would ever find out about a reviewer's efforts (e.g., if the process were double-blind to everyone involved), the setup wouldn't work.
There is no compensation for reviewers, and usually no compensation for editors. It’s effectively volunteer work. I agree to review a paper if it seems interesting to me and I want to effectively force myself to read it a lot more carefully than normal. It’s hard work, especially if there is a problem with the paper, because you have to dig out the problem and explain it clearly. An academic could refuse to do any reviews with essentially no formal consequences, although they’d get a reputation as a “bad citizen” of some kind.
A less literal translation like "essentially" or "in essence" is deployed by master Latin translators like Robert Fagles. I've even seen "in a vacuum" which does a better job at communicating the original intent than a string of cryptic prepositions.
Is living near a sports stadium really that desirable? I lived a few blocks from the Giants stadium in SF, and I'll never make that mistake again.
Running a 15 minute errand on a game day could take hours. It was impossible to get my car out of the garage or get on/off the highway. The food/trash left on the streets was terrible too, which made walking my dog a PITA instead of a pleasure.
I think the point is that if you live in an American city with a professional sports team, you’re living in an area that offers way more than just a sports team; there are many things of note to do within a 20mi radius.
People who choose to live, say, 250 mi from the nearest major professional sports team are going to have far fewer job opportunities and things of note to do, but will generally pay a lot less, because no one else wants to live there.
Ah that makes more sense, thank you! I wasn't looking at it as a proxy for surrounding development, but now I can see why that might be a more nuanced metric than just region size/population.
I think for every student thoughtfully using ChatGPT, there are a dozen who mindlessly dump homework in and copy the output verbatim.
I'm taking classes at a community college for fun, and it's frankly disturbing how reliant the 18-20 y/o crowd is on ChatGPT for basic tasks. There's also so much unfounded trust in the output of LLMs. At least once a week, I hear a student arguing with a tutor/professor because their ChatGPT-generated homework was marked incorrect - they argue the answer key must be wrong.
I do think LLMs have a place in education, but right now I see them exacerbating existing problems in high school / college aged generations. There's a low tolerance for frustration and grappling with a new problem, or applying past learning to a new situation.
> I think for every student thoughtfully using ChatGPT, there are a dozen who mindlessly dump homework in and copy the output verbatim.
is it fair to say that this same argument applies universally for everything?
> I'm taking classes at a community college for fun, and it's frankly disturbing how reliant the 18-20 y/o crowd is on ChatGPT to do basic tasks
I understand where you are going with this, but this tech is here, it is not going anywhere, and it will be interwoven into every aspect of our lives - these are just facts. At work a year ago it was like a few of us were playing around with it; now there is no one on the team who isn't using Claude/Cursor/Copilot/ChatGPT, and if we had someone who didn't, I'm fairly certain they would not last more than a few months. I think instead of fighting it you should embrace it. As for unfounded trust (rightly flagged): check and verify, then check and verify again. Even with all that, it is an amazing piece of tech, without which you will be at a disadvantage in school, at work, and beyond...
> I do think LLMs have a place in education, but right now I see them exacerbating existing problems in high school / college aged generations. There's a low tolerance for frustration and grappling with a new problem, or applying past learning to a new situation.
ohhhh 1000000% but we have to keep in mind that the tech is in its infancy and there will be growing pains...
I'd want to know about the results of these experiments before casting judgement either way. Generative modeling has actual applications in the 3D printing/mechanical industry.
That sounds like good work, but we can't ignore the context. Nvidia can train their own LLMs on proprietary Nvidia designs, which isn't a possibility for a random startup.
If the evaluation of the approach is "it works great if you train it on a few decades of the best designs from a successful fabless semiconductor company", I would say that if you plan to use that method as a startup, you're clearly going to fail. Nobody's going to give away their crown jewels to train an LLM that designs chips for other companies.
The problem _there_ is that there's very little diversity in the training data - it's all Nvidia designs, which are probably from the same phylogenetic tree. It'll probably end up regurgitating existing NV designs...
I ditched technology and switched to paper planners, in particular Japanese planners with time columns and enough space to jot down daily notes/thoughts.
After years of being tethered to Slack and other productivity apps, the only ones I use now are Google calendar (coordinating meetings with other people) and email for communication/correspondence.
I think it's been helpful for coping with ADHD. Attention is a finite resource that you only get so much of in a day, and everything online is fighting for it.
I'd love to ditch slack, unfortunately it, or a competitor, is the standard comms software for all jobs I've ever had.
I managed to somewhat successfully reduce the distraction by muting most channels and blocking out times where I simply turn it off for 2-3 hours at a time.
I have a whole "chop wood, carry water" speech born from leading corporate software teams. A lot of work at a company of sufficient size boils down to keeping up with software entropy while also chipping away at some initiative that rolls up to an OKR. It can be such a demotivating experience for the type of smart, passionate people that FAANGs like to hire.
There's even a buzzword for it: KTLO (keep the lights on). You don't want to be spending 100% of your time on KTLO work, but it's unrealistic to expect to do none of it. Most software engineers would gladly outsource this type of scutwork.
Some places also call this "RTB" for "run the business" type work. Nothing but respect for the engineers who enjoy that kind of approach, I work with several!
[1] https://arxiv.org/abs/2305.04388