URSABLOG: I Don’t Know
In the run-up to exam season at the Institute of Chartered Shipbrokers, I became alarmed at the number of practice essays written by students that were similar to each other, even down to the mistakes they made. In other cases some of the essays were so spectacularly wrong – clauses in standard forms misnumbered or completely misdescribed – that it was almost laughable. What was going wrong? And then I realised – AI was going wrong. Whatever ChatGPT was drinking, I wanted some, so crazy were some of the hallucinations.
For all the fears of AI changing the world, not enough is being said about how wrong it can be. I am continually told that it’s my fault, that I’m asking the wrong questions, or asking them the wrong way. But that is hardly an excuse: a human being can interpret a question flexibly and still understand it, or can ask supplementary questions. AI – or the form I am most used to, ChatGPT – always comes up with an answer, whatever it is, and because it has to come up with something – anything – it often gets it wrong. If I ask a question that a human being can answer – and get right – and AI gets the same question wrong, who do you trust more?
Of course sometimes the human being gets it wrong, and in doing so has to suffer the consequences – fails the exam, misses the deal, doesn’t get the promotion – but AI doesn’t have to suffer the same ignominy of failure. It just seems to smile at you and dare you to prove it wrong. And if you challenge it, it will say something like “Ah, now that you’ve clarified it, I understand what you mean” or “You were right to pull me up on that, and I will never do it again.” Until the next time. Either passive-aggressive pushback, or fickle repentance. Neither works for me.
But the reason I can spot AI-generated essays? It’s the absolute confidence with which the factually incorrect statements are made.
“It’s early days,” people say, or worse, “It just hasn’t learnt enough yet; it will get better.” My reply is that if it’s trying to get better, AI should say “I don’t know” or, even better, “Can you rephrase that for me? What are you trying to say?” Instead it’s like a desperate would-be lover, anxiously determined to please, losing all sense of dignity in the process and becoming less attractive the more it tries to dig itself out of its own hole. Sooner or later this chutzpah wears thin. “All mouth and no trousers” may be amusing and novel at the start of a relationship, but after a while it gets boring, and people switch off and walk away, or worse, don’t engage at all.
It’s not as if the techies don’t know about this. In a short paper last month, a team from OpenAI and the Georgia Institute of Technology proved that even with flawless training data, LLMs can never be all-knowing. One reason is that some questions are simply unanswerable. But the root problem, the researchers found, may lie in how their performance is measured: the benchmarks used reward confident guesses and penalise honest uncertainty.
I keep coming across people who are trying to feed data into Large Language Models (LLMs) to try to assess market sentiment and make more accurate predictions. Philosophically – and practically – this seems like a blind alley. Surely we know by now that however much we know about what happened in the past, or even about what people are thinking today, it gives us no accurate guide to what will happen next. “Past performance is no guide to future performance” should be ringing some bells.
Of all the work done recently in behavioural economics and forecasting, one thing that came through – to me at least – was that the most accurate forecasters were those willing to absorb new data, accept when they were wrong, and change their minds. However much data LLMs ingest, if their core ambit is to waffle on like an over-confident undergraduate who thinks he knows better simply because the thought occurred to him – because that is how their performance is measured – then how much can we really expect of them?
Let’s take shipping as an example. Are you seriously telling me that an LLM fed on a diet of TradeWinds, Lloyd’s List and shipbrokers’ reports will be able to forecast better? The data that goes into these publications is hardly infallible: multiple parties with their own interests are pushing their own point of view all the time. This is opinion, spin, manipulation – call it what you will – not hard data. Most shipbrokers’ reports err on the positive, because brokers believe the positive will work to their benefit. They pray for good markets, and so, like true believers everywhere, when a good market is not favouring them they highlight signs of a potentially good market wherever they can find them, ignoring the storm until it is upon them and they have to admit they are drenched through. If there is a lack of pessimistic commentary, or even phlegmatic analysis, what are LLMs supposed to think?
But LLMs cannot think, and they don’t have minds, at least not like ours – which they admit, if somewhat reluctantly, but only if you prompt them.
I like to think of the human mind not as something physical, in the way that the brain is. The brain is the computer, the engine, absorbing and processing huge amounts of data all the time. The mind is something else: it exists in a kind of cloud somewhere between us and the rest of the world. It is the part of us that engages with the rest of the world, the creative (or destructive) part that makes us who we are. We can only be ourselves in the presence of others, however irritating or painful that is.
When you look into someone’s eyes, your mind is already processing what they are thinking, how they are feeling, who they are – even who they are on that particular day – and your eyes (connected to the brain) are informing your mind how to move in the world. The brain powers, processes and stores away this data so your mind can use it. The mind can picture things that don’t exist, create things, hold abstract thoughts and develop them.
Hallucinations – other than those knowingly caused by imbibed substances – are usually due to problems with the brain, not the mind. Hallucinations in AI are caused by how it is programmed to answer questions. The fault lies not necessarily in the ‘learning’ but in what it has been programmed to do. It turns out that OpenAI and the others believe their LLMs must produce an answer at all costs rather than admit ignorance, which would be a failure of their business model.
Celina Zhao, covering the paper in the journal Science, confirms this:
High benchmark scores translate into prestige and commercial success, so companies often tune their post-training to maximize benchmark scores. However, nine out of the 10 most popular benchmarks the researchers analyzed grade a correct answer as a 1 and a blank or incorrect answer as a 0. Because the benchmark doesn’t penalize incorrect guesses more than nonanswers, a fake-it-till-you-make-it model almost always ends up looking better than a careful model that admits uncertainty.
Equating a blank answer with an incorrect one, or judging a partially correct answer – which can be even more misleading – as better than no answer, is unlikely to lead to better answers in the future, let alone better forecasting.
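The arithmetic behind that grading is easy to sketch. Here is a minimal illustration of my own – not code from the paper, and the one-point penalty for a wrong answer is an assumption – comparing the expected score of a model that guesses with one that leaves a blank:

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected benchmark score if the model guesses, given its chance
    of being right and the points deducted for a wrong answer."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

BLANK = 0.0  # a blank or "I don't know" scores zero under both schemes

for p in (0.1, 0.3, 0.5, 0.7):
    zero_one = expected_score(p, wrong_penalty=0.0)   # the 0/1 grading Zhao describes
    penalised = expected_score(p, wrong_penalty=1.0)  # assumed: a wrong answer costs a point
    print(f"confidence {p:.0%}: guess scores {zero_one:+.2f} under 0/1 grading, "
          f"{penalised:+.2f} with a penalty (blank scores {BLANK:.2f} either way)")
```

Under 0/1 grading a guess always scores something and a blank scores nothing, so bluffing wins even at 10% confidence; dock a point for a wrong answer and “I don’t know” becomes the rational reply whenever confidence falls below a half.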
Which seems to me to be the right time to bring up Socrates: “I seem, then, in just this little thing to be wiser than this man at any rate, that what I do not know I do not think I know either.” Well.
I was sitting outside my favourite wine bar in Piraeus with a friend nursing my bruises after a strenuous Posidonia week when an acquaintance of mine came out and greeted me. He was developing a start-up using AI in shipping somehow or other; I was too weary to enquire deeply. I mentioned the Institute of Chartered Shipbrokers – the mutual connection behind our acquaintance – and he peremptorily dismissed my profession, my vocation, my livelihood:
“Shipbrokers will be a thing of the past very shortly; AI will replace you all very soon.”
And then wandered off into the night.
Apart from my anger – not too strong a word – at the disrespect he showed to the Institute to which we both belonged, it was his assurance that we were all about to become extinct that baffled me, and stayed with me. But – like AI – human beings are perfectly capable of saying stupid things when they don’t have the answer themselves.
Simon Ward
