Coding, trust and genAI

I learned to code in the 1980s for a Computer Science A-level. We used command-line BASIC: my A-level project was to plot and integrate a quadratic function. It took months to write.

Since then I have done no coding beyond hacking about with a few scripts and some HTML, so when the opportunity arose, I took a crash course in Python. Setting aside the wow-factor of Jupyter notebooks, two things struck me:

  1. No GOTO statement!!
  2. Code libraries

The more I think about the use of code libraries in modern coding, the more interesting it is. And it goes a long way to explaining why so many researchers in STEM disciplines have been surprisingly uncritical in their use of LLMs.

In and out of the black box

A function is a mapping of inputs to outputs. Importing a code library gives a programmer access to a large number of functions. The documentation will give the syntax for each and a brief description. How that mapping is achieved is not described. In theory one could go and look at the source code, but it is likely to be very long and written in a language like C++, which most coders don't know. So effectively it is a black box function.
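
For instance (a minimal sketch, assuming scipy is installed), the integration half of my old A-level project is now a single documented call whose inner workings stay hidden:

```python
from scipy.integrate import quad

# f(x) = 3x^2 + 2x + 1: the sort of quadratic my A-level project laboured over
def f(x):
    return 3 * x**2 + 2 * x + 1

# The documentation says quad(func, a, b) integrates func from a to b.
# How that mapping from inputs to outputs is achieved is invisible from here.
value, abs_error = quad(f, 0, 2)
print(value)  # ≈ 14.0, since the antiderivative x^3 + x^2 + x gives 8 + 4 + 2
```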

Which raises the obvious question: why does everyone trust these code libraries?

My interest in this question is not from a security perspective but from an epistemological one. What are the implicit justifications for using code libraries, and if someone relies on those without making them explicit, will they end up trusting other black boxes where they shouldn't, e.g. in LLMs and genAI?

Justifications

From conversations with colleagues in Computer Science, it seems that trusting code libraries boils down to three things:

  1. Checking they give the results you expect
  2. The reputation of the author
  3. Widespread use in the community

Since - absent personal acquaintance - the reputation of the author depends upon the widespread use of this or other code they have written, we really have two:

  1. It seems to work for me
  2. It seems to work for others

Appearance-Reality

Much of my training as a philosopher is in investigating the appearance-reality divide, the gap between how things seem and how they really are, so this is right up my street.

Let's start with the first: it seems to work for me. How many trials do I need to be confident that this appearance tracks reality? Two? Ten? 100? What is a good sample size for a mathematical function which potentially maps an infinite range of inputs to an infinite range of outputs? Who knows, but what matters here is that it is really easy to check that the function is giving the right output, so even if you only run a few trials, you know you can check that it gives the right answer in any particular application down the line.
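
In practice those trials are little more than spot checks against answers I can work out by hand. A minimal sketch, assuming numpy is installed:

```python
import numpy as np

# "It seems to work for me": a handful of trials against hand-worked answers
cases = [
    (np.sqrt, 144.0, 12.0),
    (np.exp,    0.0,  1.0),
    (np.log,    1.0,  0.0),
]
for fn, x, expected in cases:
    assert np.isclose(fn(x), expected), f"{fn.__name__}({x}) gave {fn(x)}"

# Every trial passes, but these are three inputs from an infinite domain.
print("All trials passed")
```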

What about the second? On the one hand, we can say that this dramatically increases the sample size of the trials, so it does justify trust. But on the other, it is a form of the argument from common consent, namely that if everyone agrees then it must be right, which used to be very popular with Christian theologians until they discovered sociology.

genAI

If you use genAI to create an image, you can tell on inspection whether the image is what you want: the prompt-to-image function is easy to evaluate because the appearance-reality gap for images is very small. If it looks good, it is good.

But what if you use genAI to summarise a document or do a literature survey or write minutes or ... In such cases, lots and lots of outputs will seem OK, even good, but actually be inaccurate or misleading. The appearance-reality gap for these cognitive tasks, where truth and accuracy matter, is much larger. And the only way to evaluate whether e.g. this is a good summary of a complex document is to read the document. So genAI hasn't saved any work at all?

The point was well put by Vincent Ginis when he said genAI is good for tasks where the output is hard to generate "and easy to verify".

Now, if you do ask genAI to summarise a PhD thesis, for example, or do a literature survey, it will give very plausible results. Just like a code library. And if you are not trained in disciplines where many very plausible results are actually seriously wrong (and it can take great skill to show they are wrong), you will find this to be good evidence that the black box of genAI is, in effect, a useful code library for a range of cognitive tasks.

Add some #AIhype

I have been to many talks by AI enthusiasts from STEM disciplines who show a graph of how long it took various technologies to reach 200 million users. And ChatGPT comes out at the top (if you ignore Threads). Why does this matter? What point are these speakers trying to make?

This is the argument from common consent at work: genAI tools can only be this popular if they are doing something worthwhile. Of course, that might just be amusing and entertaining people, but since it appears to have lots of practical uses, the inference is that it is doing stuff people want done. Like a popular code library.

But of course the biggest explanation for the take-up of genAI is the hype, coming both from those with commercial interests and from those who should be more cautious and thoughtful: governments, universities, health services. The more we are told that it is going to change everything, the more we believe that the outputs are what they seem, i.e. a machine completing a cognitive task.

That fails to stand up to critical reflection on the gap between appearing to complete a cognitive task (giving a superficially plausible and sometimes adequate answer to a question) and actually completing that task. Don't forget that con artists, human impostors and bullshitters have been exploiting this gap for millennia. There is no need to be taken in if we stop for a second and think about whether everything is as it seems.

