Simple "AI" explanation

Back in the mid to late 80s, I wanted to write a program that I could use to generate character names for my role-playing games. I had recently read about a method that used letter frequency and I wanted to try it out. Describing that program also makes for an extremely simplified explanation of how LLMs work. Call it an MNM, a Minimal Name Model.

I needed to write two programs. The first would take a list of example data and build the dataset, the MNM. The second program would use the data in the MNM to randomly generate names.

Here's the first and very minimal sample name list:

  • SAM
  • SCOTT

We can see that the names all start with S and end with either M or T. S is followed by either A or C, A by M, C by O, O by T, and T by either another T or the end of the name. One way to store this data is to have a list of letters, each with a list of following letters. We're also going to add two fake letters, one to represent the beginning of a name (I chose "[") and the other to represent the end of a name ("]"). So, the new seed list in our highly oversimplified case looks like this:

  • [SAM]
  • [SCOTT]

And the dataset will look something like this:

      A     C     M     O     S     T     ]
[     -     -     -     -     1     -     -
A     -     -     1     -     -     -     -
C     -     -     -     1     -     -     -
M     -     -     -     -     -     -     1
O     -     -     -     -     -     1     -
S     0.5   0.5   -     -     -     -     -
T     -     -     -     -     -     0.5   0.5

Probability table of which letters (columns) follow which other letters (rows)
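For the curious, here is a minimal sketch of what that first program might look like in modern Python. This is not the original 80s code; the function name build_table and the dictionary-of-dictionaries layout are my own assumptions, chosen only to mirror the table above.

```python
from collections import Counter, defaultdict

START, END = "[", "]"  # the two "fake letters" described in the text


def build_table(names):
    """Return {letter: {next_letter: probability}} for a list of names.

    Hypothetical helper for illustration; the layout mirrors the
    rows (current letter) and columns (following letter) of the table.
    """
    counts = defaultdict(Counter)
    for name in names:
        marked = START + name.upper() + END
        # Count every adjacent pair, e.g. "[S", "SA", "AM", "M]".
        for current, following in zip(marked, marked[1:]):
            counts[current][following] += 1

    table = {}
    for letter, followers in counts.items():
        total = sum(followers.values())
        # Turn raw counts into probabilities that sum to 1 per row.
        table[letter] = {nxt: n / total for nxt, n in followers.items()}
    return table


table = build_table(["SAM", "SCOTT"])
for letter in sorted(table):
    print(letter, table[letter])
```

Each printed row should correspond to one row of the table, with the empty cells simply left out.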

Once the first program compiles the above table, we move on to the next program, which uses this data to generate our new list of names. It starts with the beginning "letter" [ and randomly chooses a letter from that row based on the probabilities of the other letters. There's only one option here, so it chooses S. To find the letter following S, it looks at the S row and, based on the probabilities of the letters in that row, chooses either A (50%) or C (50%). Next, it loads the row corresponding to the selected letter and repeats the algorithm until it picks the end "letter" ]. Here is the list of all the names that can be generated from this set of data (duplicates removed):

SAM

SCOT

SCOTT

SCOTTT

SCOTTTT

SCOTTTTT

SCOTTTTTT

...

There are problems with this very simple algorithm and adding more rules can fix some of them. For one, we'd need to add a maximum name length before we get to SCOTTTTTTTTTTTTT....
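The generation loop, including a maximum name length, could be sketched roughly like this. Again, this is an illustration and not the original program: the name generate_name, the max_length default of 8, and the hand-typed TABLE dictionary are all assumptions of mine.

```python
import random

START, END = "[", "]"

# The probability table from the text, typed in by hand.
TABLE = {
    "[": {"S": 1.0},
    "A": {"M": 1.0},
    "C": {"O": 1.0},
    "M": {"]": 1.0},
    "O": {"T": 1.0},
    "S": {"A": 0.5, "C": 0.5},
    "T": {"T": 0.5, "]": 0.5},
}


def generate_name(table, max_length=8, rng=random):
    """Walk the table from "[" until "]" is picked or the cap is hit."""
    letters = []
    current = START
    while len(letters) < max_length:
        row = table[current]
        # Weighted random pick among the letters that may follow.
        current = rng.choices(list(row), weights=list(row.values()))[0]
        if current == END:
            break
        letters.append(current)
    # If the cap is reached first, the name is simply cut off there.
    return "".join(letters)


for _ in range(5):
    print(generate_name(TABLE))
```

Cutting the name off at the cap is the crudest possible rule. Other choices, such as throwing away over-long names and trying again, would work just as well.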

The non-algorithmic problem is that we don't have a wide enough selection of names to seed our dataset. For instance, just adding STEVIE and SAMANTHA to the sample name list would wildly increase the variety of names that could be generated. And they would still all start with S.
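To get a feel for how much the seed list matters, here is a small sketch that lists every name the table could possibly produce, up to a maximum length. The helper names successors and all_names and the cap of 8 letters are my own assumptions. Probabilities are ignored here, since the only question is which names are possible at all.

```python
START, END = "[", "]"


def successors(names):
    """Map each letter to the set of letters that may follow it."""
    follows = {}
    for name in names:
        marked = START + name + END
        for current, following in zip(marked, marked[1:]):
            follows.setdefault(current, set()).add(following)
    return follows


def all_names(names, max_length=8):
    """List every name of at most max_length letters that can be generated."""
    follows = successors(names)
    found = set()
    stack = [(START, "")]  # (last letter picked, name built so far)
    while stack:
        current, prefix = stack.pop()
        for following in follows[current]:
            if following == END:
                found.add(prefix)
            elif len(prefix) < max_length:
                stack.append((following, prefix + following))
    return sorted(found)


small = all_names(["SAM", "SCOTT"])
large = all_names(["SAM", "SCOTT", "STEVIE", "SAMANTHA"])
print(len(small), small)
print(len(large))
```

With the two-name list and a cap of 8 letters, this finds six names (SAM, then SCOT through SCOTTTTT). The four-name list produces many more, every one of them starting with S.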

So, is this how text generation works? Yeah, at a grossly oversimplified level. For instance, real LLMs don't use letter-by-letter probabilities; they predict tokens (words or pieces of words) and take far more context into account than just the previous one. Web crawlers and other tools search out text to feed the program that builds the model. The user-facing applications like Gemini, Copilot, or ChatGPT are the analog of the second program, the one that generates output.

As an exercise for the reader, please consider where the "artificial intelligence" would reside in this application. Here are some possible answers:

  1. Is it in the list of names used to seed the application? (See Arthur C. Clarke's "The Nine Billion Names of God".)
  2. Is it in the first program that processes the initial list of names into the letter probability table?
  3. Is it in the letter probability table?
  4. Is it in the second program that randomly chooses letters based on the probabilities in each row?
  5. Is it, maybe, in the random number generator used by the second program? (This is the most mystical answer, I think.)
  6. Is it in the human programmer who designed and wrote the program? (In which case, I'd ask: What's "artificial" about that?)

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

If you like this (or don't!) and want to leave me a comment, use the guestbook feature in the upper right corner. On a phone, tap the "hamburger" (three-bar) menu button. Please mention which article you're referring to.

I love it when I get an email telling me that someone clicked one or more of the react buttons at the bottom of the post. However, I have no idea who did it. If you're so inclined, leave a guestbook note!

Written and posted and copyrighted on 2025-02-15 by me. Quote me if you want (linking to this post would be nice, too) but no one and no thing have permission to slurp this into any LLM vomit factory as training data.

