
Under the Hood of AI Models: A Simple Guide for Non-Technical Readers

A clear explanation of how AI models work under the hood, from basic math functions to neural networks, written for non-technical readers.

In this post, I will explain AI models inside and out: how they work, the math behind them, and the training and inference processes. All in a simple way, so you don't need any Computer Science knowledge to understand.

What is a Machine Learning model?

You can think of a ML model as a giant math function F(x) = y, where x is the input and y is the output.

Examples:

  • input an image, classify if it is cat or dog
  • input some text (prompt), output some text (answer)
  • input house info, output its predicted price

No matter what the input is, we always convert it to numbers:

  • a digital image is converted to a 3D matrix of numbers
  • text is converted to a sequence of tokens, each token a number
  • house info is converted to numbers too; categories like flat or house become numbers as well

The reason is that math functions can only read numbers. So the input x is a collection of numbers. Similarly, the output y is a collection of numbers, which we then convert to a format humans can understand (text, images, sounds).
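To make this concrete, here is a minimal sketch of how each kind of input becomes numbers. The tiny image, the three-word vocabulary, and the category mapping are all made up for illustration; real models use much larger versions of the same idea.

```python
# Everything the model sees is numbers: pixels, tokens, categories.

# A tiny 2x2 grayscale "image": each pixel is a brightness from 0 to 255.
image = [[0, 255],
         [128, 64]]

# Text as tokens: a (made-up) vocabulary maps each word to a number.
vocab = {"the": 0, "cat": 1, "sat": 2}
tokens = [vocab[w] for w in "the cat sat".split()]
print(tokens)  # [0, 1, 2]

# A category like "flat" vs "house" becomes a number too.
house_type = {"flat": 0, "house": 1}["flat"]
print(house_type)  # 0
```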

F is just a math function. Suppose x is just one number to keep things simple. F(x) could be 5 + 3x + 4x^2. In this case F has 3 params: 5, 3, and 4. When you input x, you get back the output y. A real model's F is way larger than that: when they say gpt-oss-120B has 120B params, it means the function F has 120 billion params instead of 3.
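The 3-param function above can be written in a few lines of code. This is just the toy example from the text, not a real model:

```python
# A tiny "model": a function F with exactly 3 params.
params = [5, 3, 4]  # the 3 params of F(x) = 5 + 3x + 4x^2

def F(x, params):
    a, b, c = params
    return a + b * x + c * x ** 2

print(F(2, params))  # 5 + 3*2 + 4*4 = 27
```

A 120B-param model is the same idea, just with 120 billion numbers in `params` instead of 3.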

Training a model

Consider a simple problem: we want a function F that takes an image and outputs either cat or dog. Then x is an image of a cat or dog and y is 0 or 1, cat or dog. If we have F, we can just input images and get back the classification, saving a lot of time classifying by hand. But how could we find such a function F? By training.

We start with

  • a random F, say 1 + x + x^2, and
  • a set of sample pairs (x, y), which is called training data

and start modifying the parameters to make F(x) ≈ y for all samples in our training data. For example, after training, you get F(x) = 5 + 3x + 4x^2. We hope that F(x) ≈ y not only on the training data but on unseen data too. Otherwise it is not useful because we want F to label all images of dogs and cats, not just the ones it has already seen.
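Here is a minimal sketch of that process for the 3-param function, using gradient descent (one common way to "modify the parameters"; the training data here is made up by sampling the target function):

```python
def F(x, params):
    a, b, c = params
    return a + b * x + c * x ** 2

# Training data: sample pairs (x, y) we want F to fit.
data = [(x, 5 + 3 * x + 4 * x ** 2) for x in [-2, -1, 0, 1, 2]]

params = [1.0, 1.0, 1.0]  # start with a "random" F: 1 + x + x^2
lr = 0.01                 # how much to wiggle the params each step

for step in range(5000):
    # Gradient of the squared error with respect to each param.
    grads = [0.0, 0.0, 0.0]
    for x, y in data:
        err = F(x, params) - y
        grads[0] += 2 * err           # d/da
        grads[1] += 2 * err * x       # d/db
        grads[2] += 2 * err * x ** 2  # d/dc
    # Nudge each param in the direction that reduces the error.
    params = [p - lr * g for p, g in zip(params, grads)]

print([round(p, 2) for p in params])  # ≈ [5.0, 3.0, 4.0]
```

Training a billion-param model works on the same principle, just at enormous scale.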

You may have already heard of “fine-tuning”, which means you start with a model F someone already trained, then slightly modify it with some new sample data you have, and end up with a new model F(x) = 5.1 + 3.3x + 3.9x^2, which is better for your use cases. Fine-tuning is faster and easier than training because you just wiggle the params a little bit.
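A sketch of fine-tuning in the same toy setup: we start from already-trained params instead of random ones, and adjust them on a few new samples (the new target numbers here are hypothetical):

```python
def F(x, params):
    a, b, c = params
    return a + b * x + c * x ** 2

pretrained = [5.0, 3.0, 4.0]  # params someone already trained

# A few new samples from our own use case (hypothetical numbers).
new_data = [(x, 5.1 + 3.3 * x + 3.9 * x ** 2) for x in [-1, 0, 1, 2]]

params = list(pretrained)  # start from the trained model, not from scratch
lr = 0.01
for step in range(2000):
    grads = [0.0, 0.0, 0.0]
    for x, y in new_data:
        err = F(x, params) - y
        grads[0] += 2 * err
        grads[1] += 2 * err * x
        grads[2] += 2 * err * x ** 2
    params = [p - lr * g for p, g in zip(params, grads)]

print([round(p, 2) for p in params])  # ≈ [5.1, 3.3, 3.9] — only wiggled a bit
```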

Training a good F from a random one is manageable if F has 3 params, but with billions of params it seems impossible: there are too many possibilities.

Model architecture

Let’s start with the picture of a cat. Humans know that it is a cat by finding typical features of a cat: ear, eye, nose, tail, whiskers. AI models do the same: they look for the features of a cat. How do they do that? For a feature, say an ear, the model needs to:

  • Step (1). Locate the area of the ear.
  • Step (2). Remove all the background and unnecessary details to identify the shape of the ear.
  • Step (3). Recognize that the shape is similar to the shape of cat ears it has already seen.
  • Step (4). Do the above for other features like eyes, nose, etc. Combining all this info, it finally concludes whether that is a cat or not.

The model does it in quite a brute-force way. For Step (1), it looks everywhere, checking boxes of various sizes. That way, one box will capture exactly the ear, another box will capture the nose, and so on. Many other boxes will not contain relevant info but are still checked. So you can see why AI models take a lot of computation here.

Step (2) is where the most interesting thing happens. With a box containing the cat’s ear, the model applies a “filter” to the box, like you apply filters to an Instagram photo. Then it makes a “cut”: everything darker than a certain level is colored black, and everything else is kept untouched. That is how it removes the background and unnecessary details, hopefully revealing some curve of the ear.

One filter is not enough: it applies hundreds of different filters, and filters on top of filters on top of filters. That is where the term Deep Learning comes from: the neural network is deep, with layer upon layer, where each layer is a filter combined with a cut. Finally, combining all the results, it concludes whether that is a cat ear or not with some confidence score, e.g. 95%. That is Step (3).

Combining all confidence scores of all features (ear, nose, eye, …), it concludes that this is a cat with, e.g., 97% confidence.

The math of model architecture

The model gets all boxes of various sizes in Step (1) by simply

  • shifting a window of 3x3 pixels around the image, it extracts all boxes of size 3x3,
  • then a window of 3x3 on top of that, it extracts all boxes of size 5x5,

and so on. All of this box shifting can be done by matrix multiplication.
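The window shifting in Step (1) can be sketched in a few lines. This toy version uses plain Python loops on a tiny 5x5 "image"; real models do the equivalent with matrix multiplication:

```python
def extract_patches(image, k=3):
    """Slide a k x k window over a 2D image (a list of lists)
    and return every k x k patch — the 'boxes' the model checks."""
    h, w = len(image), len(image[0])
    patches = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            patch = [row[j:j + k] for row in image[i:i + k]]
            patches.append(patch)
    return patches

# A toy 5x5 "image" whose pixel values are just position numbers.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
print(len(extract_patches(image)))  # 9 patches: 3 positions x 3 positions
```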

Step (2) contains 2 operations:

  • Filter: The official name is linear transformation. That is also just matrix multiplication and addition.
  • Cut: The official name is activation. The case I describe above is ReLU activation. ReLU simply turns all negative numbers into 0 (i.e. colors them black) and keeps all positive numbers untouched. There are other activations too, but ReLU is the gold standard.
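The filter-plus-cut pair can be sketched like this. The edge-detecting kernel below is one classic example of a filter; the specific numbers in the patch are made up:

```python
def apply_filter(patch, kernel):
    # "Filter" = linear transformation: multiply element-wise and sum.
    return sum(p * k for prow, krow in zip(patch, kernel)
                     for p, k in zip(prow, krow))

def relu(x):
    # "Cut" = ReLU activation: negatives become 0, positives untouched.
    return max(0, x)

# A classic edge-detecting kernel (one of many possible filters).
kernel = [[-1, -1, -1],
          [-1,  8, -1],
          [-1, -1, -1]]

patch = [[0, 0, 0],
         [0, 9, 0],
         [0, 0, 0]]  # a bright dot on a dark background

print(relu(apply_filter(patch, kernel)))  # 72 — a strong response
```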

Activations are very important, like the sauce of your salad. That is why researchers are still actively looking for good new activations. Here is an important characteristic of an activation: it is non-linear. Matrix multiplication and addition are linear, in the sense that if you apply your Instagram filter 10 times, the result is still just one filter; it can’t do more than that. Only a non-linear operation like the “cut” I describe above can discard unimportant info and leave you the curve of the ear.

Steps (3) and (4) combine numbers, so again they are linear transformations + activations. Finally, we get the final result, e.g. 84% confidence that it is a cat. Then, depending on the trade-off between false positives and false negatives, we can conclude it is a cat or abstain from saying. That is decided by a threshold.
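The final thresholding step is simple enough to sketch directly (the 0.9 default here is an arbitrary illustrative choice):

```python
def decide(confidence, threshold=0.9):
    """Turn a confidence score into a decision.
    A higher threshold means fewer false positives but more abstentions."""
    if confidence >= threshold:
        return "cat"
    return "abstain"

print(decide(0.84))                 # abstain: 0.84 is below the 0.9 threshold
print(decide(0.84, threshold=0.8))  # cat: a laxer threshold accepts it
```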

What I describe above is a convolutional neural network (CNN). Other model architectures are a bit different, but the spirit is the same. Equipped with the architecture of a model, next time we will learn how a model trains itself.

[To be continued]
