Overfitting vs Underfitting and Plato

What our message now signifies is that the ability and means of learning is already present in the soul. As the eye could not turn from darkness to light unless the whole body moved, so it is that the mind can only turn around from the world of becoming to that of Being by a movement of the whole soul. The soul must learn, by degrees, to endure the contemplation of Being and the luminous realms.

– Plato

How to Evaluate if a Model is Good or Not?

Now, in 2025, humans have developed countless statistical and machine learning models (or algorithms). Still, the No Free Lunch Theorem tells us that for some datasets the best model is a linear model, while for others it is a neural network. Thus, there is no model that works best a priori.

Explaining in simple terms: a priori means before an experience (for example, you will never know what will happen on a trip abroad before taking it; in other words, you cannot know it a priori). So, suppose we receive an arbitrary dataset.

The No Free Lunch Theorem says it is impossible to identify the best model before fitting it, meaning:

there is no way to know which model is best before applying the candidates to the dataset;

extending this, no model is universally better than all the others;

so each model has its own advantage, depending on the dataset.

The conclusion of the No Free Lunch Theorem leads us to the next question: since there is no best model a priori, we have to test several models and see which one performs best; but how? There is a simple and widely used metric for evaluating models: the MSE (Mean Squared Error).
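To make this concrete, here is a minimal sketch in Python (using scikit-learn and purely synthetic data, and the MSE metric defined just below). A linear model and a decision tree are fitted to two different datasets; typically each one wins on one of the datasets, which is exactly the point of the No Free Lunch Theorem. The dataset names and shapes here are illustrative assumptions, not part of the original text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))

# Dataset A: a truly linear relationship (plus noise)
y_linear = 3.0 * X[:, 0] + rng.normal(0, 1.0, size=300)
# Dataset B: a step-like relationship the linear model cannot capture
y_steps = np.where(X[:, 0] > 5, 10.0, 0.0) + rng.normal(0, 1.0, size=300)

for name, y in [("linear data", y_linear), ("step data", y_steps)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
        model.fit(X_tr, y_tr)
        mse = mean_squared_error(y_te, model.predict(X_te))
        print(f"{name:12s} {type(model).__name__:22s} test MSE: {mse:.2f}")
```

On a run like this, the linear model usually wins on the linear dataset and the tree wins on the step-shaped one: neither model is best for both.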

So, assuming that we have $n$ observations $(x_1, y_1), \dots, (x_n, y_n)$ and a fitted model $\hat{f}$ whose prediction for observation $i$ is $\hat{f}(x_i)$, we define

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - \hat{f}(x_i) \bigr)^2$$

Here are some observations about this formula:

Why MEAN squared error? If we removed the $1/n$, the quantity could still be used as a metric: it would be the cumulative error of the model's predictions over all observations, and of course, the smaller this cumulative error, the better the model. Dividing by $n$ gives us the average error per observation instead.

Why mean SQUARED error? The mathematical intuition: if we want to accumulate the errors across the observations, we have to ensure that each individual error (the difference between observation and prediction) contributes a non-negative amount; otherwise positive and negative differences would cancel each other out, and the errors would not truly accumulate. Squaring solves this, since a squared value is always greater than or equal to zero.

Based on this formula, we can compare different models: the smaller the mean squared error, the better, since a small MSE means the estimates are close to the observed values, that is, the difference between our predictions and reality is small.

There are other metrics, but MSE is the most commonly used.
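As a concrete illustration, here is a minimal sketch of the MSE formula in Python; the array names y_true and y_pred are just placeholders for the observed values and a model's predictions:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences
    between observed values and model predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Example: observed values vs. a model's predictions
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.8, 5.4, 2.0, 6.5]
print(f"{mse(y_true, y_pred):.3f}")  # prints 0.175
```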

Underfitting vs Overfitting: Understanding the World

Frequently, we do not have a second dataset with the same structure (that is, with all the same dependent and independent variables) on which we can test our trained model. Thus, the usual approach is to divide the dataset into two parts, which we call training data and test data. Typically, we use 80 percent of the data for training and the remaining 20 percent for testing, or 90 percent for training and 10 percent for testing.

So, the model trains on the training dataset, then takes the test dataset and uses its "acquired knowledge" to make predictions. With the predicted values $\hat{f}(X)$, we can compare against the true values $Y$ in the test set and thus calculate the MSE. This process applies to essentially all machine learning models.
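Here is a minimal sketch of this train/test workflow, assuming scikit-learn is available; the dataset is synthetic and only for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic dataset: y depends linearly on x, plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 1.0, size=200)

# 80% of the rows for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# Compare the error on data the model saw vs. data it never saw
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"training MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")
```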

Of course, during model training we also have several tuning techniques for obtaining the smallest possible MSE. In other words, we use MSE to evaluate the model's efficiency in both the training and the testing stages. And this leads us to the following definitions: underfitting and overfitting.

Underfitting: when the training error is very large, meaning the training MSE is high.
Overfitting: when the difference between the training error and the test error is large, meaning the gap between the training MSE and the test MSE is large.

The concept of underfitting is intuitive: if a model performs very poorly even on its own training data, we do not expect it to be a good model. The questions arise when we analyze the concept of overfitting. Why is overfitting a problem? What does it mean that the difference between the training MSE and the test MSE is large?

A large difference between the training and test errors could arise in two ways. In the first case, the training error is large and the test error is small; in practice this essentially never happens, as a model that performs poorly on the data it was trained on will not suddenly predict the test set with excellent performance. So only the second case remains: the training error is low while the test error is high. This is the less abstract definition of overfitting.

Remember, there is no problem with a low training error; the smaller, the better. The problem is when the training goes well and the training MSE is low, but when we present new situations to test the model, it performs poorly. In plain, non-jargon words: the model did not learn to handle new situations; it fails to generalize.
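To see both failure modes in code, here is a sketch (again with scikit-learn and synthetic data, an illustrative setup rather than a prescribed one) that fits polynomials of increasing degree to a noisy sine curve. A degree that is too low should underfit (high training MSE), while a degree that is too high should overfit (training MSE much smaller than test MSE):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Small, noisy training set drawn from a sine curve
X = np.sort(rng.uniform(0, 1, size=(30, 1)), axis=0)
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.2, size=30)
# Separate test set from the same underlying curve
X_test = np.linspace(0, 1, 100).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test[:, 0]) + rng.normal(0, 0.2, size=100)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    # degree 1 tends to underfit (high training MSE);
    # degree 15 tends to overfit (training MSE far below test MSE)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```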

Since the essence of this concept is difficult to capture, I will use a philosophical story to make overfitting easier to understand: the famous Myth of the Cave.

Myth of the Cave and Overfitting

The Myth of the Cave (or Allegory of the Cave), written by the Greek philosopher Plato centuries before Christ, tells the following: in a cave, prisoners are chained and forced to look at a wall where shadows are projected. These shadows, created by objects carried behind them and illuminated by a fire, are the only reality the prisoners know.

One of the prisoners is freed and forced to leave the cave. Initially, the sunlight blinds and causes pain, but gradually he gets used to the outside world. He realizes that the shadows in the cave were just imitations of reality, and that the real world is much more complex and beautiful.

The prisoner returns to the cave to share his discovery with the others, but they refuse to believe him and ridicule him. They prefer to remain in ignorance, believing that the shadows are reality.

The allegory above has had a lot of influence in the history of philosophy, but today I will introduce a new perspective: the minds of the prisoners chained in the cave, interpreting the shadows on the wall as the whole world, are overfitting. The "statistical model," which in this case is the prisoners' minds, is trained on its training data (the shadows projected on the wall) to accept only one reality; and the most interesting fact is that this "statistical model" is so well trained on this training set that the prisoners convinced themselves that the world is a wall with shadows. But we, from outside that context, understand that the real world lies outside the cave, as the prisoner who was freed from his chains witnessed. The world outside the cave can be considered the test set, one the prisoners chained in the cave cannot understand, since their minds were trained from birth to accept only one reality. And that reality is simple, not true, and far less beautiful and complex than the real one.

We now know that the prisoners inside the cave are suffering from an "overfitting of simple reality." But here is the question: are we not suffering from the same thing? Are we also prisoners of the world we are living in? What if our reality, our world or universe, is a training set, and our minds are so well trained and adapted to it that we cannot accept a more complex reality? In other words, how can we know whether we suffer from an "overfitting of simple reality" or not? These are interesting questions to ponder.
