A statistical model is a simplified description of how data are generated. It helps us separate what we can explain using known information from what we cannot. We use statistical models for three main purposes: to describe the data we have, to predict values we have not yet observed, and to support evidence-based decisions.
At a basic level, many statistical models follow the same idea, which can be written as a word equation:
Data = Model + Error
Data are the values we actually observe.
Model represents the systematic part we are trying to explain or predict using variables we measure.
Error captures everything the model does not explain, such as randomness, measurement noise, or missing information.
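As a minimal sketch of this word equation, consider the simplest possible model: a single number (the mean) standing in for every observation. The data values below are invented purely for illustration.

```python
import numpy as np

# Made-up observations; the simplest "model" is just their mean.
data = np.array([70.0, 75.0, 80.0, 85.0, 90.0])
model = np.full_like(data, data.mean())  # systematic part: one number for everyone
error = data - model                     # everything the model leaves unexplained

print(error)                             # [-10.  -5.   0.   5.  10.]
print(np.allclose(data, model + error))  # True: Data = Model + Error
```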
Suppose we want to understand students’ exam scores.
Exam score = Effect of study time + Error
Here, the model explains exam scores using study time, while the error includes factors like test anxiety, question difficulty, or luck.
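A sketch of this example, under invented assumptions: we simulate scores where each hour of study is worth 5 points, then recover that effect from the noisy data with an ordinary least-squares line fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: baseline of 50 points, 5 points per hour studied,
# plus noise standing in for anxiety, question difficulty, luck, etc.
hours = rng.uniform(0, 10, size=100)
scores = 50 + 5 * hours + rng.normal(0, 8, size=100)

# Fit the "model" part; whatever is left over is the "error" part.
slope, intercept = np.polyfit(hours, scores, deg=1)
residuals = scores - (intercept + slope * hours)

print(f"estimated effect of study time: {slope:.2f} points per hour")
```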
If we want to predict house prices:
House price = Effect of size and location + Error
The model uses measurable features (such as square footage and neighborhood), and the error accounts for things like buyer preferences or unmeasured features of the house.
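The same sketch extends to several inputs. Below, price depends on size and a downtown indicator (all numbers invented for illustration); stacking the inputs into a design matrix and solving least squares estimates each effect at once.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data: $120 per square foot plus a $40,000 downtown premium.
size = rng.uniform(800, 3000, size=200)
downtown = rng.integers(0, 2, size=200)
price = 50_000 + 120 * size + 40_000 * downtown + rng.normal(0, 20_000, size=200)

# Columns: intercept, square footage, neighborhood dummy.
X = np.column_stack([np.ones_like(size), size, downtown])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)

print(f"per-square-foot effect ≈ {beta[1]:.0f}, downtown premium ≈ {beta[2]:.0f}")
```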
In statistics and data science, this idea is often written more formally using General Linear Model (GLM) notation:
Yi = β0 + β1Xi + εi
Here:
Y is the outcome we want to explain or predict (for example, exam scores or house prices).
X is one or more input variables (such as study time, house size, or location).
β (beta) represents the model’s parameters, which describe how strongly each input variable affects the outcome.
ε (epsilon) is the error term, representing unexplained variation.
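To make the notation concrete, here is a small sketch with invented true values β0 = 3 and β1 = 1.5. Once the parameters are estimated, the fitted values plus the residuals reproduce the observed Y exactly.

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented truth: Yi = 3 + 1.5*Xi + εi
X = rng.uniform(0, 10, size=50)
Y = 3.0 + 1.5 * X + rng.normal(0, 2.0, size=50)

b1, b0 = np.polyfit(X, Y, deg=1)   # estimates of β1 and β0
fitted = b0 + b1 * X               # the model part
eps_hat = Y - fitted               # the estimated error term

print(f"b0 ≈ {b0:.2f}, b1 ≈ {b1:.2f}")  # close to 3 and 1.5
print(np.allclose(Y, fitted + eps_hat))  # True: Data = Model + Error
```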
Once the parameters are estimated, plugging inputs into the model yields predicted values, such as:
a student’s exam score,
the expected price of a house,
or the probability that an email is spam.
By comparing these predicted values to the actual data, we can evaluate how well a model performs, and we can use data to make informed, evidence-based decisions.
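One common way to do that comparison, sketched below with the invented exam-score setup from earlier: hold out part of the data, predict it, and summarize the gap between predictions and reality with the root-mean-square error.

```python
import numpy as np

rng = np.random.default_rng(3)

# Same invented setup as before: 50 + 5*hours + noise.
hours = rng.uniform(0, 10, size=200)
scores = 50 + 5 * hours + rng.normal(0, 8, size=200)

# Fit on the first 150 students, evaluate on the held-out 50.
slope, intercept = np.polyfit(hours[:150], scores[:150], deg=1)
predicted = intercept + slope * hours[150:]

rmse = np.sqrt(np.mean((scores[150:] - predicted) ** 2))
print(f"typical prediction error: about {rmse:.1f} points")
```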