Sampling is the process of selecting which individuals, objects, or observations to study from a larger data generating process (DGP). Sampling determines what your data represent and how far your conclusions can generalize.
Sampling affects:
What part of the data generating process your data reflect
How accurate your estimates are
Whether patterns reflect reality or bias
How confident you can be in your model
Even the best statistical model cannot fix poor sampling.
Understanding sampling begins with its goal and with the common ways a sample can be drawn:
Goal of Sampling:
Choose a sample that represents the data generating process.
Random sampling: each observation in the data generating process has an equal chance of being selected.
Example:
Randomly selecting student IDs
Why it's useful:
Reduces selection bias, because chance rather than the researcher decides who is included
Supports generalization to the DGP
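The student ID example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the list `student_ids`, the helper `simple_random_sample`, and the sample size of 25 are all made up for the example.

```python
import random

# Hypothetical sampling frame: 500 made-up student IDs (S0001 ... S0500).
student_ids = [f"S{n:04d}" for n in range(1, 501)]

def simple_random_sample(frame, k, seed=None):
    """Draw k distinct units so that every unit has the same chance of selection."""
    rng = random.Random(seed)   # a seed makes the draw reproducible
    return rng.sample(frame, k) # sampling without replacement

sample = simple_random_sample(student_ids, k=25, seed=42)
print(len(sample), "students selected, for example:", sample[:5])
```

Fixing the seed is only a convenience for reproducing the same draw later; any seed gives every ID the same chance of being picked.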
Convenience sampling: selecting the individuals who are easiest to access.
Examples:
Students in your class
People near you
Limitations:
Often biased
May not represent the DGP well
Voluntary response sampling: people decide for themselves whether to participate.
Example:
Online surveys
Limitations:
Volunteers may differ from non-volunteers in the DGP
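A small simulation can show why self-selection matters. The sketch below assumes a made-up population of satisfaction scores from 1 to 10 and an invented response rule in which dissatisfied people (score 3 or lower) answer 50% of the time and everyone else 5% of the time. Real response behavior is unknown; the numbers only illustrate the mechanism.

```python
import random
from statistics import mean

rng = random.Random(0)

# Hypothetical population: satisfaction scores on a 1-10 scale.
population = [rng.randint(1, 10) for _ in range(20_000)]

def responds(score, rng):
    """Assumed response rule: unhappy people are far more likely to answer."""
    p = 0.5 if score <= 3 else 0.05
    return rng.random() < p

# Voluntary response sample: whoever chooses to answer.
volunteer_sample = [s for s in population if responds(s, rng)]

# Random sample of the same size, for comparison.
random_sample = rng.sample(population, len(volunteer_sample))

print(f"population mean:       {mean(population):.2f}")
print(f"volunteer sample mean: {mean(volunteer_sample):.2f}")
print(f"random sample mean:    {mean(random_sample):.2f}")
```

Under this assumed rule the volunteer mean sits well below the population mean, while the random sample of the same size lands close to it.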
Sampling bias occurs when the sample differs systematically from the data generating process it is meant to represent.
Examples:
Surveying only morning classes
Polling only social media users
Studying only one school
A model fit to a biased sample can describe that sample well and still be misleading about the DGP.
Larger samples generally:
Reduce sampling variability (estimates change less from one sample to the next)
Improve precision (narrower margins of error)
Make model estimates more stable
However, large biased samples are still biased: a larger sample reduces random error, not systematic error.
Example:
Even 10,000 responses, if they all come from one class, may still not represent the broader DGP.
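Both points, that larger samples vary less and that size does not remove bias, can be checked with a short simulation. Everything here is an assumption made for illustration: 50 hypothetical classes with different average weekly study hours, the helper names `draw_from_dgp`, `draw_from_one_class`, and `spread_of_sample_means`, and the particular sample sizes.

```python
import random
from statistics import mean, stdev

rng = random.Random(1)

# Hypothetical DGP: 50 classes, each with its own average weekly study hours.
class_levels = [rng.gauss(10, 3) for _ in range(50)]
true_mean = mean(class_levels)  # long-run average when classes are equally likely

def draw_from_dgp():
    """One observation from a randomly chosen class (representative source)."""
    return rng.gauss(rng.choice(class_levels), 2)

def draw_from_one_class():
    """One observation from a single class, here the one with the highest average."""
    return rng.gauss(max(class_levels), 2)

def spread_of_sample_means(n, reps=200):
    """Standard deviation of the sample mean over repeated samples of size n."""
    means = [mean([draw_from_dgp() for _ in range(n)]) for _ in range(reps)]
    return stdev(means)

# Larger random samples: the sample mean varies less from sample to sample.
spreads = {n: spread_of_sample_means(n) for n in (25, 400)}
for n, s in spreads.items():
    print(f"n = {n:>3}: spread of sample means = {s:.2f}")

# A very large sample from one class stays far from the DGP average.
big_random = mean([draw_from_dgp() for _ in range(10_000)])
big_biased = mean([draw_from_one_class() for _ in range(10_000)])
print(f"DGP average:                   {true_mean:.2f}")
print(f"10,000 random responses:       {big_random:.2f}")
print(f"10,000 responses, one class:   {big_biased:.2f}")
```

The spread shrinks as n grows, but the one-class average stays off target no matter how many responses are collected, because the error comes from where the data were drawn rather than from how many were drawn.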
We use samples to make inferences about the data generating process.
Better sampling → Better generalization to the DGP
Poor sampling → Limited conclusions
Example:
If you sample only psychology majors, your conclusions apply mainly to psychology majors, not to the broader data generating process.
Sampling determines what part of the data generating process your data represent.
Good sampling allows you to generalize your models. Poor sampling limits the conclusions you can draw.