Sampling

Sampling is the process of selecting which individuals, objects, or observations to study from a larger data generating process (DGP). Sampling determines what your data represent and how far your conclusions can generalize.

Why Sampling Matters

Sampling affects:

  • What part of the data generating process your data reflect

  • How accurate your estimates are

  • Whether patterns reflect reality or bias

  • How confident you can be in your model

Even the best statistical model cannot fix poor sampling.

Data Generating Process vs. Sample

Understanding sampling begins with two key ideas:

Term: Data Generating Process (DGP)
Definition: The broader process that produces the data you care about
Example: All students who take Intro Statistics

Term: Sample
Definition: The subset of observations you actually measure
Example: 300 students surveyed

Goal of Sampling:
Choose a sample that represents the data generating process.

Common Sampling Methods

Random Sampling

Each observation in the data generating process has an equal chance of being selected.

Example:

  • Randomly selecting student IDs

Why it's useful:

  • Reduces selection bias, because chance (not the researcher or the participant) decides who is included

  • Supports generalization to the DGP
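
As a minimal sketch of the student ID example, the Python code below draws a simple random sample using the standard library. The roster, the ID range, the sample size of 300, and the function name draw_simple_random_sample are all assumptions made up for illustration.

```python
import random

# Hypothetical roster: made-up ID numbers for 5,000 students.
student_ids = list(range(100000, 105000))

def draw_simple_random_sample(ids, n, seed=None):
    """Return n distinct IDs; every ID has the same chance of being chosen."""
    rng = random.Random(seed)  # seeding makes the draw reproducible
    return rng.sample(ids, n)  # sampling without replacement

sample_ids = draw_simple_random_sample(student_ids, n=300, seed=42)
print(len(sample_ids))       # number of students selected
print(sorted(sample_ids)[:5])  # a few of the selected IDs
```

Because the selection is left to a random number generator, no student is more likely to be picked than any other, which is what the definition above requires.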

Convenience Sampling

Selecting individuals who are easiest to access.

Examples:

  • Students in your class

  • People near you

Limitations:

  • Often biased

  • May not represent the DGP well

Volunteer Sampling

Participants select themselves by deciding whether to take part.

Example:

  • Online surveys

Limitations:

  • Volunteers may differ from non-volunteers in the DGP
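
The short simulation below is one way to see why self-selection matters. It is only a sketch under invented assumptions: each person has an "interest" score from 0 to 10, and the chance of volunteering is assumed to equal interest / 10. Neither the scores nor the response rule come from real data.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population: each person's interest in the survey topic, 0 to 10.
population = [rng.uniform(0, 10) for _ in range(20000)]

# Assumed self-selection rule: the chance of responding equals interest / 10,
# so highly interested people volunteer more often.
volunteers = [x for x in population if rng.random() < x / 10]

population_mean = statistics.mean(population)
volunteer_mean = statistics.mean(volunteers)
print(f"population mean interest: {population_mean:.2f}")
print(f"volunteer mean interest:  {volunteer_mean:.2f}")
```

Under this assumed rule, the volunteers report noticeably higher interest than the population as a whole, even though nobody was deliberately excluded.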

Sampling Bias

Sampling bias occurs when the sample is systematically different from the data generating process.

Examples:

  • Surveying only morning classes

  • Polling only social media users

  • Studying only one school

Sampling bias can lead to misleading models.
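
To illustrate what "systematically different" means, the sketch below repeats the morning-classes example many times. All numbers are invented, and the assumption that morning-class students sleep less is made purely for illustration. The helper average_estimate is hypothetical.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population (all values invented): hours of sleep per night.
# Assumption for illustration only: morning-class students sleep less on average.
morning = [rng.gauss(6.5, 1.0) for _ in range(2000)]
afternoon = [rng.gauss(7.5, 1.0) for _ in range(3000)]
everyone = morning + afternoon
true_mean = statistics.mean(everyone)

def average_estimate(source, n=100, repeats=200):
    """Average of the sample means from repeated random samples of size n."""
    means = [statistics.mean(rng.sample(source, n)) for _ in range(repeats)]
    return statistics.mean(means)

random_avg = average_estimate(everyone)  # samples drawn from all students
morning_avg = average_estimate(morning)  # samples drawn from morning classes only

print(f"mean for all students:      {true_mean:.2f}")
print(f"average random-sample mean: {random_avg:.2f}")
print(f"average morning-only mean:  {morning_avg:.2f}")
```

Individual random samples miss the true mean in both directions, so their errors tend to cancel out over many samples. The morning-only samples miss in the same direction every time, so repeating the study does not remove the error.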

Sample Size

Larger samples generally:

  • Reduce variability

  • Improve precision

  • Make model estimates more stable from sample to sample

However, large biased samples are still biased.

Example:

  • 10,000 responses from a single school may still not represent the broader DGP.
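
The effect of sample size on variability can be sketched with a small simulation. The population of exam scores below is invented, and the function name spread_of_sample_means is a hypothetical helper. The code measures how much the sample mean varies across repeated random samples of three different sizes.

```python
import random
import statistics

rng = random.Random(2)

# Hypothetical population of 50,000 exam scores (invented values).
population = [rng.gauss(70, 12) for _ in range(50000)]

def spread_of_sample_means(n, repeats=500):
    """Standard deviation of the sample mean across repeated random samples of size n."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

spreads = {n: spread_of_sample_means(n) for n in (25, 100, 400)}
for n, s in spreads.items():
    print(f"n = {n:>3}: sample means vary with an SD of about {s:.2f}")
```

The spread shrinks as n grows: quadrupling the sample size roughly halves it, because the standard error of a mean is proportional to 1/√n. This improves precision only. It does nothing to correct a sample that was drawn from the wrong part of the DGP.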

Sampling and Generalization

We use samples to make inferences about the data generating process.

Better sampling → Better generalization to the DGP
Poor sampling → Limited conclusions

Example:

If you sample only psychology majors, your conclusions apply mainly to psychology majors within the broader data generating process.

Key Takeaway

Sampling determines what part of the data generating process your data represent.
Good sampling allows you to generalize your models. Poor sampling limits what conclusions you can make.



    • Related Articles

    • sampling distribution

      Sampling distribution is the distribution of an estimate across many possible samples.
    • sampling error

      Sampling error is the variation that occurs from sample to sample because no sample is a perfect representation of the population. It can be biased or unbiased and is also known as sampling variation.
    • sampling variation

      Sampling variation is the variation that occurs from sample to sample because no sample is a perfect representation of the population. It can be biased or unbiased and is also known as sampling error.
    • random sampling

      Random sampling means every case in the population has an equal probability of being selected for the study.
    • independent sampling

      Independent sampling means that the selection of one case for a study has no effect on the selection of another case.