Glossary
sampling with replacement
Sampling with replacement means taking a sample from a population, recording the values, putting all cases back into the population, and then sampling again, so the same case can be selected more than once; the R function resample() does this.
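As a rough illustration of the definition above, the sketch below uses base R's sample() with replace = TRUE rather than resample() (which is not part of base R). The population values and variable names are made up for the example.

```r
# Base-R sketch of sampling with replacement (not resample() itself).
set.seed(1)                      # fix the seed so the run is repeatable
population <- c(2, 4, 6, 8, 10)  # hypothetical population of five cases

# replace = TRUE puts each drawn case back before the next draw,
# so the same case can show up more than once in the resample.
resampled <- sample(population, size = length(population), replace = TRUE)
print(resampled)

# Drawing more cases than the population holds is only possible
# when sampling with replacement.
bigger <- sample(population, size = 20, replace = TRUE)
print(length(bigger))
```

Because 20 draws come from only five cases, the larger resample is guaranteed to contain repeats.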
random process
A random process is one in which every possible outcome has a predictable chance of coming up. Although we often think that random means unpredictable, random processes, as statisticians think of them, are actually highly predictable, governed by a ...
law of large numbers
The law of large numbers states that in the long run, whether by collecting lots of data or by repeating a study many times, we will get closer to understanding the true population and data generating process (DGP).
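One way to see the long-run idea above is to simulate it. The sketch below assumes a fair six-sided die as the DGP (an assumption chosen for illustration, not something from the glossary) and tracks the running mean as more rolls are collected.

```r
# Sketch: the running mean of simulated die rolls settles near the DGP's mean.
set.seed(42)
rolls <- sample(1:6, size = 10000, replace = TRUE)  # hypothetical fair die

# Mean of the first 1 roll, the first 2 rolls, ..., all 10000 rolls.
running_mean <- cumsum(rolls) / seq_along(rolls)

expected <- mean(1:6)  # the mean of a fair die, 3.5

# Compare the running mean at a few sample sizes with the expected value.
print(running_mean[c(10, 100, 1000, 10000)])
print(expected)
```

With few rolls the running mean can sit far from 3.5; with many rolls it should land close to it.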
sample distribution
A sample distribution is sampled data (units that were selected and measured) shown in such a way that we can see patterns of variation (such as shape, center, and spread); sample distributions can be seen in visualizations such as histograms, boxplots, etc. but also ...
distribution triad
The distribution triad is the set of three distributions: the sample distribution, the DGP/population distribution, and the sampling distribution.
unimodal distribution
In a unimodal distribution, most values are clustered around a single peak, with tails going out to either side.
uniform distribution
In a uniform distribution, the observations are evenly distributed across the possible values.
symmetrical distribution
In a symmetrical distribution, the left side of the distribution mirrors the right side.
spread of a distribution
The spread of a distribution is how spread out or wide the distribution is; it is a way to characterize how much variability there is in the sample on a particular variable.
smooth density plot
A smooth density plot is a smooth shape overlaid on a histogram, displaying roughly the proportion of each of the values on the x-axis.
skewed distribution
A skewed distribution is asymmetrical; it can be skewed to the left (the skinny, longer tail is on the left) or to the right (the skinny, longer tail is on the right).
shape of a distribution
The shape of a distribution is its overall form (e.g., uniform, symmetrical, skewed, unimodal, bimodal, bell-shaped, normal).
normal distribution
A normal distribution is unimodal and symmetrical, with most scores clumped in the center and few scores far away from the center; also known as a bell-shaped distribution; this is a frequently used probability distribution; the shape of the normal distribution ...
center of a distribution
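The center of a distribution is the middle of the distribution, where values tend to cluster; in a unimodal, roughly symmetrical distribution it coincides with the peak.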
bimodal distribution
A bimodal distribution has two clear clumps of scores around two parts of the measurement scale, with few in the middle.
bell-shaped distribution
A bell-shaped distribution is unimodal and symmetrical, with most scores clumped in the center and few scores far away from the center; also known as a normal distribution.
relative frequency histogram
A relative frequency histogram is a histogram that represents the proportion (instead of the frequency) of cases on the y-axis.
histogram
A histogram is a visualization in which the x-axis represents the values of the variable and the y-axis represents frequency; the height of a bar represents how many cases fall within that range of values.
density
In a histogram, density is roughly the same as proportion.
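The three y-axis scales described above (frequency, proportion, and density) can be compared on one small dataset. This is a minimal base-R sketch with made-up scores and bin edges; it computes the bars with hist(plot = FALSE) instead of drawing them.

```r
# Sketch: frequency, proportion, and density for the same histogram bins.
scores <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 7)   # hypothetical data, n = 10

# Bins of width 2: (0,2], (2,4], (4,6], (6,8]
h <- hist(scores, breaks = seq(0, 8, by = 2), plot = FALSE)

counts      <- h$counts                   # frequency histogram heights
proportions <- h$counts / sum(h$counts)   # relative frequency heights
bin_width   <- diff(h$breaks)             # width of each bin
densities   <- h$density                  # density histogram heights

print(counts)
print(proportions)
print(densities)

# Density times bin width recovers the proportion in each bin.
print(densities * bin_width)
```

Density equals proportion exactly only when the bins are one unit wide, which is why the definition says "roughly".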
distribution
A distribution is the pattern of variation in a variable or set of variables.
quartiles
Quartiles are the result of sorting the values of a quantitative variable and dividing the observations into four groups of equal size.
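The sorting-and-dividing step above can be sketched in base R with quantile() and cut(). The data values and group labels are hypothetical, chosen so that the four groups come out the same size.

```r
# Sketch: split eight observations into four equal-sized groups.
values <- c(7, 1, 5, 3, 9, 11, 13, 15)   # hypothetical data
print(sort(values))

# Cut points between the four groups (Q1, median, Q3).
q <- quantile(values, probs = c(0.25, 0.5, 0.75))
print(q)

# Assign each observation to its quarter using the cut points.
groups <- cut(values,
              breaks = c(-Inf, q, Inf),
              labels = c("lowest quarter", "second quarter",
                         "third quarter", "highest quarter"))
print(table(groups))
```

quantile() offers several interpolation rules; this sketch relies on the default.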
mean
The mean is the average; it is the number in the distribution that balances the residuals, so that the deviations above and below it sum to zero.
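To make "balances the residuals" concrete, the sketch below subtracts the mean from each value and checks that the differences cancel out. The data values are made up for the example.

```r
# Sketch: residuals around the mean sum to zero.
values <- c(2, 4, 4, 6, 9)       # hypothetical data
m <- mean(values)                # 25 / 5 = 5

residuals_from_mean <- values - m
print(residuals_from_mean)       # -3 -1 -1  1  4
print(sum(residuals_from_mean))  # negatives and positives cancel

# Residuals around another number, such as the median, need not balance.
residuals_from_median <- values - median(values)
print(sum(residuals_from_median))
```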
median
The median is the middle number when the numbers are sorted in order.
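A short sketch of the definition above, using made-up numbers. It also shows the usual convention (followed by R's median()) for an even number of values, where there is no single middle number and the two middle values are averaged.

```r
# Sketch: the median as the middle of the sorted values.
odd_values  <- c(9, 1, 5)       # hypothetical data, odd count
even_values <- c(9, 1, 5, 7)    # hypothetical data, even count

print(sort(odd_values))         # 1 5 9, so the middle number is 5
median_odd <- median(odd_values)
print(median_odd)

print(sort(even_values))        # 1 5 7 9, so the middle pair is 5 and 7
median_even <- median(even_values)
print(median_even)              # average of 5 and 7
```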
aggregation
Aggregation is putting together separate elements (e.g., when we aggregate data, we might be putting separate data values together into one summary statistic).
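As an illustration of aggregating separate values into summary statistics, the sketch below uses a small hypothetical data frame (the column names and numbers are invented) and base R's mean() and tapply().

```r
# Sketch: aggregate individual values into summary statistics.
measurements <- data.frame(
  group  = c("a", "a", "b", "b", "b"),   # hypothetical grouping variable
  length = c(60, 64, 55, 57, 62)         # hypothetical measurements
)

# Five separate values aggregated into one summary statistic.
overall_mean <- mean(measurements$length)
print(overall_mean)

# The same values aggregated separately within each group.
group_means <- tapply(measurements$length, measurements$group, mean)
print(group_means)
```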
tidy data
Tidy data is a way of organizing data into rectangular tables, with rows and columns, in which each column is a variable, each row is an observation, and each type of observational unit is kept in a different table.
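A minimal sketch of the layout described above, assuming two hypothetical kinds of observational unit (students and classes), each kept in its own base-R data frame. All names and values are invented for illustration.

```r
# Sketch: tidy tables, one per type of observational unit.
students <- data.frame(          # one row per student
  student_id = 1:3,
  height_cm  = c(160, 172, 168),
  class_id   = c("A", "A", "B")
)

classes <- data.frame(           # one row per class
  class_id = c("A", "B"),
  teacher  = c("Lee", "Diaz")
)

# Each column is a variable and each row is an observation.
print(names(students))
print(nrow(students))

# The tables can still be combined on their shared key when needed.
combined <- merge(students, classes, by = "class_id")
print(combined)
```

Keeping teacher names in the classes table avoids repeating them on every student row.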
sampling variation
Sampling variation is the variation that occurs from sample to sample because no sample is a perfect representation of the population; it can be biased or unbiased, and is also known as sampling error.
sampling error
Sampling error is the variation that occurs from sample to sample because no sample is a perfect representation of the population; it can be biased or unbiased, and is also known as sampling variation.
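The sample-to-sample variation described in the two entries above can be simulated. This sketch assumes a hypothetical population of the numbers 1 to 100 and repeatedly draws random samples of 10 cases, recording the mean of each sample.

```r
# Sketch: means of repeated random samples differ from sample to sample.
set.seed(7)
population <- 1:100                      # hypothetical population
population_mean <- mean(population)      # 50.5

# Draw 1000 samples of 10 cases each and keep each sample's mean.
sample_means <- replicate(1000, mean(sample(population, size = 10)))

print(head(sample_means))   # individual samples miss the population mean
print(range(sample_means))  # how far the sample means spread
print(mean(sample_means))   # on average they sit near the population mean
```

Because the samples here are drawn at random, the misses scatter on both sides of the population mean instead of leaning one way.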
random sampling
Random sampling means every case in the population has an equal probability of being selected for the study.
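To illustrate the equal-probability idea above, the sketch below draws one case at a time from a hypothetical five-case population many times and tabulates how often each case is selected. With equal probabilities, each share should be close to 1/5.

```r
# Sketch: under random sampling every case is equally likely to be picked.
set.seed(3)
case_ids <- 1:5                                          # hypothetical cases

# 10000 independent single-case draws from the population.
draws <- sample(case_ids, size = 10000, replace = TRUE)

# Share of draws that selected each case.
selection_share <- table(factor(draws, levels = case_ids)) / length(draws)
print(selection_share)
```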
population
Population is the universe of cases that could be sampled and measured for a study.
independent sampling
Independent sampling means that the selection of one case for a study has no effect on the selection of another case.
mistake
A mistake is an error caused by human misunderstanding or some problem in the data gathering process (e.g., misspellings, people who didn't follow directions, data that have been entered into the computer incorrectly).
measurement error
Measurement error is error caused by the natural fluctuation in most real-world measurements.
numeric (in R)
Numeric (in R) is a type of R object used to represent a quantitative variable.
quantitative variables
Quantitative variables are variables whose values represent some quantity.
factor (in R)
Factor (in R) is a type of R object used to represent a categorical variable.
categorical variable
A categorical variable is a variable for which the values don’t tell us anything about quantity; the values simply tell us which category the object belongs to.
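The numeric and factor entries above can be sketched in base R as follows. The variable names and values are hypothetical; the last step shows number codes that stand for categories being stored as a factor so that R treats them as labels, not quantities.

```r
# Sketch: numeric objects for quantities, factors for categories.
height <- c(160.5, 172, 168)                      # quantitative variable
eye_color <- factor(c("brown", "blue", "brown"))  # categorical variable

print(is.numeric(height))     # TRUE
print(mean(height))           # arithmetic makes sense for quantities

print(is.factor(eye_color))   # TRUE
print(levels(eye_color))      # the categories, sorted alphabetically

# Number codes that only name categories are better stored as a factor.
group_code <- c(1, 2, 1)
group <- factor(group_code, levels = c(1, 2),
                labels = c("control", "treatment"))
print(group)
```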
sampling
Sampling is the process by which we choose which objects to study.
Measurement
Measurement is the process of assigning numbers or categories to variables so they can be analyzed, modeled, and used to answer research questions. Measurement is the foundation of all analysis because what you measure (and how you measure it) ...
Data Frame
A data frame is a way to organize data into rows and columns, similar to a spreadsheet or table. Data frames are one of the most common ways to store and work with data in statistics, data science, and R. How data frames work: each row represents one ...
Values
Values (in a dataset) are the specific entries recorded for each case (row) on a given variable (column) in a dataset. A value represents the observed measurement, category, or state of that variable for an individual case. In other words, if ...