Appendix D — Distributions
A probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment. There are many different probability distributions, each with its own characteristics and applications. Distributions can be classified into two main categories: discrete and continuous. Discrete distributions are used for count data, while continuous distributions are used for continuous data.
In statistical inference, probability distributions are used to model the uncertainty in the data and to make predictions about the population based on the sample. The choice of the right distribution depends on the nature of the data. In this regard, one may distinguish between two possible interpretations of probability distributions: mechanistic and phenomenological. A mechanistic interpretation is based on the underlying process that generates the data since most distributions can be derived from some stochastic process (see below for details). However, it is not always possible to match the data to a specific process described by a distribution, and in such cases, a phenomenological interpretation is used, where the distribution is chosen based on its mathematical properties and how well it fits the data.
Distributions are characterized by their probability density function (pdf), for continuous variables, or probability mass function (pmf), for discrete variables, which provides the relative probability of each possible outcome, and by the cumulative density function (cdf), which provides the probability of observing a value less than or equal to a given value. For simplicity, we will use the terms probability density function (pdf) and cumulative density function (cdf) for both continuous and discrete distributions.
We will use probability density functions to build likelihood functions, whereas cumulative density functions are used to calculate p values. Additionally, we will use the quantile function, which is the inverse of the cumulative density function, to calculate confidence intervals.
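As a preview, these three uses can be sketched with the Normal functions introduced below (the numeric values here are arbitrary illustration choices, not taken from the text):

```r
# pdf -> likelihood of a sample (product of densities under N(0, 1))
y <- c(-0.3, 0.1, 0.7)
likelihood <- prod(dnorm(y, mean = 0, sd = 1))

# cdf -> two-sided p value for a standardized statistic z
z <- 2.1
p_value <- 2 * pnorm(-abs(z))

# quantile function -> 95% confidence interval for an estimate with standard error se
estimate <- 10
se <- 0.5
ci <- estimate + qnorm(c(0.025, 0.975)) * se
```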
In each section, we will discuss some of the most common probability distributions used in statistics as well as the most common mechanistic interpretation. We will also provide the probability density function (pdf) and cumulative density function (cdf) of each distribution as well as the R functions to work with them.
D.1 Gaussian or Normal distribution
This is the most common distribution in statistics, hence the term Normal, although it is sometimes also referred to as Gaussian in honor of Carl Friedrich Gauss. Mechanistically, the Normal distribution arises from the sum of a large number of independent and identically distributed (i.i.d.) random variables, according to the Central Limit Theorem. This often applies to measurement errors (which are the sum of many small errors) as well as to the distribution of sample means. Phenomenologically, the Normal distribution is used when a symmetric and bell-shaped distribution is needed. Finally, the Normal distribution possesses several analytical properties that make it very convenient for building statistical models.
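The Central Limit Theorem claim can be illustrated with a short simulation (a sketch; the choice of Uniform(0, 1) summands, the sample sizes, and the seed are ours, not part of the text):

```r
set.seed(42)
# Each observation is the sum of 20 i.i.d. Uniform(0, 1) variables
sums <- replicate(1e4, sum(runif(20)))
# By the CLT, the sums are approximately N(20 * 1/2, 20 * 1/12)
c(mean(sums), 20 * 1/2)    # empirical vs. theoretical mean
c(sd(sums), sqrt(20 / 12)) # empirical vs. theoretical standard deviation
hist(sums, freq = FALSE)
curve(dnorm(x, mean = 10, sd = sqrt(20 / 12)), add = TRUE, col = "red")
```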
This distribution is meant for continuous data (not counts!) and is defined by two parameters: the mean \(\mu\) and the variance \(\sigma^2\). The probability density function of the normal distribution is given by:
\[ f(y) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2} \]
In statistics notation, we write \(Y \sim N(\mu, \sigma^2)\) to denote that \(Y\) follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\).
The cumulative density function (CDF) of the normal distribution is given by:
\[ F(y) = \int_{-\infty}^{y} f(t)\, dt = \frac{1}{2}\left[1 + \text{erf}\left(\frac{y-\mu}{\sigma\sqrt{2}}\right)\right] \]
where \(\text{erf}\) is the error function.
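Both formulas can be checked numerically against R's built-in Normal functions (a quick sketch; integrate() performs the numerical integration of the pdf):

```r
mu <- 0
sigma <- 1
# The pdf written out by hand, following the formula above
f <- function(y) 1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((y - mu) / sigma)^2)
# The hand-written pdf agrees with dnorm()
f(1.3)
dnorm(1.3, mean = mu, sd = sigma)
# The cdf is the integral of the pdf up to y, and agrees with pnorm()
integrate(f, -Inf, 1.3)$value
pnorm(1.3, mean = mu, sd = sigma)
```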
In R, the normal distribution is implemented in the functions dnorm for the probability density function, pnorm for the cumulative density function, qnorm for the quantile function, and rnorm for generating random numbers. The parameters of these functions are mean and sd rather than mu and sigma (note that sd is the standard deviation \(\sigma\), not the variance \(\sigma^2\)):
n <- 10
p <- 0.42
y <- rnorm(n, mean = 0, sd = 1) # generate n random numbers
dnorm(y, mean = 0, sd = 1) # probability density of each value in a sample
pnorm(y, mean = 0, sd = 1) # cumulative density of each value in a sample
qnorm(p, mean = 0, sd = 1) # quantile for cumulative density of p
D.2 t-distribution
The t-distribution is used to model the distribution of the standardized sample mean when the sample size is small and the population variance is unknown. Mechanistically, the t-distribution arises as the ratio of a standard normal random variable and the square root of a chi-squared random variable divided by its degrees of freedom. Phenomenologically, the t-distribution is used when a symmetric, bell-shaped distribution with heavier tails than the Normal is needed. The t-distribution is defined by a single parameter: the degrees of freedom \(\nu\). The probability density function of the t-distribution is given by:
\[ f(y) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\Gamma\left(\frac{\nu}{2}\right)} \left(1+\frac{y^2}{\nu}\right)^{-\frac{\nu+1}{2}} \]
where \(\Gamma\) is the gamma function. In statistics notation, we write \(Y \sim t(\nu)\) to denote that \(Y\) follows a t-distribution with \(\nu\) degrees of freedom.
In R, the t-distribution is implemented in the functions dt for the probability density function, pt for the cumulative density function, qt for the quantile function, and rt for generating random numbers:
n <- 10
p <- 0.42
y <- rt(n, df = 10) # generate n random numbers
dt(y, df = 10) # probability density of each value in a sample
pt(y, df = 10) # cumulative density of each value in a sample
qt(p, df = 10) # quantile for cumulative density of p
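As a check on the mechanistic interpretation, we can simulate the ratio of a standard normal variable and the square root of a chi-squared variable divided by its degrees of freedom, and compare the result with qt() (a sketch; the seed, sample size, and \(\nu = 10\) are arbitrary choices):

```r
set.seed(1)
nu <- 10
z <- rnorm(1e5)            # standard normal draws
v <- rchisq(1e5, df = nu)  # chi-squared draws with nu degrees of freedom
t_sim <- z / sqrt(v / nu)  # mechanistic construction of the t-distribution
# The simulated quantiles should match the theoretical ones from qt()
quantile(t_sim, c(0.025, 0.975))
qt(c(0.025, 0.975), df = nu)
```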
D.3 Binomial distribution
The binomial distribution is used to model the number of successes in a fixed number of independent Bernoulli trials. A Bernoulli trial is a random experiment with two possible outcomes: success and failure. The binomial distribution is defined by two parameters: the number of trials \(n\) and the probability of success \(p\). The probability mass function of the binomial distribution is given by:
\[ f(y) = \binom{n}{y} p^y (1-p)^{n-y} \]
where \(\binom{n}{y}\) is the binomial coefficient, which is the number of ways to choose \(y\) successes out of \(n\) trials. In statistics notation, we write \(Y \sim B(n, p)\) to denote that \(Y\) follows a binomial distribution with \(n\) trials and probability of success \(p\).
In R, the binomial distribution is implemented in the functions dbinom for the probability mass function, pbinom for the cumulative probability function, qbinom for the quantile function, and rbinom for generating random numbers:
n <- 10
p <- 0.42
y <- rbinom(n, size = 10, prob = 0.5) # generate n random numbers
dbinom(y, size = 10, prob = 0.5) # probability mass of each value in a sample
pbinom(y, size = 10, prob = 0.5) # cumulative density of each value in a sample
qbinom(p, size = 10, prob = 0.5) # quantile for cumulative density of p
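The probability mass function above can also be written out by hand with choose(), which computes the binomial coefficient, and checked against dbinom() (a sketch with arbitrary example values):

```r
# pmf of the binomial distribution written out following the formula
n <- 10
p <- 0.5
y <- 3
choose(n, y) * p^y * (1 - p)^(n - y) # hand-written pmf
dbinom(y, size = n, prob = p)        # should give the same value
```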
D.4 Poisson distribution
The Poisson distribution is used to model the number of events in a fixed interval of time or space (mechanistically) or, phenomenologically, for count data in which the variance equals the mean. It is defined by a single parameter: the rate \(\lambda\), which is the average number of events in the interval. The probability mass function of the Poisson distribution is given by:
\[ f(y) = \frac{\lambda^y e^{-\lambda}}{y!} \]
In statistics notation, we write \(Y \sim \text{Poisson}(\lambda)\) to denote that \(Y\) follows a Poisson distribution with rate \(\lambda\).
In R, the Poisson distribution is implemented in the functions dpois for the probability mass function, ppois for the cumulative probability function, qpois for the quantile function, and rpois for generating random numbers:
n <- 10
p <- 0.42
y <- rpois(n, lambda = 10) # generate n random numbers
dpois(y, lambda = 10) # probability mass of each value in a sample
ppois(y, lambda = 10) # cumulative density of each value in a sample
qpois(p, lambda = 10) # quantile for cumulative density of p
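The pmf formula can be checked against dpois(), and a short simulation illustrates the defining property that the variance equals the mean (a sketch; the seed and sample size are arbitrary choices):

```r
lambda <- 10
y <- 7
# The pmf written out following the formula agrees with dpois()
lambda^y * exp(-lambda) / factorial(y)
dpois(y, lambda = lambda)
# For a large Poisson sample, the variance is close to the mean
set.seed(123)
x <- rpois(1e5, lambda = lambda)
c(mean(x), var(x))
```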
Exercise D.1
- Generate a sample of 1000 random numbers from a normal distribution with mean 100 and standard deviation 25. Plot a histogram of the sample (using freq = FALSE) and overlay the probability density function of the normal distribution. Compute the mean and standard deviation of the sample and compare them to the theoretical values.
- Repeat the previous exercise for a Poisson distribution with \(\lambda = 10\).
Solution D.1
- We can generate the sample with rnorm:
n <- 1000
y <- rnorm(n, mean = 100, sd = 25)
We can plot the histogram with hist and overlay the probability density function with curve:
hist(y, freq = FALSE)
curve(dnorm(x, mean = 100, sd = 25), add = TRUE, col = "red")
Finally, we can compute the mean and standard deviation of the sample and compare to the theoretical values:
c(mean(y), 100)
c(sd(y), 25)
- We can generate the sample with rpois:
n <- 1000
y <- rpois(n, lambda = 10)
We should not use histograms and curve with the Poisson distribution because it is discrete. Instead, we can compute the relative frequency of each value in the sample using table():
y_freq <- table(y) / n
We can then make a bar plot of the relative frequencies and overlay the probability mass function with lines. Note that the counts actually observed are stored in the names of y_freq (the smallest count in the sample is usually greater than 1), and that the overlay must be drawn at the bar midpoints returned by barplot:
cases <- as.numeric(names(y_freq)) # the counts actually observed in the sample
midpoints <- barplot(y_freq, ylim = c(0, 0.15))
lines(midpoints, dpois(cases, lambda = 10), col = "red", t = "o")
Finally, we can compute the mean and standard deviation of the sample and compare to the theoretical values:
c(mean(y), 10)
c(sd(y), sqrt(10))