Appendix D — Distributions

Author

Alejandro Morales & Joost van Heerwaarden

A probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment. There are many different probability distributions, each with its own characteristics and applications. Distributions can be classified into two main categories: discrete and continuous. Discrete distributions describe outcomes that take countable values (such as counts), while continuous distributions describe outcomes that can take any value within a range.

In statistical inference, probability distributions are used to model the uncertainty in the data and to make predictions about the population based on the sample. The choice of the right distribution depends on the nature of the data. In this regard, one may distinguish between two possible interpretations of probability distributions: mechanistic and phenomenological. A mechanistic interpretation is based on the underlying process that generates the data since most distributions can be derived from some stochastic process (see below for details). However, it is not always possible to match the data to a specific process described by a distribution, and in such cases, a phenomenological interpretation is used, where the distribution is chosen based on its mathematical properties and how well it fits the data.

Distributions are characterized by their probability density function (pdf) for continuous distributions or probability mass function (pmf) for discrete ones, which gives the probability (density) of each possible outcome, and by the cumulative distribution function (cdf), which gives the probability of observing a value less than or equal to a given value. For simplicity, we will use the terms probability density function (pdf) and cumulative distribution function (cdf) for both continuous and discrete distributions.

We will use probability density functions to build likelihood functions, whereas cumulative distribution functions are used to calculate p values. Additionally, we will use the quantile function, which is the inverse of the cumulative distribution function, to calculate confidence intervals.
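As a preview, here is a minimal sketch of these three uses with the Normal distribution functions introduced below (the sample and all numeric values are made up for illustration):

y = c(4.8, 5.1, 5.3) # a small example sample (made up for illustration)
sum(dnorm(y, mean = 5, sd = 0.5, log = TRUE)) # log-likelihood of the sample
2*pnorm(-abs(1.96)) # two-sided p value for a standardized statistic of 1.96
qnorm(c(0.025, 0.975), mean = 5, sd = 0.5) # bounds of a 95% interval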

In the sections below, we discuss some of the most common probability distributions used in statistics, together with their most common mechanistic interpretation. For each distribution, we provide the probability density function (pdf) and cumulative distribution function (cdf), as well as the R functions to work with them.

D.1 Gaussian or Normal distribution

This is the most common distribution in statistics, hence the term Normal, although it is sometimes also referred to as Gaussian in honor of Carl Friedrich Gauss. Mechanistically, the Normal distribution arises from the sum of a large number of independent and identically distributed (i.i.d.) random variables, according to the Central Limit Theorem. This often applies to measurement errors (which are the sum of many small errors) as well as the distribution of means of samples. Phenomenologically, the Normal distribution is used when a symmetric and bell-shaped distribution is needed. Finally, the Normal distribution possesses several analytical properties that make it very convenient for building statistical models.
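As a quick illustration of this mechanism (a sketch, with an arbitrary number of draws and replicates), means of i.i.d. uniform samples are approximately Normal:

means = replicate(1000, mean(runif(40))) # each value is the mean of 40 i.i.d. uniform draws
hist(means, freq = FALSE) # approximately bell-shaped around 0.5
curve(dnorm(x, mean = 0.5, sd = sqrt(1/(12*40))), add = TRUE, col = "red")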

This distribution is meant for continuous data (not counts!) and is defined by two parameters: the mean \(\mu\) and the variance \(\sigma^2\). The probability density function of the normal distribution is given by:

\[ f(y) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2} \]

In statistics notation, we write \(Y \sim N(\mu, \sigma^2)\) to denote that \(Y\) follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\).
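As a check, we can verify that dnorm (introduced below) matches this formula at an arbitrary point:

mu = 0; sigma = 1; y = 0.7 # arbitrary example values
1/(sigma*sqrt(2*pi))*exp(-0.5*((y - mu)/sigma)^2) # pdf formula above
dnorm(y, mean = mu, sd = sigma) # same value from dnorm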

The cumulative distribution function (CDF) of the normal distribution is given by:

\[ F(y) = \int_{-\infty}^{y} f(t) \, dt = \frac{1}{2}\left[1 + \text{erf}\left(\frac{y-\mu}{\sigma\sqrt{2}}\right)\right] \]

where \(\text{erf}\) is the error function.
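Base R does not provide an erf function, but it can be expressed in terms of pnorm through the identity \(\text{erf}(x) = 2\Phi(x\sqrt{2}) - 1\), where \(\Phi\) is the standard normal CDF. A quick numerical check (with arbitrary example values):

erf = function(x) 2*pnorm(x*sqrt(2)) - 1 # erf in terms of pnorm
mu = 2; sigma = 3; y = 4.5 # arbitrary example values
0.5*(1 + erf((y - mu)/(sigma*sqrt(2)))) # CDF from the formula above
pnorm(y, mean = mu, sd = sigma) # same value from pnorm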

In R, the normal distribution is implemented in the function dnorm for the probability density function, pnorm for the cumulative distribution function, qnorm for the quantile function, and rnorm for generating random numbers. The parameters of these functions are mean and sd (the standard deviation \(\sigma\), not the variance \(\sigma^2\)):

n = 10
p = 0.42
y = rnorm(n, mean = 0, sd = 1) # generate n random numbers
dnorm(y, mean = 0, sd = 1) # probability density of each value in a sample
pnorm(y, mean = 0, sd = 1) # cumulative probability of each value in a sample
qnorm(p, mean = 0, sd = 1) # quantile for cumulative probability p

D.2 t-distribution

The t-distribution is used to model the distribution of the standardized sample mean when the sample size is small and the population variance is unknown and must be estimated from the sample. Mechanistically, the t-distribution arises as the ratio of a standard normal random variable to the square root of a chi-squared random variable divided by its degrees of freedom. Phenomenologically, the t-distribution is used when a symmetric, bell-shaped distribution with heavier tails than the Normal is needed. The t-distribution is defined by a single parameter: the degrees of freedom \(\nu\). The probability density function of the t-distribution is given by:

\[ f(y) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\Gamma\left(\frac{\nu}{2}\right)} \left(1+\frac{y^2}{\nu}\right)^{-\frac{\nu+1}{2}} \]

where \(\Gamma\) is the gamma function. In statistics notation, we write \(Y \sim t(\nu)\) to denote that \(Y\) follows a t-distribution with \(\nu\) degrees of freedom.
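A sketch of this mechanistic construction (number of draws chosen arbitrarily), comparing the simulated variance to the theoretical value \(\nu/(\nu-2)\):

nu = 10
z = rnorm(10000) # standard normal draws
v = rchisq(10000, df = nu) # chi-squared draws with nu degrees of freedom
y = z/sqrt(v/nu) # behaves like rt(10000, df = nu)
c(var(y), nu/(nu - 2)) # sample variance vs theoretical variance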

In R, the t-distribution is implemented in the function dt for the probability density function, pt for the cumulative distribution function, qt for the quantile function, and rt for generating random numbers:

n = 10
p = 0.42
y = rt(n, df = 10) # generate n random numbers
dt(y, df = 10) # probability density of each value in a sample
pt(y, df = 10) # cumulative probability of each value in a sample
qt(p, df = 10) # quantile for cumulative probability p
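As an example of the link to p values mentioned earlier, the two-sided p value for a hypothetical observed t statistic (the value is made up for illustration) can be calculated with pt:

t_obs = 2.3 # hypothetical observed t statistic
2*pt(-abs(t_obs), df = 10) # two-sided p value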

D.3 Binomial distribution

The binomial distribution is used to model the number of successes in a fixed number of independent Bernoulli trials. A Bernoulli trial is a random experiment with two possible outcomes: success and failure. The binomial distribution is defined by two parameters: the number of trials \(n\) and the probability of success \(p\). The probability mass function of the binomial distribution is given by:

\[ f(y) = \binom{n}{y} p^y (1-p)^{n-y} \]

where \(\binom{n}{y}\) is the binomial coefficient, which is the number of ways to choose \(y\) successes out of \(n\) trials. In statistics notation, we write \(Y \sim B(n, p)\) to denote that \(Y\) follows a binomial distribution with \(n\) trials and probability of success \(p\).
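A sketch of this mechanism (parameter values chosen arbitrarily), building a single Binomial draw as the sum of independent Bernoulli trials:

size = 10 # number of Bernoulli trials
prob = 0.5 # probability of success in each trial
trials = runif(size) < prob # TRUE (success) with probability prob
sum(trials) # one draw from B(10, 0.5), cf. rbinom(1, size = size, prob = prob)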

In R, the binomial distribution is implemented in the function dbinom for the probability mass function, pbinom for the cumulative distribution function, qbinom for the quantile function, and rbinom for generating random numbers:

n = 10
p = 0.42
y = rbinom(n, size = 10, prob = 0.5) # generate n random numbers
dbinom(y, size = 10, prob = 0.5) # probability mass of each value in a sample
pbinom(y, size = 10, prob = 0.5) # cumulative probability of each value in a sample
qbinom(p, size = 10, prob = 0.5) # quantile for cumulative probability p
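We can also check dbinom against the pmf formula above, using choose for the binomial coefficient (example values are arbitrary):

choose(10, 4)*0.5^4*(1 - 0.5)^(10 - 4) # pmf formula at y = 4
dbinom(4, size = 10, prob = 0.5) # same value from dbinom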

D.4 Poisson distribution

The Poisson distribution is used to model the number of events occurring in a fixed interval of time or space (mechanistically) or, phenomenologically, for count data where the variance is (approximately) equal to the mean. It is defined by a single parameter: the rate \(\lambda\), which is the average number of events in the interval. The probability mass function of the Poisson distribution is given by:

\[ f(y) = \frac{\lambda^y e^{-\lambda}}{y!} \]

In statistics notation, we write \(Y \sim \text{Poisson}(\lambda)\) to denote that \(Y\) follows a Poisson distribution with rate \(\lambda\).
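Two quick checks of these properties (example values are arbitrary): the pmf formula matches dpois (introduced below), and the sample mean and variance are both close to \(\lambda\):

lambda = 10
lambda^7*exp(-lambda)/factorial(7) # pmf formula at y = 7
dpois(7, lambda = lambda) # same value from dpois
y = rpois(10000, lambda = lambda)
c(mean(y), var(y)) # both approximately equal to lambda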

In R, the Poisson distribution is implemented in the function dpois for the probability mass function, ppois for the cumulative distribution function, qpois for the quantile function, and rpois for generating random numbers:

n = 10
p = 0.42
y = rpois(n, lambda = 10) # generate n random numbers
dpois(y, lambda = 10) # probability mass of each value in a sample
ppois(y, lambda = 10) # cumulative probability of each value in a sample
qpois(p, lambda = 10) # quantile for cumulative probability p
Exercise D.1
  1. Generate a sample of 1000 random numbers from a normal distribution with mean 100 and standard deviation of 25. Plot a histogram of the sample (using freq = FALSE) and overlay the probability density function of the normal distribution. Compute the mean and standard deviation of the sample and compare to theoretical values.
  2. Repeat the previous exercise for a Poisson distribution with \(\lambda = 10\).
Solution D.1
  1. We can generate the sample with rnorm:
n = 1000
y = rnorm(n, mean = 100, sd = 25)

We can plot the histogram with hist and overlay the probability density function with curve:

hist(y, freq = FALSE)
curve(dnorm(x, mean = 100, sd = 25), add = TRUE, col = "red")

Finally, we can compute the mean and standard deviation of the sample and compare to the theoretical values:

c(mean(y), 100)
c(sd(y), 25)
  2. We can generate the sample with rpois:
n <- 1000
y <- rpois(n, lambda = 10)

We should not use histograms and curve with the Poisson distribution because it is discrete. Instead, we can compute the relative frequency of each value in the sample using table():

y_freq <- table(y) / n

We can then make a bar plot of the relative frequencies and overlay the probability mass function with lines. The counts observed in the sample are stored in the names of y_freq, and barplot returns the midpoints of the bars, which we use to align the overlay with the bars:

cases <- as.integer(names(y_freq)) # counts observed in the sample
mids <- barplot(y_freq, ylim = c(0, 0.15)) # barplot returns the bar midpoints
lines(mids, dpois(cases, lambda = 10), col = "red", type = "o")

Finally, we can compute the mean and standard deviation of the sample and compare to the theoretical values:

c(mean(y), 10)
c(sd(y), sqrt(10))