Appendix C — Statistical inference

Author

Alejandro Morales & Joost van Heerwaarden

C.1 Basics of statistical inference

C.1.1 Random variables

Random variables are fundamental to statistical inference as we use them to account for uncertainty. A random variable is a variable whose value can change in a random manner but still following a particular distribution of possible values. Random variables can be classified into two main groups: discrete and continuous. Discrete random variables take on a countable number of distinct values, while continuous random variables take on an uncountable number of values. In practice, we will use discrete random variables (and their associated distributions) to analyze count data, such as the number of insects on a plant, while continuous random variables (and their associated distributions) will be used to analyze continuous data, such as the biomass of a plant.

A sample is a set of values realized from a random variable (either generated through stochastic simulation or representing observations of the real world). The sample space is the set of all possible values that a random variable can take. In some cases, the sample space will be unbounded (i.e., all possible values are allowed) while in some cases it will be bounded (i.e., the values are restricted to a certain range).

C.1.2 Probability

The probability distribution of a random variable is a function that assigns probabilities to the different values within the sample space. Probability determines how likely or frequent each value is to be generated or observed. In the case of discrete random variables, each individual value will have a specific probability assigned to it. In the case of continuous random variables, the probability is assigned to intervals of values rather than individual values. For this reason, when dealing with continuous random variables, we talk about probability density rather than probability.

In mathematical notation, we will represent the probability of a value \(x\) being realized from a discrete random variable \(X\) as \(p(X = x)\) whereas the probability density for a continuous random variable is denoted as \(f(X = x)\). When we refer to the distributions we will simply write \(p(X)\) or \(f(X)\).

Many probability distributuons exist, but probabilities must always satisfy certain properties:

  • The probability of any value must be between 0 and 1 (probability densities cannot be negative either but can be larger than 1).

  • The sum of the probabilities across the sample space must be equal to 1. In the case of continuous random variables, the integral of the probability density function over the entire sample space must be equal to 1.

  • The probability (density) of two values being realized independently is the product of their respective probabilities (densities). That is, given values \(a\) and \(b\) from random variables \(X\) and \(Y\), their join probability (density) is \(p(X = x, Y = y) = p(X = x)p(Y = y)\)

  • If the two random variables are not independent, then one needs to take into account the conditional probability (density) of the second value being realized given the first value. We denote conditional probabilities with a vertical bar, such that their joint probability (density) becomes \(p(X = x, Y = y) = p(X = x)p(Y = y|X = x)\). The term \(p(Y = y|X = x)\) is read as the probability (density) of \(Y\) being \(y\) given that \(X\) is \(x\).

One can see that the definition of independence is simply \(p(Y = y|X = x) = p(Y = y)\). That is, if the probability of \(Y\) being \(y\) is the same regardless of the value of \(X\), then \(X\) and \(Y\) are independent.

C.1.3 From sample to population

Statistical inference is the process of making estimates about a population based on a sample of data. The goal of statistical inference is to draw conclusions about the population based on the sample, while accounting for the uncertainty in the data. There are two main types of statistical inference: estimation and hypothesis testing. Estimation is the process of making estimates about the population parameters, such as the mean or variance, based on the sample data. Hypothesis testing is the process of testing a hypothesis about the population, such as whether the population mean is equal to a specific value.

The concept of population and sample is fundamental in statistical inference. The population is the entire group of individuals or objects that we are interested in studying, while the sample is a subset of the population that we actually observe. In some cases there is a clear population, such as all the farmers in a region, whereas in other cases the population is more abstract such all the possible measurements a sensor could make of the same phenomenon, given all the unknown random factors that could affect the measurement.

In classical statistics, the population is assumed to be fixed and the sample is assumed to be drawn randomly from the population. That is, the same population could lead to different samples and therefore different estimates would be made about the same population. In practice, we will calculate the best estimate of the population parameter based on a single sample, but we will also calculate the uncertainty of that estimate to account for the fact that it will vary across different possible samples. This leads to two types of estimates: point estimate (our best guess of the population parameter) and interval estimate (a range of values that reflect the uncertainty of the point estimate).

C.1.3.1 Samples and random variables

There are two possible ways to think about the relationship between samples and random variables:

  1. The sample contains multiple observations of a single random variable. That is, a sample of \(n\) values correspond to \(n\) observations of the same random variable.

  2. Each value of a sample is an observation a different random variable. Thus, a sample of \(n\) values corresponds to \(n\) different random variables.

While the first interpretation is more intuitive and matches how we think about samples in practice, the second interpretation is more general and allows for more complex statistical models. Also, the first interpretation feels more natural when there is a clear population from which samples are drawn whereas the second interpretation is more natural when the variation is due to different random factors that affect each observation (e.g., measurement errors, uncontrollable factors, etc.).

The two interpretations are equivalent when the random variables are independent and identically distributed (i.i.d.), which is a common assumption in simple statistical models. However, a big part of this course is to challenge these assumptions and explore more complex models where the mean and/or variance can vary across observations and the observations are not independent. For that reason, we will adopt the second interpretation in this course.

C.1.3.2 Point estimator

A point estimator is a function of the sample data that is used to estimate a population parameter. For example, the sample mean may be used to estimate the population mean, or the sample variance may be used to estimate the population variance. Because it is a function of the sample data, a point estimator is itself a random variable and will vary across different samples. The distribution of a point estimator is called the sampling distribution, and it is used to quantify the uncertainty in the estimate and test hypotheses. In practice, we do not know the sampling distribution since we only have one sample, but we can estimate it from the sample data based on assumptions about the population or observation process.

The quality of a point estimator is determined by its bias and variance. The bias of a point estimator is the difference between the true value of the population parameter and the average value of the point estimator, whereas the variance of a point estimator is a measure of how much the estimates vary across different samples.

C.1.3.3 Interval estimates

C.1.3.3.1 Standard error

The standard error is a measure of the uncertainty in an estimate of a population parameter based on a sample of data. The standard error is defined as the standard deviation of the sampling distribution of the estimator. Of course, in practice we do not know this distribution (since we only have one sample!) so the standard error is itself estimated from the sample.

For the sample mean, the standard error is often estimated as:

\[ \widehat{\sigma_{\bar{x}}} = \frac{s_x}{\sqrt{n}} \]

where \(s\) is the sample standard deviation, and \(n\) is the sample size. For other statistical estimators the standard error can be more complex to calculate, but the idea is the same: it is a measure of how much the estimate is expected to vary across different samples.

C.1.3.3.2 Confidence intervals

A confidence interval is a range of values that is used to estimate the population parameter with a certain level of confidence (denoted in percentage) and also relies on the sample distribution. The confidence interval is defined as the interval that contains the population parameter with a certain frequency when computed from different samples (e.g., 95% of the time the confidence interval will include the population parameter). As with the standard error, it is not possible to calculate the exact confidence interval since we do not have access to the sampling distribution but rather it is estimated from the sample data.

For the sample mean, the confidence interval may be calculated as:

\[ \bar{x} \pm t_{\alpha/2, n-1} \widehat{\sigma_{\bar{x}}} \]

where \(t_{\alpha/2, n-1}\) is the \(\alpha/2\) quantile in the t distribution with n - 1 degrees of freedom. For example, for a 95% confidence interval, \(\alpha = 0.05\).

C.1.4 Means and expectations

C.1.4.1 Mean of a sample

Th arithmetic mean (also know as average or just mean) is a measure of the central tendency of a set of values. The mean of a sample is defined as the sum of the values divided by the number of values:

\[ \overline{x} = \frac{1}{n} \sum_{i=1}^n x_i \]

where \(n\) is the number of observations, and \(x_i\) is the \(i\)-th observation. The mean is a measure of the central tendency of the data and is often used as a summary statistic to describe the data. A generalization of the mean is the weighted mean, where each value is multiplied by a weight before summing. The weighted mean is defined as:

\[ \overline{x} = \frac{1}{n} \sum_{i=1}^n w_i x_i \]

where \(w_i\) are the weights assigned to each value.

C.1.4.2 Expectation

The expectation of random variable is a measure of the central tendency of the distribution and may be seen as a generalization of the weighted average. The expectation of a random variable \(x\) is denoted by \(E[X]\) and is defined as the weighted mean of the possible values of \(x\), where the weights are given by the probability of each value. For a discrete random variable, the expectation is defined as:

\[ E[X] = \sum_{i=1}^n x_i p(x_i) \]

where \(x_i\) are the possible values of the random variable, and \(p(x_i)\) are the probabilities of each value. For a continuous random variable, the expectation is defined as:

\[ E[X] = \int_{-\infty}^{\infty} x f(x) dx \]

where \(f(x)\) is the probability density function of each value \(x\).

The concept of expectation can generalize to functions of random variables. The expectation of a function of a random variable \(g\) is defined as:

\[ E[g(X)] = \sum_{i=1}^n g(x_i) p(x_i) \]

for discrete random variables, and:

\[ E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x) dx \]

for continuous random variables.

C.1.5 Variance, covariance, and correlation

C.1.5.1 Variance

The variance of a sample or a random variable is a measure of how spread out the values are around the mean. On a sample, the variance is defined as the average of the squared differences between each value and the mean:

\[ \text{Var}(\boldsymbol{\mathrm{x}}) = s^2_x = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2, \]

where \(n\) is the number of observations, \(x_i\) is the \(i\)-th observation, and \(\bar{x}\) is the sample mean. On a random variable, the variance is defined as the expected value of the squared difference between the random variable and its mean:

\[ \text{Var}(X) = \sigma^2_X = E[(X - \mu_X)^2] \]

where \(\mu_X\) is the mean of the random variable, and \(x\) is the random variable. Since the variance does not have the same units as the data, it is common to take the square root of the variance to obtain the standard deviation for easier interpretation:

\[ \text{SD}(X) = \sqrt{\text{Var}(X)} \]

C.1.5.2 Covariance

The covariance is a measure of the relationship between two random variables. The covariance between two random variables \(X\) and \(Y\) is defined as the expected value of the product of the differences between each random variable and its mean:

\[ \text{Cov}(X, Y) = \sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)] \]

where \(E\) denotes the expected value, \(\mu_X\) and \(\mu_Y\) are the means of \(X\) and \(Y\), respectively. The covariance can be positive, negative, or zero. A positive covariance indicates that the two random variables tend to increase or decrease together, while a negative covariance indicates that one random variable tends to increase when the other decreases. A covariance of zero indicates that the two random variables are linearly independent (but there may still be a non-linear dependency).

C.1.5.3 Correlation

The correlation is a standardized measure of the relationship between two random variables. The most common measure of correlation is the Pearson correlation coefficient that is simply the covariance divided by the product of the standard deviations of the two random variables:

\[ \rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y} \]

The interpretation of the correlation coefficient is that it ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. By being a unitless and normalized quantity, it is easier to interpret and compare corrleation coefficients than covariances. It will also allow to specify the strength of the relationship between variables independent of their individual variation.

C.1.5.4 Properties of variances

The variance of a sum of random variables is not the sum of the variances (unless they are independent), but rather the sum of the variances plus the sum of the covariances. That is, for two random variables \(X\) and \(Y\):

\[ \sigma^2_{X + Y} =\sigma^2_{X} + \sigma^2_{Y} + 2\sigma^2_{X,Y} \]

The variance of a difference of random variables is the same as the variance of the sum of the random variables.

The variance of a constant times a random variable is the constant squared times the variance of the random variable:

\[ \sigma^2_{aX} = a^2 \sigma^2_{X}. \]

However, the variance of a constant matrix times a random variable is as follows:

\[ \sigma^2_{\boldsymbol{\mathrm{A}}X} = \boldsymbol{\mathrm{A}} \sigma^2_{X} \boldsymbol{\mathrm{A}}^T. \]

C.1.6 Ordinary least squares (OLS)

C.1.6.1 Point estimator of the mean

We define error or residual (\(\epsilon_i\)) as the difference between a value sampled from a random variable (\(y_i\)) and the mean of said random variable (\(mu_Y\)):

\[ \epsilon_i = y_i - \mu_Y. \]

The method of Ordinary Least Squares (OLS) is a statistical technique used to estimate the value of \(\mu_Y\) given a sample of size \(n\) by minimizing the sum of the squared errors:

\[ \widehat{\mu_Y} = \underset{\mu_Y}{\operatorname{argmin}} \sum_{i=1}^n (y_i - \mu_Y)^2. \]

We can get the solution by taking the first derivative and setting it to zero:

\[ 2 \sum_{i=1}^n -y_i + \widehat{\mu_Y} = 0, \]

from which we can derive the following expression for the OLS estimator:

\[ \widehat{\mu_Y} = \frac{1}{n} \sum_{i=1}^n y_i = \overline{y}. \]

C.1.6.2 Standard error of the mean

We can derive the standard error of the mean (\(\sigma_{\widehat{\mu_Y}}\)) by applying the properties of variances. That is, if \(\widehat{\mu_Y}\) is defined as:

\[ \widehat{\mu_Y} = \frac{1}{n} \sum_{i=1}^n y_i \]

The variance of \(\widehat{\mu_Y}\) becomes:

\[ \sigma^2_{\widehat{\mu_Y}} = \frac{1}{n^2} \sum_{i=1}^n \sigma^2_{y_i}, \]

But since the \(y_i\) are assumed to be i.i.d. (i.e. each observation is assumed to be drawn from a separate random variable but they are all assumed to follow the same distribution), they all have the same variance (\(\sigma^2_{Y}\)) and hence:

\[ \sigma^2_{\widehat{\mu_Y}} = \frac{\sigma^2_{Y}}{n}. \]

The standard error of the mean thus becomes:

\[ \sigma_{\widehat{\mu_Y}} = \frac{\sigma_{Y}}{\sqrt{n}}. \]

In practice, we do not know \(\sigma_{Y}\), so we need to use its estimate:

\[ \sigma_{\widehat{\mu_Y}} = \frac{\widehat{\sigma_{Y}}}{\sqrt{n}}. \]

C.1.6.3 Confidence interval of the mean

The standard error of the mean can be used to calculate the confidence interval of the mean using the t-distribution. The t-distribution is similar to the normal distribution but accounts for the uncertainty in the estimate of the standard deviation (i.e., the fact that we generally have to use \(\widehat{\sigma_{Y}}\) instead of \(\sigma_{Y}\)).

C.1.7 Maximum likelihood estimation (MLE)

The likelihood function is a fundamental concept in statistics that is used to estimate the parameters of a statistical model. The likelihood function is defined as the probability of observing the data given the parameters of the model. The likelihood function is often denoted by \(L(\theta|\boldsymbol{\mathrm{x}})\), where \(\theta\) is the parameter(s) of the model and \(X\) is the data.

The likelihood function is defined as the product of the probability density function of each data point, given the parameters of the model:

\[ L(\theta|\boldsymbol{\mathrm{x}}) = \prod_{i=1}^n f(x_i|\theta) \]

where \(f(x_i|\theta)\) is the probability density function of the data point \(x_i\) given the parameters \(\theta\) of the model.

The likelihood function is used to estimate the parameters of the model by finding the values of the parameters that maximize the likelihood function. This process is called maximum likelihood estimation. One problem with the likelihood function is that it can adopt very small numbers, especially when the data set is large, since we are often multiplying values smaller than 1. To avoid numerical problems, it is common to work with the log-likelihood function, which is defined as the natural logarithm of the likelihood function:

\[ \ell(\theta|\boldsymbol{\mathrm{x}}) = \log L(\theta|\boldsymbol{\mathrm{x}}) = \sum_{i=1}^n \log f(x_i|\theta) \]

The log-likelihood function has the same maximum as the likelihood function (because \(\log\) is a monotonic function), but it is easier to work with numerically since it involves sums instead of products.

When the probability density function of the data is a normal distribution, the maximum likelihood estimation is mathematically equivalent to the ordinary least squares method, hence it can be seen as a generalization of OLS. Indeed, the density function of the Normal distribution is:

\[ f(y) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2} \]

where \(\mu\) is the mean and \(\sigma^2\) is the variance of the distribution. The log-likelihood thus becomes:

\[ \begin{align} \ell(\mu |\boldsymbol{\mathrm{y}}) &= \log \frac{1}{\sigma \sqrt{2\pi}} - \frac{1}{\sigma^2} \sum_{i=1}^n \left(x_i-\mu\right)^2 \\ &\propto -\sum_{i=1}^n \left(x_i-\mu\right)^2 = -\epsilon_i^2. \end{align} \]

We can see that maximizing \(\ell(\mu |\boldsymbol{\mathrm{y}})\) is equivalent to minimizing \(\epsilon_i^2\), that is, ordinary least squares, at least when the goal is estimating the mean.

As we will see later, in most models we will compute the solution to this problem numerically through linear algebra, so we will not need to calculate these functions explicitly. There are cases where that is required (e.g., non-linear models) but for now this is just to show the connection between OLS and MLE and illustrate the conceptual procedure. That is, regardless of the implementation details, the goal is to find the parameters that maximize the likelihood of the data given the model.

C.2 The general linear model

C.2.1 Generative form

A linear model defines a random vector \(\boldsymbol{Y}\) of length \(n\) that follows a multivariate Normal distribution:

\[ \boldsymbol{Y} \sim \mathrm{MVN}\left(\boldsymbol{\upmu_Y}, \boldsymbol{\Sigma_Y}\right) \]

where \(\boldsymbol{\upmu_Y}\) is the mean and \(\boldsymbol{\Sigma_Y}\) is the covariance matrix of the distribution. Technically, \(\boldsymbol{Y}\) represents a vector of random variables, but it is customary to speak of a single response variable in a linear regression model. To reconcile this, we will refer to \(\boldsymbol{Y}\) and samples from it as instances of the response variable, with the implicit understanding that each observation corresponds to a different random variable (rather than different samples from the same random variable).

The mean is assumed to be a linear combination of independent variables (known as predictors or covariates):

\[ \boldsymbol{\upmu_Y} = \beta_0 + \boldsymbol{x_1} \beta_1 + \ldots + \boldsymbol{x_p} \beta_p, \]

where \(\beta_0\) is a constant denoted as intercept, \(\boldsymbol{x_1}, \ldots, \boldsymbol{x_p}\) are the different predictors and \(\beta_1, \ldots, \beta_p\) are the coefficients of the linear model. In matrix notation this can be written as:

\[ \boldsymbol{\upmu_Y} = \boldsymbol{\mathrm{X}} \boldsymbol{\upbeta}, \]

where \(\boldsymbol{\mathrm{X}}\) is the design matrix of the model with \(n\) rows and \(p\) columns, \(\boldsymbol{\upbeta}\) is the vector of coefficients of the model. Note that \(\boldsymbol{\mathrm{X}}\) is not a random matrix, but rather fixed and known.

The covariance matrix is assumed to be diagonal with constant variance, that is:

\[ \boldsymbol{\Sigma_Y} = \sigma_Y^2 \boldsymbol{I}. \]

The linear model can thus be written as:

\[ \boldsymbol{Y} \sim \mathrm{MVN}\left(\boldsymbol{\mathrm{X}} \boldsymbol{\upbeta}, \sigma_Y^2 \boldsymbol{I}\right). \]

Alternatively, given that the covariance matrix is diagonal, we can also express the linear model for each element of the random vector \(\boldsymbol{Y}\) as a univariate Normal distribution, namely:

\[ y_i \sim \mathrm{N}\left(\boldsymbol{\mathrm{x_i}} \boldsymbol{\upbeta}, \sigma_Y \right), \]

where \(\boldsymbol{\mathrm{x_i}}\) is the vector of predictors for observation \(i\).

C.2.2 Regression form

However, for estimating the coefficients of the linear model using Ordinary Least Squares given a sample of data we will express the model in regression form. First let’s remember (or learn) that any Normal distribution can be written as the sum of a mean and a centered Normal distribution, that is:

\[ \mathrm{N}(\mu, \sigma^2) = \mu + N(0, \sigma) \]

and for a multivariate Normal distribution with a diagonal constant covariance matrix, this is:

\[ \mathrm{MVN}(\boldsymbol{\upmu}, \boldsymbol{\Sigma}) = \boldsymbol{\upmu} + \mathrm{MVN}(\boldsymbol{0}, \sigma_Y \boldsymbol{\mathrm{I}}). \]

This means that we can write the linear model as:

\[ \begin{align} \boldsymbol{\mathrm{y}} &= \boldsymbol{\upmu} + \boldsymbol{\upepsilon} \\ &= \boldsymbol{\mathrm{X}} \boldsymbol{\upbeta} + \boldsymbol{\upepsilon}, \end{align} \]

where \(\boldsymbol{\mathrm{y}}\) is a sample from \(\boldsymbol{Y}\), and \(\boldsymbol{\upepsilon}\) is a vector of residuals (i.e., the difference between the observed and fitted values) from the distribution \(\mathrm{N}(\boldsymbol{0}, \sigma_Y \boldsymbol{\mathrm{I}})\).

Alternatively, we can write the linear model for each element of the random vector \(\boldsymbol{\mathrm{y}}\) as:

\[ y_i = \boldsymbol{\mathrm{x_i}} \boldsymbol{\upbeta} + \upepsilon_i, \]

where \(\upepsilon_i\) is the residual for observation \(i\), assumed to be a sample from \(\mathrm{N}\left(0, \sigma_Y\right)\).

C.2.2.1 Assumptions of linear regression

From the definitions above we can better understand now the assumptions of linear regression models:

  1. The predictor variables (\(\boldsymbol{\mathrm{x_i}}\)) are fixed and known (weak exogeneity).
  2. The average values of the instances of the response variable (\(y_i\)) are a linear combination of coefficients and predictors (linearity).
  3. All instances of the response variable have the same variance (homoscedasticity).
  4. The instances of the response variable are independent of each other (independence).
  5. No perfect linear relationship exist among two or more predictor variables (no perfect multicollinearity).
  6. The expected value of the residuals is zero (mean of residuals is zero).

C.3 Linear regression (ordinary least squares)

C.3.1 Estimating coefficients

The OLS solution for \(\boldsymbol{\upbeta}\) requires minimizing \(\sum_{i=0}^{n} \widehat{\epsilon_{i}}^2\):

\[ \begin{align} \sum_{i=0}^{n} \widehat{\epsilon_{i}}^2 &= \sum_{i=0}^{n} (y_i - \widehat{y_i})^2 \\ &= \sum_{i=0}^{n} \left(y_i - \sum_{j=0}^{p} x_{i,j} \widehat{\beta_j} \right)^2 \\ &= \sum_{i=0}^{n} \left(y_i^2 + \sum_{j=0}^{p} x_{i,j}^2 \widehat{\beta_j}^2 - 2 y_i \sum_{j=0}^{p} x_{i,j} \widehat{\beta_j} \right). \end{align} \]

We now take the derivative with respect to \(\widehat{\beta_j}\):

\[ \begin{align} \frac{\partial \sum_{i=0}^{n} \widehat{\epsilon_{i}}^2 }{\partial \widehat{\beta_j}} &= \sum_{i=0}^{n} \left(2 \sum_{j=0}^{p} x_{i,j}^2 \widehat{\beta_j} - 2 y_i \sum_{j=0}^{p} x_{i,j} \right)\\ &= 2 \sum_{i=0}^{n} \sum_{j=0}^{p} x_{i,j}^2 \widehat{\beta_j} - x_{i,j} y_i . \end{align} \]

Setting this to zero to find the minimum \(\sum_{i=0}^{n} \widehat{\epsilon_{i}}^2\) leads to

\[ \sum_{i=0}^{n} \sum_{j=0}^{p} x_{i,j}^2 \widehat{\beta_j} = \sum_{i=0}^{n} x_{i,j} y_i. \]

We can rewrite this by considering \(\boldsymbol{\mathrm{x_{i}}}\) as the vector with all predictors for observation \(i\) and \(\boldsymbol{\widehat{\upbeta}}\) as the vector with all coefficients. This leads to:

\[ \left( \sum_{i=0}^{n} \boldsymbol{\mathrm{x_{i}}}^2 \right) \boldsymbol{\widehat{\upbeta}} = \sum_{i=0}^{n} \boldsymbol{\mathrm{x_{i}}} y_i. \]

We can then left-multiply both sides by \(\left( \sum_{i=0}^{n} \boldsymbol{\mathrm{x_{i}}}^2 \right)^{-1}\):

\[ \left( \sum_{i=0}^{n} \boldsymbol{\mathrm{x_{i}}}^2 \right)^{-1} \left( \sum_{i=0}^{n} \boldsymbol{\mathrm{x_{i}}}^2 \right) \boldsymbol{\widehat{\upbeta}} = \left( \sum_{i=0}^{n} \boldsymbol{\mathrm{x_{i}}}^2 \right)^{-1} \sum_{i=0}^{n} \boldsymbol{\mathrm{x_{i}}} y_i, \]

leading to

\[ \boldsymbol{\mathrm{I}} \boldsymbol{\widehat{\upbeta}} = \boldsymbol{\widehat{\upbeta}} = \left( \sum_{i=0}^{n} \boldsymbol{\mathrm{x_{i}}}^2 \right)^{-1} \sum_{i=0}^{n} \boldsymbol{\mathrm{x_{i}}} y_i. \]

In practice, rather than squaring \(\boldsymbol{\mathrm{x_{i}}}\) one uses the transpose of \(\boldsymbol{\mathrm{x_{i}}}\) since

\[ \boldsymbol{\mathrm{x_{i}}}^T \boldsymbol{\mathrm{x_{i}}} = \boldsymbol{\mathrm{x_{i}}}^2, \]

leading to the expression:

\[ \boldsymbol{\widehat{\upbeta}} = \left( \sum_{i=0}^{n} \boldsymbol{\mathrm{x_{i}}}^T \boldsymbol{\mathrm{x_{i}}} \right)^{-1} \sum_{i=0}^{n} \boldsymbol{\mathrm{x_{i}}} y_i. \]

Finally we can write this in terms of the design matrix \(\boldsymbol{X}\) since both sums over \(i\) above are equivalent to matrix-matrix products, such that:

\[ \boldsymbol{\widehat{\upbeta}} = \left( \boldsymbol{\mathrm{X}}^T \boldsymbol{\mathrm{X}} \right)^{-1} \boldsymbol{\mathrm{X}}^T \boldsymbol{\mathrm{y}}, \]

where \(\boldsymbol{\mathrm{X}}\) represents the matrix of predictors with \(n\) rows and \(p\) columns (known as design matrix), \(\boldsymbol{\mathrm{y}}\) is the vector of observations and \(\boldsymbol{\widehat{\upbeta}}\) is the vector of coefficients. \(\boldsymbol{\mathrm{X}}\) will correspond to the data collected on the predictor variables (one column per variable plus an intercept, each row corresponding to an observation) and \(\boldsymbol{\mathrm{y}}\) will correspond to the data collected on the response variable.

From the above, we can see that given \(\boldsymbol{\mathrm{X}}\) and \(\boldsymbol{\mathrm{y}}\), we can calculate \(\boldsymbol{\widehat{\upbeta}}\) through linear algebra operations. In R this would correspond to:

solve(t(X) %*% X) %*% t(X) %*% y

where solve calculates the inverse of a matrix, t transposes a matrix and %*% is the matrix-matrix multiplication operator. When we use the lm function in R, this is what is happening under the hood. It will first create the design matrix X based on the model formula and the data provided. Then it will apply linear algebra operations to calculate the coefficients.

C.3.2 Estimating standard errors

C.3.2.1 Standard errors of the coefficients

The standard errors of the coefficients can be estimated by applying the rules for variance to the OLS solution. Given this:

\[ \boldsymbol{\widehat{\upbeta}} = \left( \boldsymbol{\mathrm{X}}^T \boldsymbol{\mathrm{X}} \right)^{-1} \boldsymbol{\mathrm{X}}^T \boldsymbol{\mathrm{y}}, \]

and knowing that the variance of a constant matrix times a random variable \(\boldsymbol{\mathrm{A}} Y\) is given by \(\boldsymbol{\mathrm{A}} \mathrm{var}(Y) \boldsymbol{\mathrm{A^T}}\), then

\[ \boldsymbol{\sigma_{\widehat{\upbeta}}^2} = \left( \boldsymbol{\mathrm{X}}^T \boldsymbol{\mathrm{X}} \right)^{-1} \boldsymbol{\mathrm{X}}^T \widehat{\sigma_Y^2} \boldsymbol{\mathrm{X}} \left( \boldsymbol{\mathrm{X}}^T \boldsymbol{\mathrm{X}} \right)^{-1}. \]

Since \(\widehat{\sigma_Y^2}\) is a constant in a linear regression model, we can reorganize the right hand side:

\[ \boldsymbol{\sigma_{\widehat{\upbeta}}^2} = \widehat{\sigma_Y^2} \left( \boldsymbol{\mathrm{X}}^T \boldsymbol{\mathrm{X}} \right)^{-1} \boldsymbol{\mathrm{X}}^T \boldsymbol{\mathrm{X}} \left( \boldsymbol{\mathrm{X}}^T \boldsymbol{\mathrm{X}} \right)^{-1} \]

and now it is easier to see that \(\left(\boldsymbol{\mathrm{X}}^T \boldsymbol{\mathrm{X}}\right)^{-1}\) and \(\boldsymbol{\mathrm{X}}^T \boldsymbol{\mathrm{X}}\) cancel out, leading to:

\[ \boldsymbol{\sigma_{\widehat{\upbeta}}^2} = \widehat{\sigma_Y^2} \left( \boldsymbol{\mathrm{X}}^T \boldsymbol{\mathrm{X}} \right)^{-1}. \]

C.3.3 Standard error of the response

To calculate \(\widehat{\sigma_Y^2}\) we need to go back to the definition of the linear model but defining \(\boldsymbol{\widehat{\mathrm{y}}} = \boldsymbol{\mathrm{X}} \boldsymbol{\upbeta}\), (know as fitted values), such that:

\[ \boldsymbol{\mathrm{y}} = \boldsymbol{\widehat{\mathrm{y}}} + \boldsymbol{\epsilon}. \]

In OLS, one assumes that the residuals (\(\boldsymbol{\epsilon}\)) are a sample from a normal distribution with mean zero. We can then show that \(\boldsymbol{\widehat{\mathrm{y}}}\) is the mean of the distribution of the random vector \(\boldsymbol{\mathrm{Y}}\) from where \(\boldsymbol{\mathrm{y}}\) was sampled:

\[ \begin{align} E[\boldsymbol{\mathrm{y}}] &= E[\boldsymbol{\widehat{\mathrm{y}}} + \boldsymbol{\epsilon}] \\ &= E[\boldsymbol{\widehat{\mathrm{y}}}] + E[\boldsymbol{\epsilon}] \\ &= \boldsymbol{\widehat{\mathrm{y}}} + 0 \\ &= \boldsymbol{\widehat{\mathrm{y}}} = \boldsymbol{\upmu_Y}. \end{align} \]

Applying the formula of the variance to the random vector \(\boldsymbol{\mathrm{Y}}\) :

\[ \begin{align} \sigma_Y^2 &= E[(\boldsymbol{\mathrm{Y}} - \boldsymbol{\upmu_Y})^2] \\ &= E[(\boldsymbol{\mathrm{Y}} - \boldsymbol{\widehat{\mathrm{y}}})^2] \\ &= E[(\boldsymbol{\widehat{\mathrm{y}}} + \boldsymbol{\epsilon} - \boldsymbol{\widehat{\mathrm{y}}})^2] \\ &=E[\boldsymbol{\epsilon}^2] = \sigma_{\epsilon}^2. \end{align} \]

\(\sigma_{\epsilon}\) is the standard deviation of the errors or residuals (in lm’s output this is reported as residual standard error). It can be estimated as:

\[ \widehat{\sigma_{\epsilon}} = \sqrt{\frac{\sum_{i=0}^{n} \widehat{\epsilon_{i}}^2}{n-p}}, \]

where \(p\) is the number of coefficients in the model (including the intercept).