Linear models cannot be estimated when their regressors are perfectly correlated, and their coefficient estimates have large variances when the regressors are almost perfectly correlated. But how does the correlation of the coefficient estimates depend on the correlation of the regressors?

To answer this question, suppose I have data \((y_i,x_i,z_i)_{i=1}^n\) generated by the process $$\newcommand{\abs}[1]{\lvert#1\rvert} \DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Var}{Var} \renewcommand{\epsilon}{\varepsilon} y_i=\beta_1x_i+\beta_2z_i+\epsilon_i,$$ where the \(x_i\) and \(z_i\) are normalized to have zero mean and unit variance, and where the \(\epsilon_i\) are iid with zero mean and zero correlation with the \(x_i\) and \(z_i\). If the \(x_i\) and \(z_i\) are not perfectly correlated then the OLS estimator \(\hat\beta\) of the coefficient vector \((\beta_1,\beta_2)\) has variance $$\DeclareMathOperator{\Var}{Var} \Var(\hat\beta)=\frac{\sigma^2}{n(1-\rho^2)}\begin{bmatrix}1&-\rho\\-\rho&1\end{bmatrix},$$ where \(\sigma^2\) is the variance of the \(\epsilon_i\), and where \(\rho\) is the (empirical) correlation of the \(x_i\) and \(z_i\). It follows that $$\Cor(\hat\beta_1,\hat\beta_2)=-\Cor(x_i,z_i)$$ whenever the \(x_i\) and \(z_i\) are not perfectly correlated. As their correlation grows, the \(x_i\) and \(z_i\) carry increasingly similar information, and the slope of the \(y_i\) against either regressor alone approaches \((\beta_1+\beta_2)\). The OLS estimates \(\hat\beta_1\) and \(\hat\beta_2\) therefore increasingly “compete” for contributions to their sum: if sampling error leads to one coefficient being over-estimated then the other coefficient must be under-estimated to preserve the sum. This competition drives the increasingly negative correlation of \(\hat\beta_1\) and \(\hat\beta_2\) as the \(x_i\) and \(z_i\) become more correlated.
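This is easy to check by simulation. The sketch below (my own illustration in NumPy; the sample size, correlation, noise level, and coefficient values are arbitrary) fixes a design, simulates many outcome vectors, and compares the correlation of the OLS estimates across simulations with the correlation of the regressors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, sigma, reps = 200, 0.8, 1.0, 5_000   # arbitrary values
beta = np.array([1.0, 2.0])                  # arbitrary true coefficients

# Fixed design: two regressors with (approximately) the target correlation,
# normalized to zero mean and unit variance as in the setup above
cov = np.array([[1.0, rho], [rho, 1.0]])
X = rng.multivariate_normal(np.zeros(2), cov, size=n)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Refit the model on many simulated outcomes
estimates = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + sigma * rng.standard_normal(n)
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

print("Cor(x, z):                ", np.corrcoef(X.T)[0, 1])
print("Cor(beta1_hat, beta2_hat):", np.corrcoef(estimates.T)[0, 1])
```

The second printed correlation comes out close to the negative of the first, as the formula above predicts.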

The correlation of the \(x_i\) and \(z_i\) also determines the precision with which \((\beta_1\pm\beta_2)\) can be estimated. In particular, the expression for \(\Var(\hat\beta)\) above implies $$\Var(\hat\beta_1\pm\hat\beta_2)=\frac{2\sigma^2}{n(1\pm\rho)}$$ for \(\abs{\rho}<1\). As the \(x_i\) and \(z_i\) become more correlated (i.e., \(\rho\) rises), over-estimates of \(\beta_1\) must increasingly coincide with under-estimates of \(\beta_2\), and so the estimate of \((\beta_1+\beta_2)\) becomes more precise because the errors cancel out. Conversely, the estimate of \((\beta_1-\beta_2)\) becomes less precise as \(\rho\) rises because the errors in \(\hat\beta_1\) and \(\hat\beta_2\) amplify each other.
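To see these formulas in action, here is a small sketch (again mine, not part of the derivation) that plugs a few values of \(\rho\) into the expression for \(\Var(\hat\beta)\) above and computes the implied variances of \(\hat\beta_1+\hat\beta_2\) and \(\hat\beta_1-\hat\beta_2\):

```python
import numpy as np

n, sigma = 200, 1.0  # arbitrary sample size and error standard deviation
for rho in [0.0, 0.5, 0.9, 0.99]:
    # Var(beta_hat) = sigma^2 / (n * (1 - rho^2)) * [[1, -rho], [-rho, 1]]
    V = sigma**2 / (n * (1 - rho**2)) * np.array([[1, -rho], [-rho, 1]])
    a_sum, a_diff = np.array([1.0, 1.0]), np.array([1.0, -1.0])
    print(f"rho={rho:5.2f}  Var(sum)={a_sum @ V @ a_sum:.5f}"
          f"  Var(diff)={a_diff @ V @ a_diff:.5f}")
```

As \(\rho\) rises towards one, the variance of the sum shrinks towards \(\sigma^2/n\) while the variance of the difference blows up, matching \(2\sigma^2/(n(1\pm\rho))\).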

One application of this relationship between \(\Var(\hat\beta_1\pm\hat\beta_2)\) and \(\rho\) is to experimental design. Suppose I want to estimate the effect of receiving two treatments—say, doses of a single vaccine—on some outcome of interest. The \(x_i\) and \(z_i\) indicate whether individual \(i\) receives each dose, the coefficients \(\beta_1\) and \(\beta_2\) are the average treatment effects (ATEs) of receiving each dose, and the sum \((\beta_1+\beta_2)\) is the ATE of receiving both doses. The most precise estimate of \((\beta_1+\beta_2)\) obtains when the treatments are perfectly positively correlated: that is, when people receive either zero or two doses, but no-one receives only one. (At that extreme the individual coefficients \(\beta_1\) and \(\beta_2\) are no longer identified, but their sum still is.) Intuitively, I learn more about the effect of receiving two doses from people who receive both than from people who receive only one, so the most informative experiment cannot have anyone who receives a single dose.
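As a rough check of this intuition, here is a simulation sketch (assumptions mine: Bernoulli(1/2) assignment, additive dose effects, homoskedastic noise, and made-up effect sizes) comparing the precision of the estimated combined effect under a “zero or two doses” design and under independent assignment of each dose. Under the former only the sum \((\beta_1+\beta_2)\) is identified, so I regress the outcome on the shared dose indicator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, reps = 200, 1.0, 5_000
beta1, beta2 = 1.0, 2.0                               # hypothetical dose effects

def sum_estimates(design):
    """Monte Carlo estimates of beta1 + beta2 under a given design."""
    out = np.empty(reps)
    for r in range(reps):
        x = rng.integers(0, 2, n).astype(float)       # first dose
        if design == "zero_or_two":
            z = x.copy()                              # second dose iff first dose
            X = x[:, None]                            # only the sum is identified
        else:
            z = rng.integers(0, 2, n).astype(float)   # doses assigned independently
            X = np.column_stack([x, z])
        y = beta1 * x + beta2 * z + sigma * rng.standard_normal(n)
        out[r] = np.linalg.lstsq(X, y, rcond=None)[0].sum()
    return out

for design in ["zero_or_two", "independent"]:
    print(design, "SE of estimated (beta1 + beta2):",
          round(sum_estimates(design).std(), 3))
```

With these made-up numbers, the “zero or two doses” design delivers the smaller standard error, as the formula for \(\Var(\hat\beta_1+\hat\beta_2)\) predicts.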

On the other hand, suppose I want to compare the effects of two distinct treatments—say, doses of different vaccines—on my outcome of interest. Then I want to estimate \((\beta_1-\beta_2)\), which I can do most precisely when the treatments are perfectly negatively correlated: that is, when people receive one type of vaccine or the other, but no-one receives both. Intuitively, I learn more about the vaccines’ relative effects from people who receive only one type than from people who receive both types, because receiving both reveals only the vaccines’ combined effect and so confounds their individual effects.
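The same kind of sketch (same made-up assumptions as above) works for the difference: it compares a “one vaccine or the other” design with independent assignment, under which some people receive both.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, reps = 200, 1.0, 5_000
beta1, beta2 = 1.0, 2.0                               # hypothetical vaccine effects

def diff_estimates(design):
    """Monte Carlo estimates of beta1 - beta2 under a given design."""
    out = np.empty(reps)
    for r in range(reps):
        x = rng.integers(0, 2, n).astype(float)       # first vaccine
        if design == "one_or_other":
            z = 1.0 - x                               # everyone gets exactly one
        else:
            z = rng.integers(0, 2, n).astype(float)   # vaccines assigned independently
        y = beta1 * x + beta2 * z + sigma * rng.standard_normal(n)
        coef = np.linalg.lstsq(np.column_stack([x, z]), y, rcond=None)[0]
        out[r] = coef[0] - coef[1]
    return out

for design in ["one_or_other", "independent"]:
    print(design, "SE of estimated (beta1 - beta2):",
          round(diff_estimates(design).std(), 3))
```

Here the “one or the other” design delivers the smaller standard error for the difference, mirroring the result for the sum above.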


Thanks to Lautaro Chittaro for inspiring this post and commenting on a draft.