Linear models cannot be estimated when regressors are perfectly correlated, and their coefficient estimates have large variances when regressors are almost perfectly correlated. But how does the correlation between coefficient estimates depend on the correlation between regressors?


The correlation of the $$x_i$$ and $$z_i$$ also determines the precision with which $$(\beta_1\pm\beta_2)$$ can be estimated. In particular, the expression for $$\Var(\hat\beta)$$ above implies $$\Var(\hat\beta_1\pm\hat\beta_2)=\frac{2\sigma^2}{n(1\pm\rho)}$$ for $$\abs{\rho}<1$$. As the $$x_i$$ and $$z_i$$ become more correlated (i.e., $$\rho$$ rises), over-estimates of $$\beta_1$$ must increasingly coincide with under-estimates of $$\beta_2$$, and so the estimate of $$(\beta_1+\beta_2)$$ becomes more precise because the errors cancel out. Conversely, the estimate of $$(\beta_1-\beta_2)$$ becomes less precise as $$\rho$$ rises because the errors in $$\hat\beta_1$$ and $$\hat\beta_2$$ amplify each other.
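
The formula for $$\Var(\hat\beta_1\pm\hat\beta_2)$$ can be checked numerically. The sketch below is my own illustration rather than part of the derivation: it assumes the $$x_i$$ and $$z_i$$ are standardized to have zero mean and unit variance, that their sample correlation equals $$\rho$$ exactly, and that the errors are homoskedastic with variance $$\sigma^2$$. The helper names `make_regressors` and `var_linear_combo` are hypothetical.

```python
import numpy as np


def make_regressors(n, rho, seed=0):
    """Return an (n, 2) matrix whose columns have zero mean, unit variance,
    and sample correlation exactly rho (assumes abs(rho) < 1)."""
    rng = np.random.default_rng(seed)
    raw = rng.standard_normal((n, 2))
    raw -= raw.mean(axis=0)  # center, so the orthonormal basis below is centered too
    q, _ = np.linalg.qr(raw)  # orthonormal columns spanning the centered data
    u = np.sqrt(n) * q[:, 0]  # zero mean, unit variance
    v = np.sqrt(n) * q[:, 1]  # zero mean, unit variance, uncorrelated with u
    x = u
    z = rho * u + np.sqrt(1 - rho**2) * v
    return np.column_stack([x, z])


def var_linear_combo(X, c, sigma2=1.0):
    """Variance of c'beta_hat under homoskedastic errors: sigma2 * c'(X'X)^{-1}c."""
    c = np.asarray(c, dtype=float)
    return sigma2 * c @ np.linalg.solve(X.T @ X, c)


n, sigma2 = 200, 1.0
for rho in (-0.9, -0.5, 0.0, 0.5, 0.9):
    X = make_regressors(n, rho)
    var_sum = var_linear_combo(X, [1, 1], sigma2)    # Var(b1_hat + b2_hat)
    var_diff = var_linear_combo(X, [1, -1], sigma2)  # Var(b1_hat - b2_hat)
    formula_sum = 2 * sigma2 / (n * (1 + rho))
    formula_diff = 2 * sigma2 / (n * (1 - rho))
    print(f"rho={rho:+.1f}  sum: {var_sum:.5f} vs {formula_sum:.5f}  "
          f"diff: {var_diff:.5f} vs {formula_diff:.5f}")
```

Under these assumptions the matrix calculation and the closed form should agree up to floating-point error, with the variance of the sum falling and the variance of the difference rising as $$\rho$$ increases.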

One application of this relationship between $$\Var(\hat\beta_1\pm\hat\beta_2)$$ and $$\rho$$ is to experimental design. Suppose I want to estimate the effect of receiving two treatments—say, doses of a single vaccine—on some outcome of interest. The $$x_i$$ and $$z_i$$ indicate whether individual $$i$$ receives each dose, the coefficients $$\beta_1$$ and $$\beta_2$$ are the average treatment effects (ATEs) of receiving each dose, and the sum $$(\beta_1+\beta_2)$$ is the ATE of receiving both doses. The most precise estimate of $$(\beta_1+\beta_2)$$ obtains when the treatments are perfectly positively correlated: that is, when people receive either zero or two doses, but no-one receives only one. Intuitively, I learn more about the effect of receiving two doses from people who receive both than from people who receive only one, so the most informative experiment cannot have anyone who receives a single dose.

On the other hand, suppose I want to compare the effect of two distinct treatments—say, doses of different vaccines—on my outcome of interest. Then I want to estimate $$(\beta_1-\beta_2)$$, which I can do most precisely when the treatments are perfectly negatively correlated: that is, when people receive one type of vaccine or the other, but no-one receives both. Intuitively, I learn more about the vaccines’ relative effects from people who receive one type than from people who receive both types because the two vaccines may have confounding effects.
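
To make the design comparison concrete, here is a rough sketch that computes the exact variances of $$(\hat\beta_1+\hat\beta_2)$$ and $$(\hat\beta_1-\hat\beta_2)$$ for three allocations of 400 people. It rests on assumptions of my own: the regression includes an intercept, half the sample receives each treatment, the errors are homoskedastic with $$\sigma^2=1$$, and the cell counts are made up for illustration. The function names `dose_design` and `var_of_combo` are hypothetical. I use correlations of $$\pm0.9$$ rather than $$\pm1$$ so that both coefficients remain separately estimable.

```python
import numpy as np


def dose_design(n_neither, n_first_only, n_second_only, n_both):
    """Design matrix with an intercept and two treatment indicators,
    built from the number of people in each treatment cell."""
    counts = [n_neither, n_first_only, n_second_only, n_both]
    x = np.repeat([0, 1, 0, 1], counts)  # receives the first treatment
    z = np.repeat([0, 0, 1, 1], counts)  # receives the second treatment
    return np.column_stack([np.ones(x.size), x, z]).astype(float)


def var_of_combo(X, c, sigma2=1.0):
    """Variance of c'beta_hat under homoskedastic errors: sigma2 * c'(X'X)^{-1}c."""
    c = np.asarray(c, dtype=float)
    return sigma2 * c @ np.linalg.solve(X.T @ X, c)


# Illustrative allocations of 400 people (neither, first only, second only, both)
designs = {
    "mostly zero or both (rho = +0.9)": (190, 10, 10, 190),
    "balanced factorial (rho = 0)": (100, 100, 100, 100),
    "mostly one or the other (rho = -0.9)": (10, 190, 190, 10),
}

for name, cells in designs.items():
    X = dose_design(*cells)
    rho = np.corrcoef(X[:, 1], X[:, 2])[0, 1]
    var_sum = var_of_combo(X, [0, 1, 1])    # the intercept gets zero weight
    var_diff = var_of_combo(X, [0, 1, -1])
    print(f"{name}: rho={rho:+.2f}  Var(sum)={var_sum:.4f}  Var(diff)={var_diff:.4f}")
```

Under these assumptions, the allocation that concentrates people in the "neither" and "both" cells should give the smallest variance for the sum, and the allocation that concentrates people in the single-treatment cells should give the smallest variance for the difference.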

Thanks to Lautaro Chittaro for inspiring this post and commenting on a draft.