Linear models cannot be estimated when regressors are perfectly correlated, and their coefficient estimates have large variances when regressors are almost perfectly correlated. But how does the correlation of the estimated coefficients depend on the correlation of the regressors?
To answer this question, suppose I have data (y_i, x_i, z_i)_{i=1}^n generated by the process
\newcommand{\abs}[1]{\lvert#1\rvert} \DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Var}{Var} \renewcommand{\epsilon}{\varepsilon}
y_i = \beta_1 x_i + \beta_2 z_i + \epsilon_i,
where the x_i and z_i are normalized to have zero mean and unit variance, and where the \epsilon_i are iid with zero mean and zero correlation with the x_i and z_i. If the x_i and z_i are not perfectly correlated, then the OLS estimator \hat\beta of the coefficient vector (\beta_1,\beta_2) has variance
\Var(\hat\beta) = \frac{\sigma^2}{n(1-\rho^2)} \begin{bmatrix} 1 & -\rho \\ -\rho & 1 \end{bmatrix},
where \sigma^2 is the variance of the \epsilon_i and \rho is the (empirical) correlation of the x_i and z_i. It follows that
\Cor(\hat\beta_1, \hat\beta_2) = -\Cor(x_i, z_i)
whenever the x_i and z_i are not perfectly correlated. As that correlation grows, the mean slope of the data in the directions spanned by the x_i and z_i approaches (\beta_1+\beta_2), and so the OLS estimates \hat\beta_1 and \hat\beta_2 increasingly “compete” for contributions to their sum: if sampling error leads to one coefficient being over-estimated, then the other must be under-estimated to preserve the sum. This competition is why the correlation of \hat\beta_1 and \hat\beta_2 becomes increasingly negative as the x_i and z_i become more correlated.
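A quick Monte Carlo check makes this concrete. The sketch below (in Python with NumPy; the sample size, target correlation, coefficients, and number of replications are arbitrary illustrative choices) holds one standardized design fixed, re-draws the errors many times, and compares the simulated correlation of \hat\beta_1 and \hat\beta_2 with -\rho.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative values
n, target_rho, sigma = 200, 0.8, 1.0  # sample size, regressor correlation, error sd
beta1, beta2 = 1.0, 2.0               # true coefficients
reps = 10_000                         # number of error draws

# Draw one design (x, z) with roughly the target correlation, then standardize
xz = rng.multivariate_normal([0, 0], [[1, target_rho], [target_rho, 1]], size=n)
xz = (xz - xz.mean(axis=0)) / xz.std(axis=0)
x, z = xz[:, 0], xz[:, 1]
rho = np.corrcoef(x, z)[0, 1]  # empirical correlation of the regressors

# Re-draw the errors many times, holding the design fixed, and re-estimate by OLS
X = np.column_stack([x, z])
estimates = np.empty((reps, 2))
for r in range(reps):
    y = beta1 * x + beta2 * z + rng.normal(0, sigma, size=n)
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"empirical Cor(x_i, z_i):       {rho:.3f}")
print(f"simulated Cor(b1_hat, b2_hat): {np.corrcoef(estimates.T)[0, 1]:.3f}")  # close to -rho
```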
The correlation of the x_i and z_i also determines the precision with which (\beta_1\pm\beta_2) can be estimated. In particular, the expression for \Var(\hat\beta) above implies \Var(\hat\beta_1\pm\hat\beta_2)=\frac{2\sigma^2}{n(1\pm\rho)} for \abs{\rho}<1. As the x_i and z_i become more correlated (i.e., \rho rises), over-estimates of \beta_1 must increasingly coincide with under-estimates of \beta_2, and so the estimate of (\beta_1+\beta_2) becomes more precise because the errors cancel out. Conversely, the estimate of (\beta_1-\beta_2) becomes less precise as \rho rises because the errors in \hat\beta_1 and \hat\beta_2 amplify each other.
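For completeness, this expression follows from the variance matrix above and the usual variance-of-a-sum identity:
\begin{aligned} \Var(\hat\beta_1 \pm \hat\beta_2) &= \Var(\hat\beta_1) + \Var(\hat\beta_2) \pm 2\Cov(\hat\beta_1, \hat\beta_2) \\ &= \frac{\sigma^2}{n(1-\rho^2)}\left(1 + 1 \mp 2\rho\right) \\ &= \frac{2\sigma^2(1 \mp \rho)}{n(1-\rho^2)} \\ &= \frac{2\sigma^2}{n(1 \pm \rho)}. \end{aligned}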
One application of this relationship between \Var(\hat\beta_1\pm\hat\beta_2) and \rho is to experimental design. Suppose I want to estimate the effect of receiving two treatments—say, doses of a single vaccine—on some outcome of interest. The x_i and z_i indicate whether individual i receives each dose, the coefficients \beta_1 and \beta_2 are the average treatment effects (ATEs) of receiving each dose, and the sum (\beta_1+\beta_2) is the ATE of receiving both doses. The most precise estimate of (\beta_1+\beta_2) obtains when the treatments are perfectly positively correlated: that is, when people receive either zero or two doses, but no-one receives only one. Intuitively, I learn more about the effect of receiving two doses from people who receive both than from people who receive only one, so the most informative experiment cannot have anyone who receives a single dose.
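A simulation sketch of this comparison (Python with NumPy; the effect sizes, sample size, and number of replications are made-up values, and the binary dose indicators are not standardized as above, so the constants differ from the formula but the precision comparison is the point): under a "coupled" design everyone receives zero or two doses, so (\beta_1+\beta_2) is the slope on a single both-doses indicator, while under an independent-assignment design it is the sum of the two estimated coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up values for illustration
n, sigma = 400, 1.0      # sample size and outcome noise
beta1, beta2 = 0.5, 0.3  # ATEs of the first and second dose
reps = 5_000             # number of simulated experiments

sum_hat = {"coupled": [], "independent": []}
for r in range(reps):
    eps = rng.normal(0, sigma, size=n)

    # Coupled design: each person gets both doses or neither (x = z)
    both = rng.integers(0, 2, size=n)
    y = beta1 * both + beta2 * both + eps
    X = np.column_stack([np.ones(n), both])  # intercept + both-doses indicator
    sum_hat["coupled"].append(np.linalg.lstsq(X, y, rcond=None)[0][1])

    # Independent design: each dose assigned independently (correlation near zero)
    x = rng.integers(0, 2, size=n)
    z = rng.integers(0, 2, size=n)
    y = beta1 * x + beta2 * z + eps
    X = np.column_stack([np.ones(n), x, z])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    sum_hat["independent"].append(b[1] + b[2])

for design, draws in sum_hat.items():
    print(f"{design:>11}: sd of estimated (beta_1 + beta_2) = {np.std(draws):.4f}")
# The coupled design gives the smaller standard deviation (by a factor of about sqrt(2) here).
```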
On the other hand, suppose I want to compare the effects of two distinct treatments—say, doses of different vaccines—on my outcome of interest. Then I want to estimate (\beta_1-\beta_2), which I can do most precisely when the treatments are perfectly negatively correlated: that is, when people receive one type of vaccine or the other, but no-one receives both. Intuitively, I learn more about the vaccines’ relative effects from people who receive only one type than from people who receive both types, because the two vaccines’ individual effects are hard to disentangle in someone who receives both.
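To see both trade-offs side by side, one can tabulate the closed-form variances from above across values of \rho (a small sketch; the error variance \sigma^2 = 1 and sample size n = 100 are arbitrary):

```python
sigma2, n = 1.0, 100  # arbitrary error variance and sample size
print(" rho   Var(b1+b2)   Var(b1-b2)")
for rho in [-0.9, -0.5, 0.0, 0.5, 0.9]:
    var_sum = 2 * sigma2 / (n * (1 + rho))   # variance of the estimated sum
    var_diff = 2 * sigma2 / (n * (1 - rho))  # variance of the estimated difference
    print(f"{rho:4.1f}   {var_sum:10.4f}   {var_diff:10.4f}")
# The sum is estimated most precisely as rho approaches 1, the difference as rho approaches -1.
```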
Thanks to Lautaro Chittaro for inspiring this post and commenting on a draft.