Suppose we have data $\{(x_i, y_i) : i \in \{1, 2, \ldots, n\}\}$ generated by the process $$y_i = \beta x_i + u_i,$$ where the $u_i$ are random errors with zero means, equal variances, and zero correlations with the $x_i$. This data generating process (DGP) satisfies the Gauss-Markov assumptions, so we can obtain an unbiased estimate $\hat{\beta}$ of the coefficient $\beta$ using ordinary least squares (OLS).

Now suppose we restrict our data to observations with $x_i \ge 0$ or $y_i \ge 0$. How will these restrictions change $\hat{\beta}$?

To investigate, let’s create some toy data:

library(dplyr)

n <- 100
set.seed(0)
df <- tibble(x = rnorm(n), u = rnorm(n), y = x + u)

Here $x_i$ and $u_i$ are standard normal random variables, and $y_i = x_i + u_i$ for each observation $i \in \{1, 2, \ldots, 100\}$. Thus $\beta = 1$. The OLS estimate of $\beta$ is $$\hat{\beta} = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)},$$ where $x = (x_1, x_2, \ldots, x_{100})$ and $y = (y_1, y_2, \ldots, y_{100})$ are data vectors, $\operatorname{Cov}$ is the covariance operator, and $\operatorname{Var}$ is the variance operator. For these data, we have

cov(df$x, df$y) / var(df$x)
## [1] 1.138795

as our estimate with no selection.

Next, let’s introduce our selection criteria:

df <- df %>%
  tidyr::crossing(criterion = c('x >= 0', 'y >= 0')) %>%
  rowwise() %>%  # eval is annoying to vectorise
  mutate(selected = eval(parse(text = criterion))) %>%
  ungroup()

df
## # A tibble: 200 x 5
##        x       u      y criterion selected
##    <dbl>   <dbl>  <dbl> <chr>     <lgl>   
##  1 -2.22 -0.0125 -2.24  x >= 0    FALSE   
##  2 -2.22 -0.0125 -2.24  y >= 0    FALSE   
##  3 -1.56 -1.12   -2.68  x >= 0    FALSE   
##  4 -1.56 -1.12   -2.68  y >= 0    FALSE   
##  5 -1.54  0.577  -0.963 x >= 0    FALSE   
##  6 -1.54  0.577  -0.963 y >= 0    FALSE   
##  7 -1.44 -1.39   -2.83  x >= 0    FALSE   
##  8 -1.44 -1.39   -2.83  y >= 0    FALSE   
##  9 -1.43 -0.543  -1.97  x >= 0    FALSE   
## 10 -1.43 -0.543  -1.97  y >= 0    FALSE   
## # … with 190 more rows

Now df contains two copies of each observation (one for each selection criterion) and an indicator for whether the observation is selected by each criterion. We can use df to estimate OLS coefficients and their standard errors among observations with $x_i \ge 0$ and among observations with $y_i \ge 0$:

df %>%
  filter(selected) %>%
  group_by(criterion) %>%
  summarise(n = n(),
            estimate = cov(x, y) / var(x),
            std.error = sd(y - estimate * x) / sqrt(n))
## # A tibble: 2 x 4
##   criterion     n estimate std.error
##   <chr>     <int>    <dbl>     <dbl>
## 1 x >= 0       48    1.02      0.136
## 2 y >= 0       47    0.356     0.110

The OLS estimate among observations with $x_i \ge 0$ approximates the true value $\beta = 1$ well. However, the estimate among observations with $y_i \ge 0$ is much smaller than one. We can confirm this visually:
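One way to draw this comparison is to plot the selected observations under each criterion alongside their OLS fits. The sketch below uses ggplot2 and assumes the df built above:

```r
library(ggplot2)

# Plot selected observations under each criterion, with the OLS fit (solid)
# and the true line y = x (dashed). Assumes `df` from above.
df %>%
  filter(selected) %>%
  ggplot(aes(x, y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
  facet_wrap(~ criterion)
```

The fitted line in the y >= 0 panel is visibly flatter than the dashed 45-degree line, while the x >= 0 panel's fit tracks it closely.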

What’s going on? Why do we get biased OLS estimates of $\beta$ among observations with $y_i \ge 0$ but not among observations with $x_i \ge 0$?

The key is to think about the errors $u_i$ in each case. Since the $x_i$ and $u_i$ are independent, selecting observations with $x_i \ge 0$ leaves the distributions of the $u_i$ unchanged: they still have zero means, equal variances, and zero correlations with the $x_i$. Thus, the Gauss-Markov assumptions still hold and we still obtain unbiased OLS estimates of $\beta$.

In contrast, the $x_i$ and $u_i$ are negatively correlated among observations with $y_i \ge 0$. To see why, notice that if $y_i = x_i + u_i$ then $y_i \ge 0$ if and only if $x_i \ge -u_i$. So if $x_i$ is low then $u_i$ must be high (and vice versa) for the observation to be selected. Thus, among selected observations, we have $$u_i = \rho x_i + \varepsilon_i,$$ where $\rho < 0$ indexes (and, in this case, equals) the correlation between the $x_i$ and $u_i$, and where the residuals $\varepsilon_i$ are uncorrelated with the $x_i$. Our DGP then becomes $$y_i = (\beta + \rho) x_i + \varepsilon_i.$$ The $\varepsilon_i$ have equal variances (equal to $1 + \rho^2$ in this case) and, again, are uncorrelated with the $x_i$. Therefore, the OLS estimate $$\hat{\rho} = \frac{\operatorname{Cov}(u, x)}{\operatorname{Var}(x)}$$ of $\rho$ is unbiased¹, and for our toy data equals $\hat{\rho} \approx -0.644$ among observations with $y_i \ge 0$. Subtracting $\hat{\rho}$ from $\hat{\beta}$ then gives $\hat{\beta} - \hat{\rho} \approx 0.356 - (-0.644) = 1$, recovering the true value $\beta = 1$.
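Because this is simulated data, we observe the $u_i$ and can check this decomposition directly on the toy data (using the df built above):

```r
# Among observations selected by y >= 0: beta_hat absorbs the bias rho_hat,
# and subtracting rho_hat recovers beta. Assumes `df` from above.
df %>%
  filter(selected, criterion == 'y >= 0') %>%
  summarise(beta_hat  = cov(x, y) / var(x),
            rho_hat   = cov(u, x) / var(x),
            corrected = beta_hat - rho_hat)
```

Here corrected equals one identically, because $y - u = x$ and so $\hat{\beta} - \hat{\rho} = \operatorname{Cov}(x, x)/\operatorname{Var}(x) = 1$.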

The table below reports 95% confidence intervals for $\hat{\beta}$, $\hat{\rho}$, and $\hat{\beta} - \hat{\rho}$, estimated by simulating the DGP $y_i = x_i + u_i$ described above 100 times. The table confirms that the OLS estimate $\hat{\beta}$ of $\beta = 1$ is unbiased among observations with $x_i \ge 0$ but biased negatively among observations with $y_i \ge 0$.

| Observations | $\hat{\beta}$ | $\hat{\rho}$ | $\hat{\beta} - \hat{\rho}$ |
|---|---|---|---|
| All | 1.005 ± 0.002 | 0.005 ± 0.002 | 1.000 ± 0.000 |
| With $x_i \ge 0$ | 1.001 ± 0.004 | 0.001 ± 0.004 | 1.000 ± 0.000 |
| With $y_i \ge 0$ | 0.547 ± 0.003 | -0.453 ± 0.003 | 1.000 ± 0.000 |
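The simulation behind this table can be sketched as follows (the helper name simulate_draw is mine, and the interval half-widths use a normal approximation):

```r
library(dplyr)
library(purrr)

# One simulation: draw a fresh sample, apply each criterion, and compute
# beta_hat and rho_hat among the selected observations.
# ('TRUE' is a criterion that selects every observation.)
simulate_draw <- function(n = 100) {
  tibble(x = rnorm(n), u = rnorm(n), y = x + u) %>%
    tidyr::crossing(criterion = c('TRUE', 'x >= 0', 'y >= 0')) %>%
    rowwise() %>%
    mutate(selected = eval(parse(text = criterion))) %>%
    ungroup() %>%
    filter(selected) %>%
    group_by(criterion) %>%
    summarise(beta_hat = cov(x, y) / var(x),
              rho_hat  = cov(u, x) / var(x),
              .groups = 'drop')
}

# Run 100 simulations and summarise each estimate as mean +/- 1.96 standard errors
set.seed(0)
map_dfr(1:100, ~ simulate_draw()) %>%
  group_by(criterion) %>%
  summarise(across(c(beta_hat, rho_hat),
                   list(mean = mean, halfwidth = ~ 1.96 * sd(.x) / sqrt(n()))))
```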

The estimate $\hat{\beta}$ always differs from $\beta$ by $\hat{\rho}$, which is significantly non-zero among observations with $y_i \ge 0$. However, this pattern is not useful empirically because we generally don’t observe the $u_i$ and so can’t estimate $\hat{\rho}$ to back out the true value $\beta = \hat{\beta} - \hat{\rho}$. Instead, we may use the Heckman correction to adjust for the bias introduced through non-random selection.
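As a rough sketch of what that correction looks like in R, assuming the sampleSelection package (and a selection equation selected ~ x that is my choice for this toy setup, not part of the original analysis):

```r
library(sampleSelection)

# Heckman two-step: a probit of selection on x yields the inverse Mills
# ratio, which the second-stage regression of y on x (among selected
# observations) includes as an extra regressor. Assumes `df` from above.
heckman_fit <- heckit(selected ~ x, y ~ x,
                      data = filter(df, criterion == 'y >= 0'))
summary(heckman_fit)
```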

In empirical settings, selecting observations with $x_i \ge 0$ may lead to biased estimates when (i) there is heterogeneity in the relationship between $y_i$ and $x_i$ across observations $i$, and (ii) OLS is used to estimate an average treatment effect.² In particular, if the $x_i$ are correlated with the observation-specific treatment effects then restricting to observations with $x_i \ge 0$ changes the distribution, and hence the mean, of those effects non-randomly.
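As a toy illustration of this point (the data frame het and the coefficient name beta_i are mine), give each observation its own slope $\beta_i = 1 + 0.5 x_i$, so that effects are correlated with the $x_i$, and compare the OLS estimates with and without the restriction:

```r
library(dplyr)

set.seed(0)
n <- 10000
het <- tibble(x = rnorm(n),
              beta_i = 1 + 0.5 * x,      # observation-specific effects, correlated with x
              y = beta_i * x + rnorm(n))

# OLS slope among all observations vs. among those with x >= 0.
# The average effect over all observations is E[beta_i] = 1, but restricting
# to x >= 0 shifts the distribution of the beta_i and inflates the estimate.
c(all   = with(het, cov(x, y) / var(x)),
  x_pos = with(filter(het, x >= 0), cov(x, y) / var(x)))
```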


  1. We can rewrite $\varepsilon_i = \alpha + (\varepsilon_i - \alpha)$, where $\alpha$ is the mean of the $\varepsilon_i$, and where the $(\varepsilon_i - \alpha)$ have zero means, equal variances, and zero correlations with the $x_i$. ↩︎

  2. Thanks to Shakked for pointing this out. ↩︎