Suppose we have data $\{(x_i, y_i) : i \in \{1, 2, \ldots, n\}\}$ generated by the process $y_i = \beta x_i + u_i$, where the $u_i$ are random errors with zero means, equal variances, and zero correlations with the $x_i$. This data generating process (DGP) satisfies the Gauss-Markov assumptions, so we can obtain an unbiased estimate $\hat{\beta}$ of the coefficient $\beta$ using ordinary least squares (OLS).

Now suppose we restrict our data to observations with $x_i \ge 0$ or $y_i \ge 0$. How will these restrictions change $\hat{\beta}$?

To investigate, let’s create some toy data:

library(dplyr)

n <- 100
set.seed(0)
df <- tibble(x = rnorm(n), u = rnorm(n), y = x + u)

Here $x_i$ and $u_i$ are standard normal random variables, and $y_i = x_i + u_i$ for each observation $i \in \{1, 2, \ldots, 100\}$. Thus $\beta = 1$. The OLS estimate of $\beta$ is $\hat{\beta} = \mathrm{Cov}(x, y) / \mathrm{Var}(x)$, where $x = (x_1, x_2, \ldots, x_{100})$ and $y = (y_1, y_2, \ldots, y_{100})$ are data vectors, $\mathrm{Cov}$ is the covariance operator, and $\mathrm{Var}$ is the variance operator. For these data, we have

cov(df$x, df$y) / var(df$x)
## [1] 1.138795

as our estimate with no selection.
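As a quick cross-check (not in the original post): for a simple regression with an intercept, the fitted slope equals $\mathrm{Cov}(x, y) / \mathrm{Var}(x)$ exactly, so `lm` reproduces the same estimate:

```r
library(dplyr)

# Recreate the toy data from above
set.seed(0)
n <- 100
df <- tibble(x = rnorm(n), u = rnorm(n), y = x + u)

# The fitted slope equals cov(x, y) / var(x) for simple OLS with an intercept
fit <- lm(y ~ x, data = df)
unname(coef(fit)["x"])
## [1] 1.138795
```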

Next, let’s introduce our selection criteria:

df <- df %>%
  tidyr::crossing(criterion = c('x >= 0', 'y >= 0')) %>%
  rowwise() %>%  # eval is annoying to vectorise
  mutate(selected = eval(parse(text = criterion))) %>%
  ungroup()

df
## # A tibble: 200 x 5
##        x       u      y criterion selected
##    <dbl>   <dbl>  <dbl> <chr>     <lgl>   
##  1 -2.22 -0.0125 -2.24  x >= 0    FALSE   
##  2 -2.22 -0.0125 -2.24  y >= 0    FALSE   
##  3 -1.56 -1.12   -2.68  x >= 0    FALSE   
##  4 -1.56 -1.12   -2.68  y >= 0    FALSE   
##  5 -1.54  0.577  -0.963 x >= 0    FALSE   
##  6 -1.54  0.577  -0.963 y >= 0    FALSE   
##  7 -1.44 -1.39   -2.83  x >= 0    FALSE   
##  8 -1.44 -1.39   -2.83  y >= 0    FALSE   
##  9 -1.43 -0.543  -1.97  x >= 0    FALSE   
## 10 -1.43 -0.543  -1.97  y >= 0    FALSE   
## # … with 190 more rows
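As an aside, with only two criteria the `rowwise()` + `eval()` step can be avoided entirely. A vectorised sketch using `if_else()` (rebuilding the toy data under fresh names, so nothing here is the original code):

```r
library(dplyr)

# Rebuild the toy data from above
set.seed(0)
df0 <- tibble(x = rnorm(100), u = rnorm(100), y = x + u)

# Vectorised equivalent of the rowwise eval: pick the right test per criterion
df2 <- df0 %>%
  tidyr::crossing(criterion = c('x >= 0', 'y >= 0')) %>%
  mutate(selected = if_else(criterion == 'x >= 0', x >= 0, y >= 0))
```

This produces the same 200-row table, at the cost of hard-coding the two criteria.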

Now df contains two copies of each observation—one for each selection criterion—and an indicator for whether the observation is selected by each criterion. We can use df to estimate OLS coefficients and their standard errors among observations with $x_i \ge 0$ and among observations with $y_i \ge 0$:

df %>%
  filter(selected) %>%
  group_by(criterion) %>%
  summarise(n = n(),
            estimate = cov(x, y) / var(x),
            std.error = sd(y - estimate * x) / sqrt(n))
## # A tibble: 2 x 4
##   criterion     n estimate std.error
##   <chr>     <int>    <dbl>     <dbl>
## 1 x >= 0       48    1.02      0.136
## 2 y >= 0       47    0.356     0.110

The OLS estimate among observations with $x_i \ge 0$ approximates the true value $\beta = 1$ well. However, the estimate among observations with $y_i \ge 0$ is much smaller than one. We can confirm this visually:
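The original figure isn’t reproduced here, but a plot along these lines can be drawn with ggplot2 (a sketch, not the original plotting code):

```r
library(dplyr)
library(ggplot2)

# Rebuild the toy data and selection indicators from above
set.seed(0)
df <- tibble(x = rnorm(100), u = rnorm(100), y = x + u) %>%
  tidyr::crossing(criterion = c('x >= 0', 'y >= 0')) %>%
  mutate(selected = if_else(criterion == 'x >= 0', x >= 0, y >= 0))

# Scatter plots of the selected observations with fitted OLS lines;
# the dashed line shows the true relationship y = x
p <- df %>%
  filter(selected) %>%
  ggplot(aes(x, y)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
  facet_wrap(~criterion)
p
```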

What’s going on? Why do we get biased OLS estimates of $\beta$ among observations with $y_i \ge 0$ but not among observations with $x_i \ge 0$?

The key is to think about the errors $u_i$ in each case. Since the $x_i$ and $u_i$ are independent, selecting observations with $x_i \ge 0$ leaves the distributions of the $u_i$ unchanged—they still have zero means, equal variances, and zero correlations with the $x_i$. Thus, the Gauss-Markov assumptions still hold and we still obtain unbiased OLS estimates of $\beta$.
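We can sanity-check this on the toy data (a check of my own, not in the original post): among observations with $x_i \ge 0$, the sample correlation between $x$ and $u$ stays close to zero:

```r
# Recreate the toy draws from above
set.seed(0)
x <- rnorm(100)
u <- rnorm(100)

# Selection on x leaves the errors untouched: this correlation is close to zero
cor(x[x >= 0], u[x >= 0])
```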

In contrast, the $x_i$ and $u_i$ are negatively correlated among observations with $y_i \ge 0$. To see why, notice that if $y_i = x_i + u_i$ then $y_i \ge 0$ if and only if $x_i \ge -u_i$. So if $x_i$ is low then $u_i$ must be high (and vice versa) for the observation to be selected. Thus, among selected observations, we have $u_i = \rho x_i + \varepsilon_i$, where $\rho < 0$ indexes (and, in this case, equals) the correlation between the $x_i$ and $u_i$, and where the residuals $\varepsilon_i$ are uncorrelated with the $x_i$. Our DGP then becomes $y_i = (\beta + \rho) x_i + \varepsilon_i$. The $\varepsilon_i$ have equal variances (equal to $(1 - \rho^2)\,\mathrm{Var}(u_i)$ among the selected observations) and, again, are uncorrelated with the $x_i$. Therefore, the OLS estimate $\hat{\rho} = \mathrm{Cov}(u, x) / \mathrm{Var}(x)$ of $\rho$ is unbiased¹, and for our toy data equals $\hat{\rho} \approx -0.644$ among observations with $y_i \ge 0$. Subtracting $\hat{\rho}$ from $\hat{\beta}$ then gives $\hat{\beta} - \hat{\rho} \approx 0.356 - (-0.644) = 1$, recovering the true value $\beta = 1$.
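These quantities are easy to verify on the toy data. Note that $\hat{\beta} - \hat{\rho} = \mathrm{Cov}(x, y - u)/\mathrm{Var}(x) = \mathrm{Cov}(x, x)/\mathrm{Var}(x) = 1$ holds exactly in any sample, by construction:

```r
# Recreate the toy data and the y >= 0 selection from above
set.seed(0)
x <- rnorm(100)
u <- rnorm(100)
y <- x + u
sel <- y >= 0

beta_hat <- cov(x[sel], y[sel]) / var(x[sel])
rho_hat  <- cov(x[sel], u[sel]) / var(x[sel])

round(c(beta_hat, rho_hat, beta_hat - rho_hat), 3)
## [1]  0.356 -0.644  1.000
```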

The table below reports 95% confidence intervals for $\hat{\beta}$, $\hat{\rho}$, and $(\hat{\beta} - \hat{\rho})$, estimated by simulating the DGP $y_i = x_i + u_i$ described above 100 times. The table confirms that the OLS estimate $\hat{\beta}$ of $\beta = 1$ is unbiased among observations with $x_i \ge 0$ but biased negatively among observations with $y_i \ge 0$.

| Observations | $\hat{\beta}$ | $\hat{\rho}$ | $\hat{\beta} - \hat{\rho}$ |
|---|---|---|---|
| All | 1.005 ± 0.002 | 0.005 ± 0.002 | 1.000 ± 0.000 |
| With $x_i \ge 0$ | 1.001 ± 0.004 | 0.001 ± 0.004 | 1.000 ± 0.000 |
| With $y_i \ge 0$ | 0.547 ± 0.003 | −0.453 ± 0.003 | 1.000 ± 0.000 |
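A simulation along these lines can be sketched as follows; the replication count and the summary (means rather than full confidence intervals) are my assumptions, not the original code:

```r
simulate_once <- function(n = 100) {
  x <- rnorm(n); u <- rnorm(n); y <- x + u
  estimates <- function(sel) {
    beta_hat <- cov(x[sel], y[sel]) / var(x[sel])
    rho_hat  <- cov(x[sel], u[sel]) / var(x[sel])
    c(beta = beta_hat, rho = rho_hat, diff = beta_hat - rho_hat)
  }
  rbind(all   = estimates(rep(TRUE, n)),
        x_pos = estimates(x >= 0),
        y_pos = estimates(y >= 0))
}

set.seed(0)
sims <- replicate(100, simulate_once())  # 3 x 3 x 100 array of estimates
means <- apply(sims, 1:2, mean)          # mean estimate per subsample
means
```

The `diff` column equals one in every replication, by the exact identity above, while the `beta` column is biased downward only in the `y_pos` subsample.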

The estimate $\hat{\beta}$ always differs from $\beta$ by $\hat{\rho}$, which is significantly non-zero among observations with $y_i \ge 0$. However, this pattern is not useful empirically because we generally don’t observe the $u_i$ and so can’t estimate $\hat{\rho}$ to back out the true value $\beta = \hat{\beta} - \hat{\rho}$. Instead, we may use the Heckman correction to adjust for the bias introduced through non-random selection.
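For this particular DGP the selection rule truncates $y$ directly, so the correction can also be done by maximum likelihood on the truncated normal model $y \mid x \sim N(\beta x, \sigma^2)$ with $y \ge 0$. The following is an illustrative sketch of that idea (my construction, not the Heckman two-step itself):

```r
# Recreate the toy data and the y >= 0 selection from above
set.seed(0)
x <- rnorm(100); u <- rnorm(100); y <- x + u
sel <- y >= 0
xs <- x[sel]; ys <- y[sel]

# Negative log-likelihood of y | x ~ N(b * x, s^2), truncated to y >= 0:
# each density is renormalised by P(y >= 0 | x) = pnorm(b * x / s)
negll <- function(par) {
  b <- par[1]
  s <- exp(par[2])  # log-parameterise the scale to keep it positive
  -sum(dnorm(ys, mean = b * xs, sd = s, log = TRUE) -
         pnorm(b * xs / s, log.p = TRUE))
}

fit <- optim(c(0, 0), negll)
fit$par[1]  # should be close to the true beta = 1, unlike the naive 0.356
```

Accounting for the truncation in the likelihood removes the downward bias that plain OLS on the selected sample suffers from.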

In empirical settings, selecting observations with $x_i \ge 0$ may lead to biased estimates when (i) there is heterogeneity in the relationship between $y_i$ and $x_i$ across observations $i$, and (ii) OLS is used to estimate an average treatment effect.² In particular, if the $x_i$ are correlated with the observation-specific treatment effects then restricting to observations with $x_i \ge 0$ changes the distribution, and hence the mean, of those effects non-randomly.


  1. We can rewrite $\varepsilon_i = \alpha + (\varepsilon_i - \alpha)$, where $\alpha$ is the mean of the $\varepsilon_i$, and where the $(\varepsilon_i - \alpha)$ have zero means, equal variances, and zero correlations with the $x_i$. ↩︎

  2. Thanks to Shakked for pointing this out. ↩︎