Suppose we have data $\{(x_i, y_i) : i \in \{1, 2, \ldots, n\}\}$ generated by the process $$y_i = \beta x_i + u_i,$$ where the $u_i$ are random errors with zero means, equal variances, and zero correlations with the $x_i$. This data generating process (DGP) satisfies the Gauss-Markov assumptions, so we can obtain an unbiased estimate $\hat{\beta}$ of the coefficient $\beta$ using ordinary least squares (OLS).

Now suppose we restrict our data to observations with $x_i \ge 0$ or $y_i \ge 0$. How will these restrictions change $\hat{\beta}$?

To investigate, let’s create some toy data:

library(dplyr)

n <- 100
set.seed(0)
df <- tibble(x = rnorm(n), u = rnorm(n), y = x + u)

Here $x_i$ and $u_i$ are standard normal random variables, and $y_i = x_i + u_i$ for each observation $i \in \{1, 2, \ldots, 100\}$. Thus $\beta = 1$. The OLS estimate of $\beta$ is $$\hat{\beta} = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)},$$ where $x = (x_1, x_2, \ldots, x_{100})$ and $y = (y_1, y_2, \ldots, y_{100})$ are data vectors, $\operatorname{Cov}$ is the covariance operator, and $\operatorname{Var}$ is the variance operator. For these data, we have

cov(df$x, df$y) / var(df$x)
## [1] 1.138795

as our estimate with no selection.

Next, let’s introduce our selection criteria:

df <- df %>%
  tidyr::crossing(criterion = c('x >= 0', 'y >= 0')) %>%
  rowwise() %>%  # eval is annoying to vectorise
  mutate(selected = eval(parse(text = criterion))) %>%
  ungroup()

df
## # A tibble: 200 x 5
##        x       u      y criterion selected
##    <dbl>   <dbl>  <dbl> <chr>     <lgl>   
##  1 -2.22 -0.0125 -2.24  x >= 0    FALSE   
##  2 -2.22 -0.0125 -2.24  y >= 0    FALSE   
##  3 -1.56 -1.12   -2.68  x >= 0    FALSE   
##  4 -1.56 -1.12   -2.68  y >= 0    FALSE   
##  5 -1.54  0.577  -0.963 x >= 0    FALSE   
##  6 -1.54  0.577  -0.963 y >= 0    FALSE   
##  7 -1.44 -1.39   -2.83  x >= 0    FALSE   
##  8 -1.44 -1.39   -2.83  y >= 0    FALSE   
##  9 -1.43 -0.543  -1.97  x >= 0    FALSE   
## 10 -1.43 -0.543  -1.97  y >= 0    FALSE   
## # … with 190 more rows

Now df contains two copies of each observation (one for each selection criterion) and an indicator for whether the observation is selected by each criterion. We can use df to estimate OLS coefficients and their standard errors among observations with $x_i \ge 0$ and among observations with $y_i \ge 0$:

df %>%
  filter(selected) %>%
  group_by(criterion) %>%
  summarise(n = n(),
            estimate = cov(x, y) / var(x),
            std.error = sd(y - estimate * x) / sqrt(n))
## # A tibble: 2 x 4
##   criterion     n estimate std.error
##   <chr>     <int>    <dbl>     <dbl>
## 1 x >= 0       48    1.02      0.136
## 2 y >= 0       47    0.356     0.110

The OLS estimate among observations with $x_i \ge 0$ approximates the true value $\beta = 1$ well. However, the estimate among observations with $y_i \ge 0$ is much smaller than one. We can confirm this visually:
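One way to draw this comparison is to plot the selected observations under each criterion alongside their OLS fits. The sketch below uses ggplot2 and assumes the df built above:

```r
library(ggplot2)

# Plot selected observations under each criterion, with the OLS fit (solid)
# and the true line y = x (dashed). Assumes `df` from above.
df %>%
  filter(selected) %>%
  ggplot(aes(x, y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = 'lm', se = FALSE) +
  geom_abline(slope = 1, intercept = 0, linetype = 'dashed') +
  facet_wrap(~ criterion)
```

The fitted line in the y >= 0 panel is visibly flatter than the dashed 45-degree line, while the x >= 0 panel's fit tracks it closely.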

What’s going on? Why do we get biased OLS estimates of $\beta$ among observations with $y_i \ge 0$ but not among observations with $x_i \ge 0$?

The key is to think about the errors $u_i$ in each case. Since the $x_i$ and $u_i$ are independent, selecting observations with $x_i \ge 0$ leaves the distributions of the $u_i$ unchanged: they still have zero means, equal variances, and zero correlations with the $x_i$. Thus, the Gauss-Markov assumptions still hold and we still obtain unbiased OLS estimates of $\beta$.

In contrast, the $x_i$ and $u_i$ are negatively correlated among observations with $y_i \ge 0$. To see why, notice that if $y_i = x_i + u_i$ then $y_i \ge 0$ if and only if $x_i \ge -u_i$. So if $x_i$ is low then $u_i$ must be high (and vice versa) for the observation to be selected. Thus, among selected observations, we have $$u_i = \rho x_i + \varepsilon_i,$$ where $\rho < 0$ indexes (and, in this case, equals) the correlation between the $x_i$ and $u_i$, and where the residuals $\varepsilon_i$ are uncorrelated with the $x_i$. Our DGP then becomes $$y_i = (\beta + \rho) x_i + \varepsilon_i.$$ The $\varepsilon_i$ have equal variances (equal to $1 + \rho^2$ in this case) and, again, are uncorrelated with the $x_i$. Therefore, the OLS estimate $$\hat{\rho} = \frac{\operatorname{Cov}(u, x)}{\operatorname{Var}(x)}$$ of $\rho$ is unbiased¹, and for our toy data equals $\hat{\rho} \approx -0.644$ among observations with $y_i \ge 0$. Subtracting $\hat{\rho}$ from $\hat{\beta}$ then gives $\hat{\beta} - \hat{\rho} \approx 0.356 - (-0.644) = 1$, recovering the true value $\beta = 1$.
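Because this is simulated data, we observe the $u_i$ and can check this decomposition directly on the toy data (using the df built above):

```r
# Among observations selected by y >= 0: beta_hat absorbs the bias rho_hat,
# and subtracting rho_hat recovers beta. Assumes `df` from above.
df %>%
  filter(selected, criterion == 'y >= 0') %>%
  summarise(beta_hat  = cov(x, y) / var(x),
            rho_hat   = cov(u, x) / var(x),
            corrected = beta_hat - rho_hat)
```

Here corrected equals one identically, because $y - u = x$ and so $\hat{\beta} - \hat{\rho} = \operatorname{Cov}(x, x)/\operatorname{Var}(x) = 1$.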

The table below reports 95% confidence intervals for $\hat{\beta}$, $\hat{\rho}$, and $\hat{\beta} - \hat{\rho}$, estimated by simulating the DGP $y_i = x_i + u_i$ described above 100 times. The table confirms that the OLS estimate $\hat{\beta}$ of $\beta = 1$ is unbiased among observations with $x_i \ge 0$ but biased negatively among observations with $y_i \ge 0$.

| Observations | $\hat{\beta}$ | $\hat{\rho}$ | $\hat{\beta} - \hat{\rho}$ |
|---|---|---|---|
| All | 1.005 ± 0.002 | 0.005 ± 0.002 | 1.000 ± 0.000 |
| With $x_i \ge 0$ | 1.001 ± 0.004 | 0.001 ± 0.004 | 1.000 ± 0.000 |
| With $y_i \ge 0$ | 0.547 ± 0.003 | -0.453 ± 0.003 | 1.000 ± 0.000 |
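The simulation behind this table can be sketched as follows (the helper name simulate_draw is mine, and the interval half-widths use a normal approximation):

```r
library(dplyr)
library(purrr)

# One simulation: draw a fresh sample, apply each criterion, and compute
# beta_hat and rho_hat among the selected observations.
# ('TRUE' is a criterion that selects every observation.)
simulate_draw <- function(n = 100) {
  tibble(x = rnorm(n), u = rnorm(n), y = x + u) %>%
    tidyr::crossing(criterion = c('TRUE', 'x >= 0', 'y >= 0')) %>%
    rowwise() %>%
    mutate(selected = eval(parse(text = criterion))) %>%
    ungroup() %>%
    filter(selected) %>%
    group_by(criterion) %>%
    summarise(beta_hat = cov(x, y) / var(x),
              rho_hat  = cov(u, x) / var(x),
              .groups = 'drop')
}

# Run 100 simulations and summarise each estimate as mean +/- 1.96 standard errors
set.seed(0)
map_dfr(1:100, ~ simulate_draw()) %>%
  group_by(criterion) %>%
  summarise(across(c(beta_hat, rho_hat),
                   list(mean = mean, halfwidth = ~ 1.96 * sd(.x) / sqrt(n()))))
```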

The estimate $\hat{\beta}$ always differs from $\beta$ by $\hat{\rho}$, which is significantly non-zero among observations with $y_i \ge 0$. However, this pattern is not useful empirically because we generally don’t observe the $u_i$ and so can’t estimate $\hat{\rho}$ to back out the true value $\beta = \hat{\beta} - \hat{\rho}$. Instead, we may use the Heckman correction to adjust for the bias introduced through non-random selection.
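As a rough sketch of what that correction looks like in R, assuming the sampleSelection package (and a selection equation selected ~ x that is my choice for this toy setup, not part of the original analysis):

```r
library(sampleSelection)

# Heckman two-step: a probit of selection on x yields the inverse Mills
# ratio, which the second-stage regression of y on x (among selected
# observations) includes as an extra regressor. Assumes `df` from above.
heckman_fit <- heckit(selected ~ x, y ~ x,
                      data = filter(df, criterion == 'y >= 0'))
summary(heckman_fit)
```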

In empirical settings, selecting observations with $x_i \ge 0$ may lead to biased estimates when (i) there is heterogeneity in the relationship between $y_i$ and $x_i$ across observations $i$, and (ii) OLS is used to estimate an average treatment effect.² In particular, if the $x_i$ are correlated with the observation-specific treatment effects then restricting to observations with $x_i \ge 0$ changes the distribution, and hence the mean, of those effects non-randomly.
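As a toy illustration of this point (the data frame het and the coefficient name beta_i are mine), give each observation its own slope $\beta_i = 1 + 0.5 x_i$, so that effects are correlated with the $x_i$, and compare the OLS estimates with and without the restriction:

```r
library(dplyr)

set.seed(0)
n <- 10000
het <- tibble(x = rnorm(n),
              beta_i = 1 + 0.5 * x,      # observation-specific effects, correlated with x
              y = beta_i * x + rnorm(n))

# OLS slope among all observations vs. among those with x >= 0.
# The average effect over all observations is E[beta_i] = 1, but restricting
# to x >= 0 shifts the distribution of the beta_i and inflates the estimate.
c(all   = with(het, cov(x, y) / var(x)),
  x_pos = with(filter(het, x >= 0), cov(x, y) / var(x)))
```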


  1. We can rewrite $\varepsilon_i = \alpha + (\varepsilon_i - \alpha)$, where $\alpha$ is the mean of the $\varepsilon_i$, and where the $(\varepsilon_i - \alpha)$ have zero means, equal variances, and zero correlations with the $x_i$. ↩︎

  2. Thanks to Shakked for pointing this out. ↩︎