Suppose we have data generated by the process $y_i = \beta x_i + u_i$, where the $u_i$ are random errors with zero means, equal variances, and zero correlations with the $x_i$. This data generating process (DGP) satisfies the Gauss-Markov assumptions, so we can obtain an unbiased estimate $\hat\beta$ of the coefficient $\beta$ using ordinary least squares (OLS).
Now suppose we restrict our data to observations with $x_i \ge 0$ or $y_i \ge 0$. How will these restrictions change $\hat\beta$?
To investigate, let’s create some toy data:
library(dplyr)
n <- 100
set.seed(0)
df <- tibble(x = rnorm(n), u = rnorm(n), y = x + u)
Here the $x_i$ and $u_i$ are standard normal random variables, and $y_i = x_i + u_i$ for each observation $i$. Thus $\beta = 1$. The OLS estimate of $\beta$ is $\hat\beta = \operatorname{Cov}(x, y) / \operatorname{Var}(x)$, where $x$ and $y$ are data vectors, $\operatorname{Cov}$ is the covariance operator, and $\operatorname{Var}$ is the variance operator. For these data, we have
cov(df$x, df$y) / var(df$x)
## [1] 1.138795
as our estimate of $\beta$ with no selection.
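As a quick check (not part of the original analysis), the same slope falls out of base R's lm(), which fits the intercept-included regression that the covariance formula above summarizes:

coef(lm(y ~ x, data = df))[['x']]  # same slope as cov(x, y) / var(x)
## [1] 1.138795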
Next, let’s introduce our selection criteria:
df <- df %>%
  tidyr::crossing(criterion = c('x >= 0', 'y >= 0')) %>%
  rowwise() %>% # eval() is annoying to vectorise
  mutate(selected = eval(parse(text = criterion))) %>%
  ungroup()
df
## # A tibble: 200 x 5
##        x       u      y criterion selected
##    <dbl>   <dbl>  <dbl> <chr>     <lgl>
##  1 -2.22 -0.0125 -2.24  x >= 0    FALSE
##  2 -2.22 -0.0125 -2.24  y >= 0    FALSE
##  3 -1.56 -1.12   -2.68  x >= 0    FALSE
##  4 -1.56 -1.12   -2.68  y >= 0    FALSE
##  5 -1.54  0.577  -0.963 x >= 0    FALSE
##  6 -1.54  0.577  -0.963 y >= 0    FALSE
##  7 -1.44 -1.39   -2.83  x >= 0    FALSE
##  8 -1.44 -1.39   -2.83  y >= 0    FALSE
##  9 -1.43 -0.543  -1.97  x >= 0    FALSE
## 10 -1.43 -0.543  -1.97  y >= 0    FALSE
## # … with 190 more rows
Now `df` contains two copies of each observation—one for each selection criterion—and an indicator for whether the observation is selected by each criterion.

We can use `df` to estimate OLS coefficients and their standard errors among observations with $x_i \ge 0$ and among observations with $y_i \ge 0$:
df %>%
  filter(selected) %>%
  group_by(criterion) %>%
  summarise(n = n(),
            estimate = cov(x, y) / var(x),
            std.error = sd(y - estimate * x) / sqrt(n))
## # A tibble: 2 x 4
##   criterion     n estimate std.error
##   <chr>     <int>    <dbl>     <dbl>
## 1 x >= 0       48    1.02      0.136
## 2 y >= 0       47    0.356     0.110
The OLS estimate among observations with $x_i \ge 0$ approximates the true value $\beta = 1$ well.
However, the estimate among observations with $y_i \ge 0$ is much smaller than one.
We can confirm this visually:
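A minimal sketch of such a plot, using ggplot2 (an added dependency here, not necessarily how the original figure was drawn), overlays OLS fits on the selected observations under each criterion:

library(ggplot2)

df %>%
  filter(selected) %>%
  ggplot(aes(x, y)) +
  geom_point() +                            # selected observations only
  geom_smooth(method = 'lm', se = FALSE) +  # OLS fit within each criterion
  facet_wrap(~ criterion)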
What’s going on? Why do we get biased OLS estimates of $\beta$ among observations with $y_i \ge 0$ but not among observations with $x_i \ge 0$?
The key is to think about the errors in each case. Since the $x_i$ and $u_i$ are independent, selecting observations with $x_i \ge 0$ leaves the distributions of the $u_i$ unchanged—they still have zero means, equal variances, and zero correlations with the $x_i$. Thus, the Gauss-Markov assumptions still hold and we still obtain unbiased OLS estimates of $\beta$.
In contrast, the $x_i$ and $u_i$ are negatively correlated among observations with $y_i \ge 0$. To see why, notice that if $y_i = x_i + u_i$ then $y_i \ge 0$ if and only if $u_i \ge -x_i$. So if $x_i$ is low then $u_i$ must be high (and vice versa) for the observation to be selected. Thus, among selected observations, we have $u_i = \gamma x_i + v_i$, where $\gamma$ indexes (and, in this case, equals) the correlation between the $x_i$ and $u_i$, and where the residuals $v_i$ are uncorrelated with the $x_i$. Our DGP then becomes $y_i = (\beta + \gamma) x_i + v_i$. The $v_i$ have equal variances and, again, are uncorrelated with the $x_i$. Therefore, the OLS estimate of $\beta + \gamma$ is unbiased¹, and for our toy data equals $0.356$ among observations with $y_i \ge 0$. Subtracting $\gamma$ from $\beta + \gamma$ then gives $\beta$, recovering the true value $\beta = 1$.
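Since the toy data include the $u_i$, we can check this decomposition directly. A sketch (not part of the original output) that estimates $\gamma$ among selected observations and backs out $\beta$:

df %>%
  filter(selected) %>%
  group_by(criterion) %>%
  summarise(estimate = cov(x, y) / var(x),  # estimates beta + gamma
            gamma = cov(x, u) / var(x),     # approx. zero when selecting on x
            beta = estimate - gamma)        # recovers beta = 1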
The table below reports 95% confidence intervals for $\widehat{\beta + \gamma}$, $\hat\gamma$, and $\widehat{\beta + \gamma} - \hat\gamma$, estimated by simulating the DGP described above 100 times. The table confirms that the OLS estimate of $\beta$ is unbiased among observations with $x_i \ge 0$ but biased negatively among observations with $y_i \ge 0$.
| Observations | $\widehat{\beta + \gamma}$ | $\hat\gamma$ | $\widehat{\beta + \gamma} - \hat\gamma$ |
|---|---|---|---|
| All | 1.005 ± 0.002 | 0.005 ± 0.002 | 1.000 ± 0.000 |
| With $x_i \ge 0$ | 1.001 ± 0.004 | 0.001 ± 0.004 | 1.000 ± 0.000 |
| With $y_i \ge 0$ | 0.547 ± 0.003 | -0.453 ± 0.003 | 1.000 ± 0.000 |
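The simulation behind the table takes only a few lines. Here is a sketch of one way to run it (the `simulate_once` helper and the purrr dependency are additions here, not necessarily the original implementation); the ± values above are 95% confidence interval half-widths around means like those computed below:

simulate_once <- function(n = 100) {
  # One draw from the DGP, with the OLS slope (estimating beta + gamma) and
  # gamma = Cov(x, u) / Var(x) computed under each selection rule
  x <- rnorm(n)
  u <- rnorm(n)
  y <- x + u
  purrr::map_df(
    list(`All` = rep(TRUE, n), `x >= 0` = x >= 0, `y >= 0` = y >= 0),
    function(s) tibble(estimate = cov(x[s], y[s]) / var(x[s]),
                       gamma = cov(x[s], u[s]) / var(x[s])),
    .id = 'observations'
  )
}

purrr::map_df(1:100, ~ simulate_once()) %>%
  group_by(observations) %>%
  summarise(estimate = mean(estimate),
            gamma = mean(gamma),
            difference = mean(estimate - gamma))  # exactly one in every run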
The estimate $\widehat{\beta + \gamma}$ always differs from $\beta$ by $\gamma$, which is significantly non-zero among observations with $y_i \ge 0$. However, this pattern is not useful empirically because we generally don’t observe the $u_i$ and so can’t estimate $\gamma$ to back out the true value of $\beta$. Instead, we may use the Heckman correction to adjust for the bias introduced through non-random selection.
In empirical settings, selecting observations with $y_i \ge 0$ may lead to biased estimates when (i) there is heterogeneity in the relationship between the $y_i$ and $x_i$ across observations $i$, and (ii) OLS is used to estimate an average treatment effect.² In particular, if the $y_i$ are correlated with the observation-specific treatment effects $\beta_i$, then restricting to observations with $y_i \ge 0$ changes the distribution, and hence the mean, of those effects non-randomly.