Suppose we have data $\{(x_i, y_i) : i \in \{1, 2, \ldots, n\}\}$ generated by the process $y_i = \beta x_i + u_i$, where the $u_i$ are random errors with zero means, equal variances, and zero correlations with the $x_i$. This data generating process (DGP) satisfies the Gauss-Markov assumptions, so we can obtain an unbiased estimate $\hat\beta$ of the coefficient $\beta$ using ordinary least squares (OLS).
Now suppose we restrict our data to observations with $x_i \ge 0$ or $y_i \ge 0$. How will these restrictions change $\hat\beta$?
To investigate, let’s create some toy data:
library(dplyr)
n <- 100
set.seed(0)
df <- tibble(x = rnorm(n), u = rnorm(n), y = x + u)  # simulate the DGP with beta = 1
Here $x_i$ and $u_i$ are standard normal random variables, and $y_i = x_i + u_i$ for each observation $i \in \{1, 2, \ldots, 100\}$. Thus $\beta = 1$. The OLS estimate of $\beta$ is
$$\hat\beta = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)},$$
where $x = (x_1, x_2, \ldots, x_{100})$ and $y = (y_1, y_2, \ldots, y_{100})$ are data vectors, $\operatorname{Cov}$ is the covariance operator, and $\operatorname{Var}$ is the variance operator. For these data, we have
cov(df$x, df$y) / var(df$x)
## [1] 1.138795
as our estimate with no selection.
Next, let’s introduce our selection criteria:
df <- df %>%
tidyr::crossing(criterion = c('x >= 0', 'y >= 0')) %>%
rowwise() %>% # eval is annoying to vectorise
mutate(selected = eval(parse(text = criterion))) %>%
ungroup()
df
## # A tibble: 200 x 5
## x u y criterion selected
## <dbl> <dbl> <dbl> <chr> <lgl>
## 1 -2.22 -0.0125 -2.24 x >= 0 FALSE
## 2 -2.22 -0.0125 -2.24 y >= 0 FALSE
## 3 -1.56 -1.12 -2.68 x >= 0 FALSE
## 4 -1.56 -1.12 -2.68 y >= 0 FALSE
## 5 -1.54 0.577 -0.963 x >= 0 FALSE
## 6 -1.54 0.577 -0.963 y >= 0 FALSE
## 7 -1.44 -1.39 -2.83 x >= 0 FALSE
## 8 -1.44 -1.39 -2.83 y >= 0 FALSE
## 9 -1.43 -0.543 -1.97 x >= 0 FALSE
## 10 -1.43 -0.543 -1.97 y >= 0 FALSE
## # … with 190 more rows
Now `df` contains two copies of each observation (one for each selection criterion) and an indicator for whether the observation is selected under that criterion.
We can use `df` to estimate OLS coefficients and their standard errors among observations with $x_i \ge 0$ and among observations with $y_i \ge 0$:
df %>%
filter(selected) %>%
group_by(criterion) %>%
summarise(n = n(),
estimate = cov(x, y) / var(x),
std.error = sd(y - estimate * x) / sqrt(n))
## # A tibble: 2 x 4
## criterion n estimate std.error
## <chr> <int> <dbl> <dbl>
## 1 x >= 0 48 1.02 0.136
## 2 y >= 0 47 0.356 0.110
The OLS estimate among observations with $x_i \ge 0$ approximates the true value $\beta = 1$ well. However, the estimate among observations with $y_i \ge 0$ is much smaller than one.
We can confirm this visually:
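A minimal sketch of such a plot using ggplot2 (an assumption; the original figure may use different packages or aesthetics) facets the selected observations by criterion, with the OLS fit in each panel and the true relationship $y = x$ shown dashed:

library(ggplot2)

df %>%
  filter(selected) %>%
  ggplot(aes(x, y)) +
  geom_point(alpha = 0.5) +                                     # selected observations only
  geom_abline(intercept = 0, slope = 1, linetype = 'dashed') +  # true relationship y = x
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE) +     # OLS fit within each subsample
  facet_wrap(~ criterion)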
What’s going on? Why do we get biased OLS estimates of $\beta$ among observations with $y_i \ge 0$ but not among observations with $x_i \ge 0$?
The key is to think about the errors $u_i$ in each case. Since the $x_i$ and $u_i$ are independent, selecting observations with $x_i \ge 0$ leaves the distributions of the $u_i$ unchanged: they still have zero means, equal variances, and zero correlations with the $x_i$. Thus, the Gauss-Markov assumptions still hold and we still obtain unbiased OLS estimates of $\beta$.
In contrast, the $x_i$ and $u_i$ are negatively correlated among observations with $y_i \ge 0$. To see why, notice that if $y_i = x_i + u_i$ then $y_i \ge 0$ if and only if $x_i \ge -u_i$. So if $x_i$ is low then $u_i$ must be high (and vice versa) for the observation to be selected. Thus, among selected observations, we have
$$u_i = \rho x_i + \varepsilon_i,$$
where $\rho < 0$ indexes (and, in this case, equals) the correlation between the $x_i$ and $u_i$, and where the residuals $\varepsilon_i$ are uncorrelated with the $x_i$. Our DGP then becomes
$$y_i = (\beta + \rho) x_i + \varepsilon_i.$$
The $\varepsilon_i$ have equal variances (equal to $1 + \rho^2$ in this case) and, again, are uncorrelated with the $x_i$. Therefore, the OLS estimate
$$\hat\rho = \frac{\operatorname{Cov}(u, x)}{\operatorname{Var}(x)}$$
of $\rho$ is unbiased1, and for our toy data equals $\hat\rho \approx -0.644$ among observations with $y_i \ge 0$. Subtracting $\hat\rho$ from $\hat\beta$ then gives $\hat\beta - \hat\rho \approx 0.356 - (-0.644) = 1$, recovering the true value $\beta = 1$.
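As a check, we can compute $\hat\rho$ within each selected subsample. This works only because our toy data include the errors $u_i$, which we would never observe in practice; a minimal sketch:

df %>%
  filter(selected) %>%
  group_by(criterion) %>%
  summarise(rho_hat = cov(u, x) / var(x),    # near zero under x >= 0, about -0.644 under y >= 0
            beta_hat = cov(x, y) / var(x),
            corrected = beta_hat - rho_hat)  # exactly one: y = x + u implies beta_hat = 1 + rho_hat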
The table below reports 95% confidence intervals for $\hat\beta$, $\hat\rho$, and $(\hat\beta - \hat\rho)$, estimated by simulating the DGP $y_i = x_i + u_i$ described above 100 times. The table confirms that the OLS estimate $\hat\beta$ of $\beta = 1$ is unbiased among observations with $x_i \ge 0$ but biased downward among observations with $y_i \ge 0$.
| Observations | $\hat\beta$ | $\hat\rho$ | $\hat\beta - \hat\rho$ |
|---|---|---|---|
| All | 1.005 ± 0.002 | 0.005 ± 0.002 | 1.000 ± 0.000 |
| With $x_i \ge 0$ | 1.001 ± 0.004 | 0.001 ± 0.004 | 1.000 ± 0.000 |
| With $y_i \ge 0$ | 0.547 ± 0.003 | -0.453 ± 0.003 | 1.000 ± 0.000 |
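This table can be reproduced along the following lines. The sketch below assumes each run redraws 100 observations and that the intervals are normal approximations based on the simulated estimates; the original simulation may differ in these details:

library(purrr)

estimates <- map_dfr(1:100, function(run) {
  tibble(x = rnorm(n), u = rnorm(n), y = x + u) %>%
    tidyr::crossing(criterion = c('All', 'x >= 0', 'y >= 0')) %>%
    rowwise() %>%
    mutate(selected = criterion == 'All' || eval(parse(text = criterion))) %>%
    ungroup() %>%
    filter(selected) %>%
    group_by(criterion) %>%
    summarise(beta_hat = cov(x, y) / var(x),  # OLS estimate of beta
              rho_hat = cov(u, x) / var(x),   # OLS estimate of rho
              .groups = 'drop')
})

estimates %>%
  group_by(criterion) %>%
  summarise(across(c(beta_hat, rho_hat),
                   list(mean = mean,
                        ci = ~ 1.96 * sd(.x) / sqrt(n()))))  # 95% half-widths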
The estimate $\hat\beta$ always differs from $\beta$ by $\hat\rho$, which is significantly non-zero among observations with $y_i \ge 0$. However, this pattern is not useful empirically because we generally don’t observe the $u_i$, and so can’t compute $\hat\rho$ to back out the true value $\beta = \hat\beta - \hat\rho$. Instead, we may use the Heckman correction to adjust for the bias introduced through non-random selection.
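In our toy example, the selection on $y_i \ge 0$ truncates the dependent variable at zero, so one concrete adjustment is truncated regression by maximum likelihood. A hypothetical sketch using the truncreg package (an assumption; the analysis above does not fit this model):

library(truncreg)

df %>%
  filter(criterion == 'y >= 0', selected) %>%
  truncreg(y ~ x, data = ., point = 0, direction = 'left') %>%  # y observed only when y >= 0
  summary()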
In empirical settings, selecting observations with $x_i \ge 0$ may lead to biased estimates when (i) there is heterogeneity in the relationship between $y_i$ and $x_i$ across observations $i$, and (ii) OLS is used to estimate an average treatment effect.2 In particular, if the $x_i$ are correlated with the observation-specific treatment effects, then restricting to observations with $x_i \ge 0$ changes the distribution, and hence the mean, of those effects non-randomly.
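To illustrate, here is a hypothetical DGP in which the observation-specific slopes $b_i$ are correlated with the $x_i$; restricting to $x_i \ge 0$ then raises the mean treatment effect among the selected observations:

set.seed(1)
het <- tibble(x = rnorm(n),
              b = 1 + 0.5 * x,       # hypothetical observation-specific treatment effects
              y = b * x + rnorm(n))

het %>%
  summarise(mean_effect_all = mean(b),               # about one
            mean_effect_selected = mean(b[x >= 0]))  # larger, since b rises with x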