Introducing nberwp

Today I published nberwp, an R package providing data on NBER working papers published between 1973 and 2018. It can be installed from GitHub via remotes:

library(remotes)
install_github('bldavies/nberwp')

nberwp provides a data frame papers, each row describing a unique working paper:

papers

## # A tibble: 25,413 x 4
##    number  year month title                                                     
##     <int> <int> <int> <chr>                                                     
##  1      1  1973     6 Education, Information, and Efficiency                    
##  2      2  1973     6 Hospital Utilization: An Analysis of SMSA Differences in …
##  3      3  1973     6 Error Components Regression Models and Their Applications 
##  4      4  1973     7 Human Capital Life Cycle of Earnings Models: A Specific S…
##  5      5  1973     7 A Life Cycle Family Model                                 
##  6      6  1973     7 A Review of Cyclical Indicators for the United States: Pr…
##  7      7  1973     8 The Definition and Impact of College Quality              
##  8      8  1973     9 Multinational Firms and the Factor Intensity of Trade     
##  9      9  1973     9 From Age-Earnings Profiles to the Distribution of Earning…
## 10     10  1973     9 Monte Carlo for Robust Regression: The Swindle Unmasked   
## # … with 25,403 more rows

number uniquely identifies working papers by their positions in the series, while year and month capture papers’ publication dates. The chart below uses these dates to show the NBER catalogue’s expansion.

title facilitates simple text mining, such as determining which words are used in working paper titles most frequently:

library(tidytext)

words <- papers %>%
  unnest_tokens(word, title) %>%
  anti_join(get_stopwords()) %>%
  filter(nchar(gsub('[a-z.]', '', word)) == 0) %>%
  distinct(number, word)

words %>%
  count(word, sort = T)

## # A tibble: 11,636 x 2
##    word         n
##    <chr>    <int>
##  1 evidence  2615
##  2 policy    1350
##  3 market    1322
##  4 effects   1193
##  5 trade     1052
##  6 capital    979
##  7 labor      940
##  8 economic   910
##  9 u.s        882
## 10 health     875
## # … with 11,626 more rows

Many papers discuss capital and labour markets, and the effects of public policies. The word “evidence” appears in twice as many titles as any other (non-stop) word, which I suspect reflects the growing use of the “<Issue>: Evidence from <context>” title format:

The NBER’s RePEc index, from which I derive papers, also contains data linking papers to their authors. I plan to include these data in a future version of nberwp once I’ve disambiguated authors’ names.