nberwp is now on CRAN

nberwp, an R package providing information on NBER working papers and their authors, is now available on CRAN. The current version (1.0.0) covers 29,434 papers published between June 1973 and June 2021. It can be installed via

install.packages('nberwp')

nberwp has evolved since its initial release on GitHub nearly two years ago. This post describes some of the main changes.

More papers

The first version of nberwp covered papers published between June 1973 and December 2018. The updated version adds papers published between January 2019 and June 2021, allowing one to visualize the spike in publications when COVID-19 emerged:

library(dplyr)
library(ggplot2)
library(nberwp)

papers %>%
  count(Quarter = year + (ceiling(month / 3) - 1) / 4, name = 'New papers') %>%
  ggplot(aes(Quarter, `New papers`)) +
  geom_line() +
  labs(title = 'COVID-19 induced a spike in NBER publications',
       subtitle = 'New NBER working papers, by quarter')

nberwp now also includes papers published in the historical and technical working paper series. The historical series contains 136 papers focused on (American) economic history, and the technical series contains 337 papers focused on analytical and empirical methods.

The working paper data exclude duplicates (e.g., papers published in multiple series) but include revisions, which capture continued development of (and collaboration on) research ideas that I believe should be acknowledged.

Program affiliations

The NBER organizes its research into programs, each of which “corresponds loosely to a traditional field of study within economics.” nberwp now provides a table of paper-program correspondences

paper_programs

## # A tibble: 53,996 x 2
##    paper program
##    <chr> <chr>  
##  1 w0074 EFG    
##  2 w0087 IFM    
##  3 w0087 ITI    
##  4 w0107 PE     
##  5 w0116 PE     
##  6 w0117 LS     
##  7 w0129 HE     
##  8 w0131 IFM    
##  9 w0131 ITI    
## 10 w0134 HE     
## # … with 53,986 more rows

as well as a table of program descriptions:

programs

## # A tibble: 21 x 3
##    program program_desc                        program_category   
##    <chr>   <chr>                               <chr>              
##  1 AG      Economics of Aging                  Micro              
##  2 AP      Asset Pricing                       Finance            
##  3 CF      Corporate Finance                   Finance            
##  4 CH      Children                            Micro              
##  5 DAE     Development of the American Economy Micro              
##  6 DEV     Development Economics               Micro              
##  7 ED      Economics of Education              Micro              
##  8 EEE     Environment and Energy Economics    Micro              
##  9 EFG     Economic Fluctuations and Growth    Macro/International
## 10 HC      Health Care                         Micro              
## # … with 11 more rows

The program_category column categorizes programs similarly to Chari and Goldsmith-Pinkham (2017). On average, each paper is affiliated with 1.83 programs and each program has 2,571 affiliated papers.

One use of the paper-program correspondences is to analyze the intellectual overlaps among programs. For example, the table below presents the six pairs of programs with the most-overlapping sets of affiliated papers, with overlap sizes measured by Jaccard indices. The top index of 0.29 means that about 29% of the papers affiliated with the Children or Economics of Education programs are affiliated with both.

Program 1	Program 2	Jaccard index
Children	Economics of Education	0.29
Health Care	Health Economics	0.29
International Finance and Macroeconomics	International Trade and Investment	0.26
Economic Fluctuations and Growth	Monetary Economics	0.23
Asset Pricing	Corporate Finance	0.17
Labor Studies	Public Economics	0.15

Authorships

nberwp now contains information about working papers’ (co-)authors:

authors

## # A tibble: 15,437 x 4
##    author  name             user_nber        user_repec
##    <chr>   <chr>            <chr>            <chr>     
##  1 w0001.1 Finis Welch      finis_welch      <NA>      
##  2 w0002.1 Barry R Chiswick barry_chiswick   pch425    
##  3 w0003.1 Swarnjit S Arora swarnjit_arora   <NA>      
##  4 w0004.1 Lee A Lillard    <NA>             pli669    
##  5 w0005.1 James P Smith    james_smith      psm28     
##  6 w0006.1 Victor Zarnowitz victor_zarnowitz <NA>      
##  7 w0007.1 Lewis C Solmon   <NA>             <NA>      
##  8 w0008.1 Merle Yahr Weiss <NA>             <NA>      
##  9 w0008.2 Robert E Lipsey  robert_lipsey    pli259    
## 10 w0010.1 Paul W Holland   <NA>             <NA>      
## # … with 15,427 more rows

The author column contains unique author identifiers, constructed by concatenating each author’s debut paper and their position on that paper’s (alphabetized) byline. This construction ensures that author values do not change when I add newly published papers to the data. The user_nber column contains authors’ usernames on the NBER website; the user_repec column contains authors’ RePEc IDs. Some authors do not have an NBER username or RePEc ID, indicated by NA values in the appropriate column.

nberwp also provides a table of paper-author correspondences:

paper_authors

## # A tibble: 67,090 x 2
##    paper author 
##    <chr> <chr>  
##  1 w0001 w0001.1
##  2 w0002 w0002.1
##  3 w0003 w0003.1
##  4 w0004 w0004.1
##  5 w0005 w0005.1
##  6 w0006 w0006.1
##  7 w0007 w0007.1
##  8 w0008 w0008.1
##  9 w0008 w0008.2
## 10 w0009 w0004.1
## # … with 67,080 more rows

This table can be used to construct a co-authorship network among the 15,437 authors identified in nberwp. This network currently contains 38,968 edges, implying that 0.03% of pairs co-authored at least one working paper during the period covered by the data. Authors in the network have a mean degree of 5.05.

I used previous versions of nberwp in blog posts on triadic closure and female representation. These posts assumed that authors were uniquely identified by their full names. This assumption was problematic: different authors could share the same name, or a single author could publish under many names (e.g., before and after marriage). The updated version of nberwp builds on previous efforts to disambiguate authors’ names—namely cross-referencing against NBER usernames, RePEc IDs, common co-authorships, and name edit distances—in three ways:

using paper-program correspondences to identify authors who have similar names and published papers in similar programs, and so are likely to be the same person;
manually merging (or splitting) authors whom I determine to be the same (or distinct) based on their personal or academic websites;
including an author ID variable (author) rather than relying on names for unique identification.

These enhancements support cleaner analyses of (co-)authorship behavior. Nonetheless the data may still contain errors—if you find any, let me know by adding an issue on GitHub.