Today I updated the motuwp GitHub repository, which stores data on Motu working papers and their authors. I made three main changes:

First, I switched from BeautifulSoup to rvest for scraping the working paper directory, which also meant moving from Python to R. My original Python script used a bunch of regular expressions to build the list of working paper URLs, despite warnings that regular expressions and HTML generally don’t cooperate. I should have just used CSS selectors, which I now do in data.R.
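To illustrate why selecting elements beats pattern-matching raw markup, here is a minimal sketch in Python. It is not the code in data.R: it uses the standard library's `html.parser` as a stand-in for a CSS selector such as `a.paper-link`, and the class name `paper-link`, the function `paper_urls`, and the sample URLs are all hypothetical.

```python
from html.parser import HTMLParser


class PaperLinkParser(HTMLParser):
    """Collect href values from <a> tags whose class list includes a target class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class  # hypothetical class name, not Motu's real markup
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # A class attribute may hold several space-separated names.
        classes = (attr.get("class") or "").split()
        if self.target_class in classes and attr.get("href"):
            self.urls.append(attr["href"])


def paper_urls(html, target_class="paper-link"):
    """Return the hrefs of matching links, in document order."""
    parser = PaperLinkParser(target_class)
    parser.feed(html)
    return parser.urls


sample = """
<ul>
  <li><a class="paper-link" href="/wp/19-01">First paper</a></li>
  <li><a href='/wp/19-02' class="featured paper-link">Second paper</a></li>
  <li><a class="nav" href="/about">About</a></li>
</ul>
"""
print(paper_urls(sample))
```

The second link swaps the attribute order, uses single quotes, and carries an extra class. A parser handles all three variations without special cases, whereas a regex written against the first link would likely miss it.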

Second, I implemented a caching mechanism that passes information between runs of data.R. The script now queries only papers released since the last run, so adding new papers is faster and requires fewer HTTP requests.
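The idea can be sketched in a few lines of Python. This is an illustration under my own assumptions rather than a copy of data.R: it treats a paper as new when its identifier is missing from a JSON cache file, and the names `load_cache`, `update_cache`, and `fetch_paper`, the identifiers, and the file name are hypothetical.

```python
import json
from pathlib import Path


def load_cache(path):
    """Return previously saved paper records, or an empty dict on the first run."""
    cache_file = Path(path)
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    return {}


def update_cache(path, listed_ids, fetch_paper):
    """Fetch only papers missing from the cache, then save the merged records.

    listed_ids: identifiers found in the current directory listing.
    fetch_paper: callable that retrieves one paper's record (in a real
                 scraper, this is where the HTTP request would happen).
    """
    cache = load_cache(path)
    new_ids = [pid for pid in listed_ids if pid not in cache]
    for pid in new_ids:
        cache[pid] = fetch_paper(pid)
    Path(path).write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return cache, new_ids


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs without network access.
    def fake_fetch(pid):
        return {"id": pid}

    _, fetched = update_cache("papers_cache.json", ["19-01", "19-02"], fake_fetch)
    print("fetched this run:", fetched)
```

On a second run with the same listing, `new_ids` is empty and no fetches happen, which is where the saving in HTTP requests comes from.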

Third, I added working paper titles to the information collected. This lets me, for example, use tf-idf scores to characterise research areas: