Today I updated the motuwp GitHub repository, which stores data on Motu working papers and their authors. I made three main changes:
First, I switched from BeautifulSoup to rvest for scraping the working paper directory.
My original Python script used a bunch of regex commands to build the list of working paper URLs, despite warnings that regular expressions and HTML generally don’t cooperate.
I should have just used CSS selectors, which I now do using data.R.
Second, I implemented a caching mechanism for passing information between runs of data.R.
The script queries only papers released since the last run, so adding new papers is faster and requires fewer HTTP requests.
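The idea behind such a cache can be sketched roughly as follows. This is an illustration only, written in Python (the language of the original scraper) rather than the R of data.R, and it is not the code in the repository. The JSON file layout and the names `load_cache`, `update_cache`, and `fetch_paper` are hypothetical, and it assumes each paper record carries a `url` field that identifies it.

```python
import json
from pathlib import Path


def load_cache(path):
    """Return the cached paper records, or an empty list on the first run."""
    path = Path(path)
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def update_cache(path, listed_urls, fetch_paper):
    """Fetch only the papers missing from the cache, then save the merged list.

    listed_urls: URLs currently shown in the working paper directory.
    fetch_paper: hypothetical callable that requests one paper page and
                 returns a dict containing at least a "url" key.
    """
    papers = load_cache(path)
    seen = {paper["url"] for paper in papers}

    # Papers already in the cache are skipped, so no request is made for them.
    new_urls = [url for url in listed_urls if url not in seen]
    for url in new_urls:
        papers.append(fetch_paper(url))

    Path(path).write_text(json.dumps(papers, indent=2), encoding="utf-8")
    return papers, new_urls
```

On the first run every listed paper is fetched. On later runs only URLs absent from the cache file trigger a request, which is where the saving in HTTP requests comes from.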
Third, I added working paper titles to the information collected. This allows me to, for example, use tf-idf scores to characterise research areas: