Ben Davies

Armchair Expert episodes

Fri, 26 Jul 2024 00:00:00 +0000

Armchair Expert is a podcast hosted by Dax Shepard and Monica Padman. They interview celebrities, scientists, and other public figures. They also publish some subsidiary “shows” via the podcast’s main feed. The table below counts episodes by show:

Show	Episodes
Interviews	655
Armchair Anonymous	98
Flightless Bird	94
Synced	52
Armchaired & Dangerous	21
Race to 35	12
We Are Supported By…	12
Monica & Jess Love Boys	11
Race to 270	11
Yearbook	8
Total	974

I store episodes’ metadata in the R package ArmchairExpert. It contains a single table, episodes, with a row for each episode and seven columns:

id: Episode ID on Spotify
date: Episode release date
title: Episode title
show: Show to which episode belongs
number: Within-show episode number
duration: Episode length (in seconds)
description: Episode description

The first Armchair Expert episode—an interview with Dax’s wife, Kristen Bell—was released in February 2018. The earliest show (Monica & Jess Love Boys) started in February 2020 and ended two months later. Other shows have been and gone, and three are ongoing:

My favorite show is Flightless Bird. It’s hosted by David Farrier, a fellow Kiwi who reflects on living in the USA.

The median episode is about 93 minutes long. But most interviews are longer and most shows are shorter:

Most interviews end with a “fact check,” during which Dax and Monica discuss the interview and their lives. Fact checks can be as long as the interviews themselves.

Dax and Monica have interviewed some people many times. They’ve interviewed Kristen Bell five times, and David Sedaris and Sanjay Gupta four times each.¹ My favorite interviews are with Esther Perel, Wendy Mogel, and Terry Crews.

Adam Grant, Ashton Kutcher, Jason Bateman, Vincent D’Onofrio, and Yuval Noah Harari have been interviewed three times each. ↩︎

Decomposing matrices of pairwise minima

Sun, 05 May 2024 00:00:00 +0000

Let $A$ be the $n\times n$ matrix with ${ij}^\text{th}$ entry $A_{ij}=\min\{i,j\}$. From a previous post, we know $A$ has a tridiagonal inverse $A^{-1}$ with ${ij}^\text{th}$ entry¹ $$\left[A^{-1}\right]_{ij}=\begin{cases} 2 & \text{if}\ i=j<n \\ 1 & \text{if}\ i=j=n \\ -1 & \text{if}\ \lvert i-j\rvert=1 \\ 0 & \text{otherwise}. \end{cases}$$ For example, if $n=4$ then $$A=\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 \\ 1 & 2 & 3 & 3 \\ 1 & 2 & 3 & 4 \end{bmatrix}$$ has inverse $$A^{-1}=\begin{bmatrix} 2 & -1 & 0 & 0 \\ -1 & 2 & -1 & 0 \\ 0 & -1 & 2 & -1 \\ 0 & 0 & -1 & 1 \end{bmatrix}$$

We can use our knowledge of $A^{-1}$ to eigendecompose $A$. To see how, let $\{(\lambda_j,v_j)\}_{j=1}^n$ be the eigenpairs of $A^{-1}$. Yueh (2005, Theorem 1) shows that the eigenvector $v_j\in\mathbb{R}^n$ corresponding to the $j^\text{th}$ eigenvalue $$\lambda_j=2\left(1+\cos\left(\frac{2j\pi}{2n+1}\right)\right)$$ has $i^\text{th}$ component $$[v_j]_i=(-1)^i\alpha\sin\left(\frac{2ij\pi}{2n+1}\right),$$ where $\alpha\in\mathbb{R}$ is an arbitrary scalar. This vector has length $$\begin{align} \lVert v_j\rVert &\equiv \sqrt{\sum_{i=1}^n\left([v_j]_i\right)^2} \\ &= \sqrt{\sum_{i=1}^n\alpha^2\sin^2\left(\frac{2ij\pi}{2n+1}\right)} \\ &= \lvert\alpha\rvert\sqrt{\frac{2n+1}{4}}, \end{align}$$ where the last equality can be verified using Wolfram Alpha and proved using complex analysis. So choosing $\alpha=2/\sqrt{2n+1}$ ensures that the eigenvectors $v_1,v_2,\ldots,v_n$ of $A^{-1}$ have unit length. Then, by the spectral theorem, these vectors form an orthonormal basis for $\mathbb{R}^n$. As a result, the $n\times n$ matrix $$V=\begin{bmatrix} v_1 & v_2 & \cdots & v_n\end{bmatrix}$$ with ${ij}^\text{th}$ entry $V_{ij}=[v_j]_i$ is orthogonal. Moreover, letting $\Lambda$ be the $n\times n$ diagonal matrix with ${ii}^\text{th}$ entry $\Lambda_{ii}=\lambda_i$ yields the eigendecomposition $$\begin{align} A^{-1} &= V\Lambda V^T \\ &= \sum_{j=1}^n\lambda_jv_jv_j^T \end{align}$$ of $A^{-1}$. It follows from the orthogonality of $V$ that $$\begin{align} A &= \left(V\Lambda V^T\right)^{-1} \\ &= V\Lambda^{-1} V^T \\ &= \sum_{j=1}^n\frac{1}{\lambda_j}v_jv_j^T \end{align}$$ is the eigendecomposition of $A$. Thus $A$ and $A^{-1}$ have the same eigenvectors, but the eigenvalues of $A$ are the reciprocated eigenvalues of $A^{-1}$.
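Here is a minimal numerical check of this decomposition in R, assuming a small, hypothetical $n=4$. The eigenvector components in the code carry an alternating sign $(-1)^i$, which the sine vectors need in order to pair with the eigenvalues $\lambda_j$ of $A^{-1}$. The variable names are illustrative.

```r
# Matrix of pairwise minima for a small, hypothetical n
n = 4
A = outer(1:n, 1:n, pmin)

# Eigenvalues of the inverse of A
j = 1:n
lambda = 2 * (1 + cos(2 * j * pi / (2 * n + 1)))

# Unit-length eigenvectors: row i is scaled by (-1)^i, column j holds v_j
alpha = 2 / sqrt(2 * n + 1)
V = alpha * (-1)^(1:n) * sin(2 * outer(1:n, j) * pi / (2 * n + 1))

# Both discrepancies should be zero up to rounding error
max(abs(crossprod(V) - diag(n)))                 # V is orthogonal
max(abs(V %*% diag(1 / lambda) %*% t(V) - A))    # V Lambda^{-1} V^T equals A
```

The same check works for other $n\ge2$ by changing the first line.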

Here’s one scenario in which this decomposition is useful: Suppose I observe data $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^n$ generated by the process $$\DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\E}{E} \DeclareMathOperator{\Var}{Var} \newcommand{\veps}{\sigma_\epsilon^2} \newcommand{\R}{\mathbb{R}} \renewcommand{\epsilon}{\varepsilon} \begin{align} y_i &= f(x_i)+\epsilon_i \\ \epsilon_i &\overset{\text{iid}}{\sim} \mathcal{N}(0,\veps), \end{align}$$ where $\{f(x)\}_{x\ge0}$ is a sample path of a standard Wiener process and where the errors $\epsilon_i$ are iid normally distributed with variance $\veps$. I use these data to estimate $f(x)$ for some $x\ge0$.² My estimator $\hat{f}(x)\equiv\E[f(x)\mid\mathcal{D}]$ has conditional variance $$\Var\left(\hat{f}(x)\mid\mathcal{D}\right)=\Var(f(x))-w^T\Sigma^{-1} w,$$ where $w\in\R^n$ is the vector with $i^\text{th}$ component $w_i=\Cov(y_i,f(x))$ and where $\Sigma\in\R^{n\times n}$ is the covariance matrix with ${ij}^\text{th}$ entry $\Sigma_{ij}=\Cov(y_i,y_j)$. If $x_i=i$ for each $i\in\{1,2,\ldots,n\}$, then we can express this matrix as the sum $$\Sigma=A+\veps I,$$ where $A$ is the $n\times n$ matrix defined above and where $I$ is the $n\times n$ identity matrix. But we know $A=V\Lambda^{-1}V^T$. We also know $I=VV^T$, since $V$ is orthogonal. It follows that $$\begin{align*} \Sigma^{-1} &= \left(V\Lambda^{-1}V^T+\veps VV^T\right)^{-1} \\ &= \left(V\left(\Lambda^{-1}+\veps I\right)V^T\right)^{-1} \\ &= V\left(\Lambda^{-1}+\veps I\right)^{-1}V^T, \end{align*}$$ where the matrix being inverted in the last line is diagonal. From this we can derive a (relatively) closed-form expression for the conditional variance of $\hat{f}(x)$ given $\mathcal{D}$.
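As a rough sanity check, the sketch below compares $V\left(\Lambda^{-1}+\sigma_\varepsilon^2 I\right)^{-1}V^T$ with a direct numerical inverse of $\Sigma=A+\sigma_\varepsilon^2 I$. The values of `n` and `sigma2_eps` are hypothetical, and the eigenvectors include the alternating sign $(-1)^i$ that makes them eigenvectors of $A^{-1}$.

```r
n = 5             # hypothetical number of observations
sigma2_eps = 0.5  # hypothetical error variance

A = outer(1:n, 1:n, pmin)
Sigma = A + sigma2_eps * diag(n)

# Eigenpairs of the inverse of A
idx = 1:n
lambda = 2 * (1 + cos(2 * idx * pi / (2 * n + 1)))
V = 2 / sqrt(2 * n + 1) * (-1)^idx * sin(2 * outer(idx, idx) * pi / (2 * n + 1))

# Inverting Sigma only requires inverting a diagonal matrix
Sigma_inv = V %*% diag(1 / (1 / lambda + sigma2_eps)) %*% t(V)

# Discrepancy from the direct inverse; should be zero up to rounding error
max(abs(Sigma_inv - solve(Sigma)))
```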

One can verify this claim by showing $AA^{-1}$ equals the identity matrix. ↩︎
I discuss this estimation problem in a recent paper. ↩︎

Delayed saving

Wed, 01 May 2024 00:00:00 +0000

Suppose I want to retire at time $T>0$. I make constant payments to a savings account that earns continuously compounded interest $r>0$. I want my retirement fund to be worth $V>0$ today (time $0$). How much bigger do my payments have to be if I delay them?

Let $X_d$ be the payments I have to make if I start saving at time $d\in[0,T]$. These payments form an annuity with value $$\frac{X_d}{r}\left(1-e^{-r(T-d)}\right)$$ at time $d$. I want this value to equal $Ve^{rd}$. So my payments must equal $$\begin{align} X_d &= \frac{r}{1-e^{-r(T-d)}}\times Ve^{rd} \\ &= \frac{rV}{e^{-rd}-e^{-rT}}. \end{align}$$ Therefore, delaying to time $d$ increases my payments by a factor of $$\frac{X_d}{X_0}=\frac{1-e^{-rT}}{e^{-rd}-e^{-rT}}.$$ The chart below shows how $X_d/X_0$ grows with the proportion of time $d/T$ I delay saving. Part of this growth comes from having less time remaining: if my savings earn no interest, then the factor $$\lim_{r\to0}\frac{X_d}{X_0}=\frac{T}{T-d}$$ equals the ratio of time until retirement and time spent saving. Raising $r$ raises $X_d/X_0$ because I forgo more opportunities to earn interest on my interest the longer I delay. This is especially true when I’m far from retiring (i.e., $T$ is large).
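The factor $X_d/X_0$ is easy to compute directly. Here is a small sketch in R; the function name `delay_factor`, its argument names, and the example numbers are my own illustrative choices.

```r
# Factor by which payments grow if saving starts at time d rather than time 0
# d: delay, t_ret: retirement time T, r: continuously compounded interest rate
delay_factor = function(d, t_ret, r) {
  if (r == 0) {
    return(t_ret / (t_ret - d))  # zero-interest limit
  }
  (1 - exp(-r * t_ret)) / (exp(-r * d) - exp(-r * t_ret))
}

# Hypothetical example: retire in 40 years, delay saving by 10 years
delay_factor(d = 10, t_ret = 40, r = 0)     # time effect only: 40 / 30
delay_factor(d = 10, t_ret = 40, r = 0.05)  # time effect plus forgone interest
```

Comparing the two calls separates the part of the growth that comes from having less time to save from the part that comes from forgone interest.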

Thanks to Michael Boskin for inspiring this post.

Learning about a changing state

Mon, 08 Jan 2024 00:00:00 +0000

I have a new paper on Bayesian learning. It extends my model of paying for precision to a setting where the unknown state changes over time. This makes the agent keep buying new information as his existing information becomes out of date. I show how his demand for information depends on whether he is myopic or forward-looking, and on the Gaussian process defining how the state evolves.

The paper stems from my research with Anirudh Sankar on how people learn across contexts. Suppose I ask you for advice, and you say “X worked for me.” But will X work for me? We’re different people with different contexts (e.g., physical and social positions). Our outcomes might be different.

Imagine there’s a function mapping contexts to outcomes. If I know this function then I can invert it, taking information generated in your context and porting it into mine. But if I don’t know the function then I can’t invert it, which makes learning from you hard. My research with Anirudh formalizes this idea: the more I know about the function mapping contexts to outcomes, the easier it is to learn across contexts.

Mathematically, learning across contexts is like learning across time: the function mapping contexts to outcomes is like a stochastic process mapping times to states. But contexts, unlike time, can have many dimensions and may not be totally orderable. Contexts are more general, and so models of learning across them can lead to more general insights. I hope to share some of those insights in the future.

Learning from correlated signals

Fri, 24 Nov 2023 00:00:00 +0000

Suppose I want to learn the value of a parameter $\theta\in\mathbb{R}$. My prior is that $\theta$ is normally distributed with variance $\sigma_0^2$. I observe $n\ge1$ signals $$\DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\E}{E} \DeclareMathOperator{\Var}{Var} \newcommand{\R}{\mathbb{R}} \renewcommand{\epsilon}{\varepsilon} s_i=\theta+\epsilon_i$$ of $\theta$. The errors $\epsilon_i$ in these signals are independent of $\theta$. They are jointly normally distributed with equal variances $\Var(\epsilon_i)=\sigma^2$ and pairwise correlations $$\Cor(\epsilon_i,\epsilon_j)=\begin{cases} 1 & \text{if}\ i=j \\ \rho & \text{otherwise}. \end{cases}$$ I assume $-1/(n-1)\le\rho\le1$ so that this distribution is feasible.¹

Observing $s_1,s_2,\ldots,s_n$ is the same as observing the sample mean $$\bar{s}_n\equiv\frac{1}{n}\sum_{i=1}^ns_i,$$ which is normally distributed and has conditional variance $$\Var(\bar{s}_n\mid\theta)=\frac{(1+(n-1)\rho)\sigma^2}{n}$$ under my prior. The posterior distribution of $\theta$ given $\bar{s}_n$ is also normal and has variance $$\Var(\theta\mid\bar{s}_n)=\left(\frac{1}{\sigma_0^2}+\frac{n}{(1+(n-1)\rho)\sigma^2}\right)^{-1}.$$ Both variances are (i) decreasing in $n$ when $\rho<1$ and (ii) increasing in $\rho$ when $n>1$. Intuitively, if the signals are not perfectly correlated then observing more gives me more information about $\theta$. If they are negatively correlated then their errors “cancel out” and the sample mean $\bar{s}_n$ gives me a precise estimate of $\theta$.

The chart below shows how $\Var(\bar{s}_n\mid\theta)$ and $\Var(\theta\mid\bar{s}_n)$ vary with $\rho$ and $n$ when $\sigma_0=\sigma=1$. If $\rho=-1/(n-1)$ then $\epsilon_1+\epsilon_2+\cdots+\epsilon_n=0$, and so $\Var(\bar{s}_n\mid\theta)=0$ and $\Var(\theta\mid\bar{s}_n)=0$ because $\bar{s}_n=\theta$. Whereas if $\rho=1$ then signals $s_2$ through $s_n$ provide the same information as $s_1$, and so $\Var(\bar{s}_n\mid\theta)=\Var(s_1\mid\theta)$ and $\Var(\theta\mid\bar{s}_n)=\Var(\theta\mid s_1)$.
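The two boundary cases can be checked with a few lines of R. This sketch codes the variances as $(1+(n-1)\rho)\sigma^2/n$ and $\left(1/\sigma_0^2+n/((1+(n-1)\rho)\sigma^2)\right)^{-1}$; the function names and the choice $n=5$ are illustrative.

```r
# Conditional variance of the sample mean given theta
var_mean = function(n, rho, sigma = 1) {
  (1 + (n - 1) * rho) * sigma^2 / n
}

# Posterior variance of theta given the sample mean
var_post = function(n, rho, sigma0 = 1, sigma = 1) {
  1 / (1 / sigma0^2 + 1 / var_mean(n, rho, sigma))
}

n = 5  # hypothetical number of signals

# Lowest feasible correlation: the errors sum to zero, so both variances vanish
var_mean(n, rho = -1 / (n - 1))
var_post(n, rho = -1 / (n - 1))

# Perfect correlation: n signals are worth as much as one
var_mean(n, rho = 1)
var_post(n, rho = 1)
var_post(1, rho = 0)  # a single signal, for comparison
```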

For example, it is impossible for three normal variables to have equal variances and pairwise correlations of $-1$. See here for an explanation. ↩︎

Simulating Wiener and Ornstein-Uhlenbeck processes

Tue, 29 Aug 2023 00:00:00 +0000

A (standard) Wiener process is a continuous-time stochastic process $\{W(t)\}_{t\ge0}$ with initial value $W(0)=0$ and instantaneous increments $$\newcommand{\der}{\mathrm{d}} \der W(t)\sim N(0,\der t).$$ We can simulate such a process as follows. First, create a sequence of times $t$ at which to store the value of $W(t)$:

t_max = 100
dt = 1e-2

t = seq(0, t_max, by = dt)

Increasing t_max creates a longer path, while decreasing dt creates a smoother path. Now simulate the random increments and take their cumulative sum:

dW = rnorm(length(t) - 1, mean = 0, sd = sqrt(dt))
W = c(0, cumsum(dW))

Here are three sample paths generated by this procedure:

We can use $\{W(t)\}_{t\ge0}$ to construct an Ornstein-Uhlenbeck process $\{X(t)\}_{t\ge0}$. This process has instantaneous increments $$\der X(t)=-\theta X(t)\der t+\der W(t),$$ where $\theta\ge0$ controls the process’ tendency to mean-revert. We can compute its values $X(t)$ by iterating over dW:

theta = 1

# X needs one value per time in t, starting from X(0) = 0
X = rep(0, length(t))
for (i in seq_along(dW)) {
  # Euler step: mean reversion plus the i-th Wiener increment
  X[i + 1] = X[i] - theta * X[i] * dt + dW[i]
}

The chart below compares the sample paths obtained using different $\theta$ values. Each path uses the same realization of the underlying Wiener process $\{W(t)\}_{t\ge0}$. If $\theta=0$ then $X(t)=W(t)$ for all $t\ge0$. The mean magnitude of $X(t)$ falls as $\theta$ rises because this makes the process more mean-reverting.
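The claim that $\theta=0$ recovers the Wiener process can be checked numerically. The sketch below is self-contained: it wraps the Euler iteration in a hypothetical helper `simulate_ou`, uses an arbitrary seed and a short horizon, and compares the result against the cumulative sum of the same increments.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

t_max = 1
dt = 1e-2
t = seq(0, t_max, by = dt)

dW = rnorm(length(t) - 1, mean = 0, sd = sqrt(dt))
W = c(0, cumsum(dW))

# Euler scheme for dX = -theta * X dt + dW, starting from X(0) = 0
simulate_ou = function(theta, dW, dt) {
  X = rep(0, length(dW) + 1)
  for (i in seq_along(dW)) {
    X[i + 1] = X[i] - theta * X[i] * dt + dW[i]
  }
  X
}

X_zero = simulate_ou(theta = 0, dW = dW, dt = dt)
X_one = simulate_ou(theta = 1, dW = dW, dt = dt)

# With theta = 0 the path should match W up to rounding error
max(abs(X_zero - W))
```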

Inverting matrices of pairwise minima

Sun, 20 Aug 2023 00:00:00 +0000

Let $0<x_1<x_2<\ldots<x_n$ and let $A$ be the symmetric $n\times n$ matrix with ${ij}^\text{th}$ entry $A_{ij}=\min\{x_i,x_j\}$.¹ This matrix has linearly independent columns and so is invertible. Its inverse $A^{-1}$ is symmetric, tridiagonal, and has ${ij}^\text{th}$ entry $$[A^{-1}]_{ij}=\begin{cases} \frac{1}{x_1}+\frac{1}{x_2-x_1} & \text{if}\ i=j=1 \\ \frac{1}{x_i-x_{i-1}}+\frac{1}{x_{i+1}-x_i} & \text{if}\ 1<i=j<n \\ \frac{1}{x_n-x_{n-1}} & \text{if}\ i=j=n \\ -\frac{1}{x_j-x_i} & \text{if}\ i=j-1 \\ -\frac{1}{x_i-x_j} & \text{if}\ i=j+1 \\ 0 & \text{otherwise}. \end{cases}$$ For example, if $x_i=2^{i-1}$ for each $i\le n=5$ then $$A=\begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 2 & 2 \\ 1 & 2 & 4 & 4 & 4 \\ 1 & 2 & 4 & 8 & 8 \\ 1 & 2 & 4 & 8 & 16 \\ \end{bmatrix}$$ and $$A^{-1}=\begin{bmatrix} 2 & -1 & 0 & 0 & 0 \\ -1 & 1.5 & -0.5 & 0 & 0 \\ 0 & -0.5 & 0.75 & -0.25 & 0 \\ 0 & 0 & -0.25 & 0.375 & -0.125 \\ 0 & 0 & 0 & -0.125 & 0.125 \\ \end{bmatrix}$$ You may wonder: why is this useful? Suppose I observe data $\{(x_i,y_i)\}_{i=1}^n$, where the function $f:[0,\infty)\to\mathbb{R}$ mapping regressors $x_i\ge0$ to outcomes $y_i=f(x_i)$ is the realization of a Wiener process. I use these data to estimate some value $f(x)$ via Bayesian regression. My estimate depends on the inverse of the covariance matrix for the outcome vector $y=(y_1,y_2,\ldots,y_n)$. This matrix has ${ij}^\text{th}$ entry $\min\{x_i,x_j\}$, so I can compute its inverse using the expression above.
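Here is a short sketch in R that builds $A^{-1}$ from the expression above and checks it against the example. The helper name `min_matrix_inverse` is hypothetical, and the code assumes $n\ge2$ with strictly increasing, positive $x_i$.

```r
# Tridiagonal inverse of the matrix with entries min(x_i, x_j)
# Assumes length(x) >= 2 and 0 < x[1] < x[2] < ... < x[n]
min_matrix_inverse = function(x) {
  n = length(x)
  gaps = diff(c(0, x))  # x_1, x_2 - x_1, ..., x_n - x_{n-1}
  B = matrix(0, n, n)
  diag(B) = 1 / gaps + c(1 / gaps[-1], 0)  # last diagonal entry has one term
  off = -1 / gaps[-1]
  B[cbind(1:(n - 1), 2:n)] = off  # superdiagonal
  B[cbind(2:n, 1:(n - 1))] = off  # subdiagonal
  B
}

x = 2^(0:4)
A = outer(x, x, pmin)
A_inv = min_matrix_inverse(x)

A_inv
max(abs(A_inv %*% A - diag(length(x))))  # should be zero up to rounding error
```

Building the inverse this way takes a number of operations proportional to $n$, whereas a general-purpose routine like `solve()` does not exploit the tridiagonal structure.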

Let me know if the family of such matrices has a name! ↩︎

Correlation and concordance

Thu, 03 Aug 2023 00:00:00 +0000

Let $X=(X_1,X_2)$ be a random vector in $\mathbb{R}^2$. Two realizations $x$ and $x'$ of $X$ form a concordant pair if $(x_2'-x_2)$ and $(x_1'-x_1)$ have the same sign. What’s the probability of sampling a concordant pair when $X$ is bivariate normal?

For example, suppose $X_1$ and $X_2$ have zero means, unit variances, and a correlation of $\rho$. The scatter plots below show 100 realizations of $(X_1,X_2)$ when $\rho\in\{-0.5,0,0.5\}$. These realizations contain $$\binom{100}{2}=4,\!950$$ pairs, of which 36% are concordant when $\rho=-0.5$. This percentage rises to 48% when $\rho=0$ and to 71% when $\rho=0.5$. Increasing $\rho$ makes concordance more likely because it makes $X_2$ move more closely with $X_1$.

Different samples give different concordance rates due to sampling variation. We can remove this variation by deriving the concordance rate analytically. To begin, suppose $X$ has mean $\mathrm{E}[X]=(\mu_1,\mu_2)$ and covariance matrix $$\mathrm{Var}(X)=\begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}.$$ Then $X_2\mid X_1$ is normal with mean $$\mathrm{E}[X_2\mid X_1]=\mu_2+\frac{\rho\sigma_2}{\sigma_1}(X_1-\mu_1)$$ and variance $$\mathrm{Var}(X_2\mid X_1)=(1-\rho^2)\sigma_2^2.$$ So for any two realizations $x$ and $x'$ of $X$ we can write $$\renewcommand{\epsilon}{\varepsilon} x'_2-x_2=\frac{\rho\sigma_2}{\sigma_1}\left(x'_1-x_1\right)+\epsilon$$ with $\epsilon\sim N(0,2(1-\rho^2)\sigma_2^2)$. Now $x'_1-x_1\sim N(0,2\sigma_1^2)$ is normal, and so $$z\equiv \frac{x'_1-x_1}{\sigma_1\sqrt{2}}$$ is standard normal and exceeds zero if and only if $x'_1>x_1$. Letting $f$ and $\phi$ be the density functions for $\epsilon$ and $z$ then gives $$\newcommand{\der}{\mathrm{d}} \begin{align} \Pr(x'_2>x_2\ \text{and}\ x'_1>x_1) &= \Pr(\sqrt{2}\rho\sigma_2 z+\epsilon>0\ \text{and}\ z>0) \\ &= \int_0^\infty\left(\int_{-\sqrt{2}\rho\sigma_2 z}^\infty f(\epsilon)\,\der \epsilon\right)\phi(z)\,\der z \\ &\overset{\star}{=} \int_0^\infty\left(\int_{\frac{-\rho z}{\sqrt{1-\rho^2}}}^\infty \phi(w)\,\der w\right)\phi(z)\,\der z \\ &= \int_0^\infty\left(1-\Phi\left(\frac{-\rho z}{\sqrt{1-\rho^2}}\right)\right)\phi(z)\,\der z \\ &\overset{\star\star}{=} \frac{1}{2}-\int_0^\infty\Phi\left(\frac{-\rho z}{\sqrt{1-\rho^2}}\right)\phi(z)\,\der z, \end{align}$$ where $\Phi$ is the standard normal CDF, where $\star$ uses the change of variables $$w\equiv \frac{\epsilon}{\sigma_2\sqrt{2(1-\rho^2)}},$$ and where $\star\star$ uses the symmetry of $\phi$ about $z=0$. 
But $f$ is symmetric about $\epsilon=0$, which implies $$\Pr(x'_2>x_2\ \text{and}\ x'_1>x_1)=\Pr(x'_2<x_2\ \text{and}\ x'_1<x_1),$$ and therefore $$\begin{align} C(\rho) &\equiv \Pr(x\ \text{and}\ x'\ \text{are concordant}) \\ &= \Pr(x'_2>x_2\ \text{and}\ x'_1>x_1)+\Pr(x'_2<x_2\ \text{and}\ x'_1<x_1) \\ &= 1-2\int_0^\infty\Phi\left(\frac{-\rho z}{\sqrt{1-\rho^2}}\right)\phi(z)\,\der z. \end{align}$$ The concordance rate $C(\rho)$ depends on the correlation $\rho$ of $X_1$ and $X_2$, but not their means or variances. It has value $C(0)=0.5$ when $\rho=0$ because $\Phi(0)=0.5$ is constant. Intuitively, if $X_1$ and $X_2$ are uncorrelated then we can’t use $(x'_1-x_1)$ to predict $(x'_2-x_2)$, which is equally likely to be positive or negative. Whereas if $\lvert\rho\rvert=1$ then $(x'_1-x_1)$ predicts $(x'_2-x_2)$ perfectly, and so $$\lim_{\rho\to1}C(\rho)=1$$ and $$\lim_{\rho\to-1}C(\rho)=0.$$ The chart below verifies that the concordance rate $C(\rho)$ grows with $\rho$. It also shows that $$C(\rho)+C(-\rho)=1.$$ Thus, for example, we have $C(-0.5)=1/3$ and $C(0.5)=2/3$. These values remove the sampling error from the estimates 0.36 and 0.71 obtained using the 100 realizations above.
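The integral defining $C(\rho)$ can be evaluated numerically with R's `integrate()`. The sketch below also compares the result with $1/2+\arcsin(\rho)/\pi$, a standard closed form for bivariate normals that is not derived in this post. The function name `concordance` is illustrative.

```r
# Concordance rate C(rho) via numerical integration, for -1 < rho < 1
concordance = function(rho) {
  integrand = function(z) pnorm(-rho * z / sqrt(1 - rho^2)) * dnorm(z)
  1 - 2 * integrate(integrand, lower = 0, upper = Inf)$value
}

rhos = c(-0.5, 0, 0.5)
numeric_rates = sapply(rhos, concordance)
closed_form = 0.5 + asin(rhos) / pi  # known result, stated here without proof

cbind(rho = rhos, numeric = numeric_rates, closed_form = closed_form)
```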

The option value of waiting

Sun, 16 Jul 2023 00:00:00 +0000

This post is about waiting for information before taking an action. It uses a simple model to explain when and why waiting is valuable. It formalizes some ideas discussed in my posts on climate change and pandemic policy.

Suppose it costs $c>0$ to take an action that pays $b>c$ if it is beneficial ($\omega=1$) and zero otherwise ($\omega=0$). I take the action if its expected net benefit $$\newcommand{\E}{\mathrm{E}} \E[\omega b-c]=pb-c$$ exceeds zero, where $p=\Pr(\omega=1)$ is my prior belief about $\omega$. Thus, my decision rule is to take the action whenever $p$ exceeds the cost-benefit ratio $c/b$.

Now suppose I can wait for a noisy signal $s\in\{0,1\}$ with error rate $$\renewcommand{\epsilon}{\varepsilon} \Pr(s\not=\omega\mid \omega)=\epsilon\in[0,0.5].$$ I use my prior, the signal, and Bayes’ rule to form a posterior belief $$\begin{align} q_s &\equiv \Pr(\omega=1\mid s) \\ &= \begin{cases} \frac{\epsilon p}{(1-\epsilon)(1-p)+\epsilon p} & \text{if}\ s=0 \\ \frac{(1-\epsilon)p}{\epsilon(1-p)+(1-\epsilon)p} & \text{if}\ s=1 \end{cases} \end{align}$$ about $\omega$. Then I take the action if its expected net benefit $$\begin{align} \E[\omega b-c\mid s] &= q_sb-c \end{align}$$ given $s$ exceeds zero. This happens with probability $$\Pr(q_sb-c\ge0)=\begin{cases} 1 & \text{if}\ c/b\le q_0 \\ \Pr(s=1) & \text{if}\ q_0<c/b\le q_1 \\ 0 & \text{if}\ q_1<c/b, \end{cases}$$ where the probability $$\Pr(s=1)=\epsilon(1-p)+(1-\epsilon)p$$ of receiving a positive signal depends on my prior $p$ and the error rate $\epsilon$.

If $c/b\le q_0$ or $q_1<c/b$ then the signal doesn’t affect whether I take the action, so I don’t need to wait. But if $q_0<c/b\le q_1$ then waiting gives me a real option not to take the action if I learn it isn’t beneficial. So the expected benefit of waiting equals $$\begin{align} W &\equiv \delta\,\E\left[\E[\max\{0,q_sb-c\}\mid s]\right] \\ &= \begin{cases} \delta(pb-c) & \text{if}\ c/b\le q_0 \\ \delta(q_1b-c)\Pr(s=1) & \text{if}\ q_0<c/b\le q_1 \\ 0 & \text{if}\ q_1<c/b, \end{cases} \end{align}$$ where the discount factor $\delta\in[0,1]$ captures (i) my patience and (ii) my confidence that the action will still be available if I wait.

I should take the action before receiving $s$ if and only if the expected net benefit $(pb-c)$ under my prior exceeds $W$. This happens precisely when my prior exceeds $$p^*\equiv\frac{(1-\delta\epsilon)c}{b-\delta((1-\epsilon)b-(1-2\epsilon)c)}.$$ The following chart plots $p^*$ against $\delta$ when $c/b\in\{0.1,0.3,0.5\}$ and $\epsilon\in\{0,0.25,0.5\}$. Increasing the discount factor $\delta$ or the cost-benefit ratio $c/b$ raises the option value of waiting, which raises the threshold prior $p^*$ above which I should take the action. Increasing the error rate $\epsilon$ makes the signal less informative, which lowers the option value of waiting and, hence, lowers $p^*$. If $\epsilon=0.5$ then the signal is uninformative and so $p^*=c/b$ independently of $\delta$.
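The threshold $p^*$ is simple to compute. Here is a minimal sketch in R; the function name `p_star`, the argument names, and the example values are my own.

```r
# Threshold prior above which acting now beats waiting for the signal
# delta: discount factor, eps: signal error rate, cost: c, benefit: b
p_star = function(delta, eps, cost, benefit) {
  numerator = (1 - delta * eps) * cost
  denominator = benefit - delta * ((1 - eps) * benefit - (1 - 2 * eps) * cost)
  numerator / denominator
}

# Hypothetical example with cost-benefit ratio c / b = 0.3
p_star(delta = 0.9, eps = 0, cost = 3, benefit = 10)     # perfectly informative signal
p_star(delta = 0.9, eps = 0.25, cost = 3, benefit = 10)  # noisy signal
p_star(delta = 0.9, eps = 0.5, cost = 3, benefit = 10)   # uninformative signal: c / b
p_star(delta = 0, eps = 0.25, cost = 3, benefit = 10)    # no value from waiting: c / b
```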

Learning in continuous time

Sat, 08 Jul 2023 00:00:00 +0000

This post describes a continuous-time model of Bayesian learning about a binary state. It complements the discrete-time models discussed in previous posts (see, e.g., here or here). I present the model, discuss its learning dynamics, and derive these dynamics analytically.

The model has been used to study decision times (Fudenberg et al., 2018), experimentation (Bolton and Harris, 1999; Moscarini and Smith, 2001), information acquisition (Morris and Strack, 2019), and persuasion (Liao, 2021). It also underlies the drift-diffusion model of reaction times used by psychologists—see Ratcliff (1978) for an early example, and Hébert and Woodford (2023) or Smith (2000) for related discussions.

Model

Suppose I want to learn about a state $\mu$ that may be high (equal to $H$) or low (equal to $L<H$). I observe a continuous sample path $(X_t)_{t\ge0}$ with random, instantaneous increments $$\DeclareMathOperator{\E}{E} \newcommand{\der}{\mathrm{d}} \newcommand{\R}{\mathbb{R}} \der X_t=\mu\der t+\sigma \der W_t,$$ where $\sigma>0$ amplifies the noise generated by the standard Wiener process $(W_t)_{t\ge0}$. These increments provide noisy signals of the state $\mu$. I use these signals, my prior belief $p_0=\Pr(\mu=H)$, and Bayes’ rule to form a posterior belief $$p_t\equiv \Pr\left(\mu=H\mid (X_s)_{s<t}\right)$$ about $\mu$ given the sample path observed up to time $t$. As shown below, this posterior belief has increments $$\der p_t=p_t(1-p_t)\frac{(H-L)}{\sigma}\der Z_t,$$ where $(Z_t)_{t\ge0}$ is a Wiener process with respect to my information at time $t$. Its increments $$\der Z_t=\frac{1}{\sigma}\left(\der X_t-\hat\mu_t\der t\right)$$ exceed zero precisely when the corresponding increments $\der X_t$ in the sample path exceed my posterior estimates $$\begin{align} \hat\mu_t &\equiv \E\left[\mu\mid (X_s)_{s<t}\right] \\ &= p_tH+(1-p_t)L. \end{align}$$

Learning dynamics

My belief increments $\der p_t$ get smaller as $p_t$ approaches zero or one. The ratio $(H-L)/\sigma$ controls how quickly this happens. Intuitively, if $(H-L)$ is large then the high and low states are easy to tell apart from the trends in $(X_t)_{t\ge0}$ they imply. But if $\sigma$ is large then these trends are blurred by the random fluctuations $\sigma\der W_t$.

I illustrate these dynamics in the chart below. It shows the sample paths $(X_t)_{t\ge0}$ and corresponding beliefs $(p_t)_{t\ge0}$ when $(H,L,\mu,p_0)=(1,0,H,0.5)$ and $\sigma\in\{1,2\}$. I use the same realization of the underlying Wiener process $(W_t)_{t\ge0}$ for each value of $\sigma$. Increasing this value slows my convergence to the correct belief $p_t=1$ because it makes the signals $\der X_t$ less informative about $\mu=H$.
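Here is a rough sketch of how one might simulate such a belief path in R. It applies an Euler discretization to the increments $\mathrm{d}p_t$ and clamps the result to $[0,1]$, since a discrete step can overshoot the unit interval. The seed, step size, and horizon are arbitrary choices rather than the ones used for the chart.

```r
set.seed(1)  # arbitrary seed

H = 1; L = 0; sigma = 1; p0 = 0.5
mu = H          # true state
dt = 1e-3       # hypothetical step size
n_steps = 5000  # hypothetical number of steps

# Increments of the observed sample path
dX = mu * dt + sigma * rnorm(n_steps, mean = 0, sd = sqrt(dt))

p = numeric(n_steps + 1)
p[1] = p0
for (i in seq_len(n_steps)) {
  mu_hat = p[i] * H + (1 - p[i]) * L     # current estimate of mu
  dZ = (dX[i] - mu_hat * dt) / sigma     # surprise in the increment
  step = p[i] * (1 - p[i]) * (H - L) / sigma * dZ
  p[i + 1] = min(1, max(0, p[i] + step)) # keep the belief in [0, 1]
}

range(p)
p[n_steps + 1]  # belief at the end of the horizon
```

Rerunning the loop with a larger `sigma` and the same seed gives a rough sense of how noise slows learning.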

Deriving the belief increments

The increments $\der W_t$ of the Wiener process $(W_t)_{t\ge0}$ are iid normally distributed with mean zero and variance $\der t$: $$\der W_t\sim N(0,\der t).$$ Thus, given $\mu$, the increments $\der X_t$ of the sample path $(X_t)_{t\ge0}$ are iid normal with mean $\mu\der t$ and variance $\sigma^2\der t$: $$\der X_t\mid\mu\sim N(\mu\der t,\sigma^2\der t).$$ So these increments have conditional PDF $$\begin{align} f_\mu(\der X_t) &= \frac{1}{\sigma\sqrt{2\pi\der t}}\exp\left(-\frac{(\der X_t-\mu\der t)^2}{2\sigma^2\der t}\right) \\ &= \frac{1}{\sigma\sqrt{2\pi\der t}}\exp\left(-\frac{(\der X_t)^2}{2\sigma^2\der t}\right)\exp\left(\frac{\mu\der X_t}{\sigma^2}-\frac{\mu^2\der t}{2\sigma^2}\right). \end{align}$$ But the rules of Itô calculus imply $(\der X_t)^2=\sigma^2\der t$ and $$\begin{align} \exp\left(\frac{\der X_t\mu}{\sigma^2}-\frac{\mu^2\der t}{2\sigma^2}\right) &= \sum_{k\ge0}\frac{1}{k!}\left(\frac{\mu\der X_t}{\sigma^2}-\frac{\mu^2\der t}{2\sigma^2}\right)^k \\ &= 1+\frac{\mu\der X_t}{\sigma^2} \end{align}$$ because these rules treat terms of order $(\der t)^2$ or smaller as equal to zero. Thus $$f_\mu(\der X_t)=\frac{1}{\sigma^3\sqrt{2\pi\der t}}\exp\left(-\frac{1}{2}\right)\left(\mu\der X_t+\sigma^2\right)$$ for each $\mu\in\{H,L\}$. Applying Bayes’ rule then gives $$\begin{align} p_{t+\der t} &= \frac{p_tf_H(\der X_t)}{p_tf_H(\der X_t)+(1-p_t)f_L(\der X_t)} \\ &= \frac{p_t\left(H\der X_t+\sigma^2\right)}{\hat\mu_t\der X_t+\sigma^2}, \end{align}$$ where $\hat\mu_t=\E[\mu\mid (X_s)_{s<t}]$ is my posterior estimate of $\mu$. So the belief process $(p_t)_{t\ge0}$ has increments $$\begin{align} \der p_t &\equiv p_{t+\der t}-p_t \\ &= \frac{p_t(1-p_t)(H-L)\der X_t}{\hat\mu_t\der X_t+\sigma^2}. 
\end{align}$$ Finally, taking a Maclaurin series expansion and applying the rules of Itô calculus gives $$\begin{align} \frac{\der X_t}{\hat\mu_t\der X_t+\sigma^2} &= \der X_t\sum_{k\ge0}\frac{(-1)^k\hat\mu_t^k}{(\sigma^2)^{k+1}}(\der X_t)^k \\ &= \der X_t\left(\frac{1}{\sigma^2}-\frac{\hat\mu_t}{\sigma^4}\der X_t\right) \\ &= \frac{1}{\sigma^2}\left(\der X_t-\hat\mu_t\der t\right), \end{align}$$ from which we obtain the expressions for $\der p_t$ and $\der Z_t$ provided above.

Paying for precision

Tue, 04 Jul 2023 00:00:00 +0000

Suppose my payoff $u(a,\mu)\equiv-(a-\mu)^2$ from taking an action $a\in\mathbb{R}$ depends on an unknown state $\mu\in\mathbb{R}$.¹ I can learn about $\mu$ by collecting data $X=\{x_1,x_2,\ldots,x_n\}$, where the observations $x_i$ are iid normally distributed with mean $\mu$ and variance $\sigma^2$:² $$x_i\mid \mu\sim N(\mu,\sigma^2).$$ I use these data, my prior belief $$\mu\sim N(\mu_0,\sigma_0^2),$$ and Bayes’ rule to form a posterior belief $$\mu\mid X\sim N\left(\frac{\tau_0}{\tau_0+n\tau}\mu_0+\frac{n\tau}{\tau_0+n\tau}\bar{x},\frac{1}{\tau_0+n\tau}\right),$$ where $\tau_0\equiv1/\sigma_0^2$ is the precision of my prior, $\tau\equiv1/\sigma^2$ is the precision of the $x_i$, and $$\bar{x}\equiv\frac{1}{n}\sum_{i=1}^nx_i$$ is their arithmetic mean. Then my expected payoff from taking action $a$ equals $$\DeclareMathOperator{\E}{E} \DeclareMathOperator{\Var}{Var} \E[u(a,\mu)\mid X]=-(a-\E[\mu\mid X])^2-\Var(\mu\mid X).$$ I maximize this payoff by choosing $a^*\equiv\E[\mu\mid X]$. This yields expected payoff $$\E[u(a^*,\mu)\mid X]=-\frac{1}{\tau_0+n\tau},$$ which is increasing in $n$. Intuitively, collecting more data makes me more informed and makes my optimal action more likely to be “correct.” But data are costly: I have to pay $\kappa n\tau$ to collect $n$ observations, where $\kappa>0$ captures the marginal cost of information.³ I choose $n$ to maximize my total payoff $$\begin{align*} U(n) &\equiv \E[u(a^*,\mu)\mid X]-\kappa n\tau, \end{align*}$$ which has maximizer $$n^*=\max\left\{0,\frac{1}{\tau}\left(\frac{1}{\sqrt\kappa}-\tau_0\right)\right\}.$$ If $1\le\sqrt\kappa\tau_0$ then $n^*=0$ because the cost of collecting any data isn’t worth the variance reduction they deliver. Whereas if $1>\sqrt\kappa\tau_0$ then $n^*$ is strictly positive and gives me total payoff $$U(n^*)=-2\sqrt\kappa+\kappa\tau_0.$$ Both $n^*$ and $U(n^*)$ are decreasing in $\kappa$. 
Intuitively, making the data more expensive makes me want to collect less, leaving me less informed and worse off. In contrast, making my prior more precise (i.e., increasing $\tau_0$) makes me want to collect less data but leaves me better off. This is because being well-informed means I can pay for less data and still be well-informed.

Curiously, making the $x_i$ more precise (i.e., increasing $\tau$) makes me want to collect less data but does not change my welfare. This is because the cost $\kappa\tau$ of each observation $x_i$ scales with its precision, so I buy the same total precision $n^*\tau$ using fewer observations. This cost exactly offsets the value of the information gained, leaving my total payoff $U(n^*)$ unchanged.
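These comparative statics are easy to check numerically. The sketch below treats $n$ as continuous, as the formula for $n^*$ does, and uses hypothetical values $\kappa=0.04$ and $\tau_0=1$. The function names are illustrative.

```r
# Optimal sample size and total payoff, treating n as continuous
n_star = function(kappa, tau0, tau) {
  max(0, (1 / sqrt(kappa) - tau0) / tau)
}
total_payoff = function(n, kappa, tau0, tau) {
  -1 / (tau0 + n * tau) - kappa * n * tau
}

kappa = 0.04  # hypothetical marginal cost of information
tau0 = 1      # hypothetical prior precision

# Doubling tau halves n* but leaves the total payoff unchanged
for (tau in c(1, 2)) {
  n_opt = n_star(kappa, tau0, tau)
  cat("tau =", tau, " n* =", n_opt,
      " U(n*) =", total_payoff(n_opt, kappa, tau0, tau), "\n")
}

# Closed-form payoff at the optimum, for comparison
-2 * sqrt(kappa) + kappa * tau0
```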

See here for my discussion of the case when the state and data are binary. ↩︎
This is the same as letting $x_i=\mu+\varepsilon_i$ with iid errors $\varepsilon_i\sim N(0,\sigma^2)$. ↩︎
Pomatto et al. (2023) show that this cost function (uniquely) satisfies some attractive properties. Linear cost functions also appear in many sequential sampling problems (see, e.g., Wald’s (1945) classic model or Morris and Strack’s (2019) discussion of it) and their continuous-time analogues (see, e.g., Fudenberg et al. (2018) or Liang et al. (2022)). ↩︎

Binary signals and posterior variances

Sun, 02 Jul 2023 00:00:00 +0000

Suppose I receive a noisy signal $s\in\{0,1\}$ about an unknown state $\omega\in\{0,1\}$. The signal has false positive rate $$\renewcommand{\epsilon}{\varepsilon} \Pr(s=1\mid\omega=0)=\alpha$$ and false negative rate $$\Pr(s=0\mid\omega=1)=\beta$$ with $\alpha,\beta\in[0,0.5]$.¹ I use these rates, my prior belief $p=\Pr(\omega=1)$, and Bayes’ rule to form a posterior belief $$\begin{align} q_s &\equiv \Pr(\omega=1\mid s) \\ &= \frac{\Pr(s\mid\omega=1)\Pr(\omega=1)}{\Pr(s)} \\ &= \begin{cases} \frac{\beta p}{(1-\alpha)(1-p)+\beta p} & \text{if}\ s=0 \\ \frac{(1-\beta)p}{\alpha(1-p)+(1-\beta)p} & \text{if}\ s=1 \end{cases} \end{align}$$ that depends on the signal I receive.

Now suppose I take an action $a\in[0,1]$ with cost $c(a,\omega)\equiv(a-\omega)^2$. I want to minimize my expected cost $$\DeclareMathOperator{\E}{E} \begin{align} \E[c(a,\omega)\mid s] &= (1-q_s)c(a,0)+q_sc(a,1) \\ &= (1-q_s)a^2+q_s(a-1)^2 \end{align}$$ given $s$, which leads me to choose $a=q_s$. Then my minimized expected cost $$\begin{align} \E[c(q_s,\omega)\mid s] &= q_s(1-q_s) \\ &= p(1-p)\times\begin{cases} \frac{(1-\alpha)\beta}{\left((1-\alpha)(1-p)+\beta p\right)^2} & \text{if}\ s=0 \\ \frac{\alpha(1-\beta)}{\left(\alpha(1-p)+(1-\beta)p\right)^2} & \text{if}\ s=1 \end{cases} \end{align}$$ equals the posterior variance in my belief about $\omega$ after receiving $s$. The expected value of this variance before receiving $s$ equals $$\begin{align} V(p,\alpha,\beta) &\equiv q_0(1-q_0)\Pr(s=0)+q_1(1-q_1)\Pr(s=1) \\ &= p(1-p)\times\frac{\alpha(1-\alpha)(1-p)+\beta(1-\beta)p}{\left((1-\alpha)(1-p)+\beta p\right)\left(\alpha(1-p)+(1-\beta)p\right)}, \end{align}$$ which depends on my prior $p$ as well as the error rates $\alpha$ and $\beta$. For example, the chart below plots $$V(p,\epsilon,\epsilon)=p(1-p)\times\frac{\epsilon(1-\epsilon)}{p(1-p)+\epsilon(1-\epsilon)(1-2p)^2}$$ against $\epsilon$ when $p\in\{0.5,0.7,0.9\}$. If $\epsilon=0$ then the signal is fully informative because it always matches the state $\omega$. Larger values of $\epsilon\le0.5$ lead to less precise posterior beliefs. Indeed if $\epsilon=0.5$ then the signal is uninformative because $\Pr(s=1)=0.5$ (and, hence, $q_0=q_1=p$) independently of $\omega$. The slope $\partial V(p,\epsilon,\epsilon)/\partial\epsilon$ falls as my prior $p$ moves away from $0.5$ because having a more precise prior makes my beliefs less sensitive to the signal.

The next chart shows the contours of $V(p,\alpha,\beta)$ in the $\alpha\beta$-plane. These contours are symmetric across the diagonal line $\alpha=\beta$ when my prior $p$ equals $0.5$ but asymmetric when $p\not=0.5$. Intuitively, if I have a strong prior that $\omega=1$ then positive signals $s=1$ are less surprising, and shift my belief less, than negative signals $s=0$. So if $p>0.5$ then I need to increase the false positive rate $\alpha$ by more than I decrease the false negative rate $\beta$ to keep $V(p,\alpha,\beta)$ constant.

One consequence of this asymmetry is that the constrained minimization problem $$\min_{\alpha,\beta}V(p,\alpha,\beta)\ \text{subject to}\ 0\le\alpha,\beta\le0.5\ \text{and}\ \alpha+\beta\ge B$$ has a corner solution $$(\alpha^*,\beta^*)=\begin{cases} (0,B) & \text{if}\ p\le1/2 \\ (B,0) & \text{if}\ p>1/2 \end{cases}$$ for all lower bounds $B\in[0,0.5]$ on the sum of the error rates. Intuitively, if I can limit my exposure to false positives and negatives then I should prevent whichever occur in the state that’s most likely under my prior. For example, if $p>0.5$ then I’m best off allowing some false positives but preventing any false negatives. This makes negative signals fully informative because they only occur when $\omega=0$.
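The corner solution can be checked with a brute-force grid search. The sketch below is illustrative rather than a proof: the function names `V` and `best_rates`, the grid step, and the numerical tolerance are all arbitrary choices made for this example.

```r
# Expected posterior variance V(p, alpha, beta), vectorized over alpha and beta
V = function(p, alpha, beta) {
  num = alpha * (1 - alpha) * (1 - p) + beta * (1 - beta) * p
  den = ((1 - alpha) * (1 - p) + beta * p) * (alpha * (1 - p) + (1 - beta) * p)
  p * (1 - p) * num / den
}

# Search a grid of feasible error rates for the pair that minimizes V
best_rates = function(p, B, step = 0.01) {
  grid = expand.grid(alpha = seq(0, 0.5, by = step),
                     beta = seq(0, 0.5, by = step))
  # Keep pairs with alpha + beta >= B (small tolerance for rounding error)
  grid = grid[grid$alpha + grid$beta >= B - 1e-9, ]
  grid$V = V(p, grid$alpha, grid$beta)
  grid[which.min(grid$V), c('alpha', 'beta')]
}

best_rates(p = 0.7, B = 0.3)
best_rates(p = 0.3, B = 0.3)
```

For these inputs the search should land on the corners $(B,0)$ and $(0,B)$, up to the grid's resolution.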

There is no loss in generality from assuming $\alpha,\beta\le0.5$ because observing $s$ is the same as observing $(1-s)$. ↩︎

Comparing equal- and value-weighted portfolios

Mon, 22 May 2023 00:00:00 +0000

Imagine two portfolios of S&P 500 companies. One portfolio weights all companies equally; the other weights companies by their market capitalization (hereafter “value”). Which portfolio is the better investment?

One way to answer this question is to look at historical data. For example, the Center for Research in Security Prices (CRSP) provides monthly returns on each portfolio between January 1926 and December 2022. I summarize these returns in the table below. They had overall means of 1.13% and 0.94%, standard deviations of 6.72% and 5.42%, and a Pearson correlation of 0.96.

Portfolio	Mean	Std. dev.	Min	Median	Max
Equal-weighted	1.13	6.72	-31.00	1.36	68.04
Value-weighted	0.94	5.42	-28.75	1.30	41.43

Suppose past and future returns have the same distribution. Then I expect the returns on the equal-weighted portfolio to be larger but riskier. So my preference over portfolios depends on my risk tolerance. I demonstrate this dependence in the chart below. It shows the certainty-equivalent (CE) return on each portfolio for a range of relative risk aversion (RRA) coefficients. The CE return equals the mean return when my RRA coefficient equals zero. It falls when my RRA coefficient rises because I demand a larger risk premium. The rate at which the CE return falls depends on the portfolio’s return distribution. Based on the distributions summarized above, I prefer the equal-weighted portfolio whenever my RRA coefficient is less than 2.76.¹
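One way to compute a CE return is sketched below. It assumes constant relative risk aversion (CRRA) utility over gross returns, which may differ from the specification behind the chart. The function name `ce_return` and the return series are hypothetical; the series is made up for illustration and is not the CRSP data.

```r
# Certainty-equivalent return under CRRA utility over gross returns
# (an assumed specification; gamma is the RRA coefficient)
ce_return = function(returns, gamma) {
  gross = 1 + returns
  if (gamma == 1) {
    exp(mean(log(gross))) - 1  # Log utility is the limiting case
  } else {
    mean(gross ^ (1 - gamma)) ^ (1 / (1 - gamma)) - 1
  }
}

# Hypothetical monthly returns, expressed as decimals
returns = c(-0.10, -0.02, 0.01, 0.03, 0.12)

# CE returns for a range of RRA coefficients
gammas = c(0, 1, 2, 3)
sapply(gammas, function(g) ce_return(returns, g))
```

Under this specification the CE return equals the mean return when `gamma = 0` and falls as `gamma` rises, as described above.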

Another way to compare the two portfolios is to look at their long-term growth rates. I do that in the chart below. It shows the capital gain I would have realized if I bought each portfolio in the past, reinvested my dividends, and sold my holdings at the end of 2022.² I make these gains comparable across holding periods by presenting them as mean monthly returns. For example, investing in the equal-weighted portfolio in December 2002 would have led to the same capital gain as investing in an asset that returned 0.90% every month for the next 20 years.
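The conversion from a cumulative gain to a mean monthly return can be sketched as a geometric mean. I assume that is the calculation intended here; the function name `mean_monthly_return` is hypothetical.

```r
# Constant monthly return that yields the same cumulative gain
# as a sequence of monthly returns (expressed as decimals)
mean_monthly_return = function(returns) {
  prod(1 + returns) ^ (1 / length(returns)) - 1
}

# Sanity check: an asset returning 0.90% every month for 20 years
mean_monthly_return(rep(0.009, 240))

# A gain of 100% followed by a loss of 50% leaves me where I started
mean_monthly_return(c(1, -0.5))
```

The second example shows why the geometric mean differs from the arithmetic mean: the arithmetic mean of those two returns is 25%, but the cumulative gain is zero.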

If I invested in either portfolio before September 2010, then I would have earned more on the equal-weighted portfolio. Its dominance over the value-weighted portfolio peaked in early 2000, when the dot-com crash saw lots of large companies lose lots of value.

Of course, past and future returns can differ. The equal-weighted portfolio may have been the better investment 20 years ago but could be a worse investment today. So what does the theory say?

Malladi and Fabozzi (2017) argue that the equal-weighted portfolio offers higher returns because it is regularly rebalanced. For example, if I start with equal shares in two companies, but one doubles in value and the other halves, then my portfolio will end with an 80/20 split. So if I want to maintain equal weights then I need to sell companies that grow a lot and buy companies that don’t. This contrarian strategy takes advantage of mean reversion. Indeed, Plyakha et al (2021) argue that maintaining unequal weights would also lead to higher mean returns. These arguments agree with empirical evidence that few, if any, investing strategies consistently outperform weighting stocks equally (e.g., DeMiguel et al, 2009; Hsu et al, 2018; Qin and Singal, 2022).
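The arithmetic behind the two-company example can be written out as follows. This is a toy calculation; the variable names are chosen for illustration.

```r
weights = c(0.5, 0.5)  # Equal initial shares in two companies
growth = c(2, 0.5)     # One doubles in value, the other halves

values = weights * growth
new_weights = values / sum(values)
new_weights  # Portfolio shares after the price changes

# Share of the portfolio I must move from the first company
# to the second to restore equal weights
new_weights[1] - 0.5
```

The portfolio's value rises from 1 to 1.25, of which the first company accounts for 1. That gives the 80/20 split, so rebalancing means selling 30% of the portfolio's value in the winner to buy the loser.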

Thanks to John Shoven for inspiring this post.

For reference, most macro/finance research uses coefficients between one and three. ↩︎
I focus on investments made before January 2020 to suppress the noise from (i) the COVID-19 pandemic and (ii) having few observations with which to compute means. ↩︎

Models of the AI apocalypse

Mon, 15 May 2023 00:00:00 +0000

In this week’s episode of EconTalk, Tyler Cowen asks:

“Is there any actual mathematical model of this process of how the world is supposed to end? … If you look, say, at COVID or climate change fears, in both cases, there are many models you can look at. … I’m not saying you have to like those models. But the point is: there’s something you look at and then you make up your mind whether or not you like those models; and then they’re tested against data. So, when it comes to AGI and existential risk, it turns out as best I can ascertain, in the 20 years or so we’ve been talking about this seriously, there isn’t a single model done.”

He goes on:

“I don’t think any idea should be dismissed. I’ve just been inviting [AI doomsayers] to actually join the discourse of science. ‘Show us your models. Let us see their assumptions and let’s talk about those.’ The practice, instead, is to write these very long pieces online, which just stack arguments vertically and raise the level of anxiety. … Their mental model is so much: ‘We’re the insiders, we’re the experts.’ … My mental model is: There’s a thing, science. Try to publish this stuff in journals. Try to model it.”

Good models don’t need to be complete descriptions of reality. But they do need to be logically consistent. Their purpose is to make explicit the assumptions and premises underlying our intuitions. Then we can subject those intuitions to formal scrutiny.

For example, suppose I think people should do X. I write down a model of the process by which they decide what to do. My model comprises a set of assumptions that imply X. Now I ask: Are my assumptions reasonable? Do I believe them? If not, then either (i) people shouldn’t do X or (ii) they don’t make decisions according to the process I’ve written down. Both cases teach me something: my intuition is wrong!

Tyler wants AI doomsayers to go on similar intellectual journeys. He wants to know: exactly what assumptions do they make when they say humanity is doomed? What are the logical foundations of that claim? Only by exposing those foundations can we test and revise them. That’s how science works. We tell each other how we think so that we can debate what we think. Models help us frame the debate. Sure, all models are wrong. But you can’t beat a model by waffling!

Who reads Marginal Revolution?

Mon, 08 May 2023 00:00:00 +0000

Here’s a summary of my website’s traffic since the start of 2023:

Notice the spike on April 9, when Tyler Cowen linked to my post on Marginal Revolution metadata. That post is now my second most-viewed ever (just behind my post on applying to economics PhD programs).

Where in the world did those views come from? Here’s a summary:

Most visitors came from the US. This makes sense: Marginal Revolution is run by American authors who tend to focus on American issues. About a third of my US-based visitors came from California, New York, or Massachusetts. Bigger states tended to bring more visitors, but the relationship was not perfect. For example, Californians comprise about 11.7% of the US population but 15.2% of my visitors. These percentages differ due to selection effects: Marginal Revolution caters to educated readers who share the authors’ interests. Indeed, all my visitors saw the word “metadata” and thought “I want to know more.” I doubt the typical American would react similarly!

Loan repayments

Tue, 18 Apr 2023 00:00:00 +0000

Suppose I take out a loan. It gains interest at rate $r$, compounded continuously. I repay the loan by making constant, continuous payments until time $T$. How does the repaid share of my loan vary over time? And how does it depend on $r$ and $T$?

Let $P_0$ be the initial value of my loan: the “principal.” Then my continuous payments $C$ must satisfy $$\begin{align} P_0 &= \int_0^TCe^{-r\tau}\,\mathrm{d}\tau \\ &= \frac{C}{r}\left(1-e^{-rT}\right) \end{align}$$ and so the value of my remaining payments at time $t\in[0,T]$ equals $$\begin{align} P_t &\equiv \int_t^TCe^{-r(\tau-t)}\,\mathrm{d}\tau \\ &= \frac{C}{r}\left(1-e^{-r(T-t)}\right) \\ &= P_0\left(\frac{e^{-rt}-e^{-rT}}{1-e^{-rT}}\right)e^{rt}. \end{align}$$ If I don’t make any payments before time $t$ then the principal grows to $P_0e^{rt}$. Therefore, the value of my repayments up to time $t$ equals the difference $(P_0e^{rt}-P_t)$.

Now let $x\equiv t/T\in[0,1]$ be the share of payments I’ve made up to time $t$. The chart below plots the corresponding share $$\frac{P_0e^{rt}-P_t}{P_0e^{rt}}\bigg\rvert_{t=xT}=\frac{1-e^{-xrT}}{1-e^{-rT}}$$ of the loan that I’ve repaid. This share grows with $x$ at a decreasing rate. Intuitively, my repayment “slows down” because the interest on the principal and payments grows larger than the payments themselves. This slowing effect is stronger when the interest rate $r$ is larger and time horizon $T$ is longer.
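The repaid share is simple to evaluate directly. Below is a minimal sketch; the function name `repaid_share` and the example values of $r$ and $T$ are hypothetical choices for illustration.

```r
# Repaid share of the loan after making share x of the payments,
# given interest rate r and time horizon `horizon` (T in the text)
repaid_share = function(x, r, horizon) {
  (1 - exp(-x * r * horizon)) / (1 - exp(-r * horizon))
}

x = seq(0, 1, by = 0.25)
repaid_share(x, r = 0.05, horizon = 30)
repaid_share(x, r = 0.10, horizon = 30)
```

At the halfway point $x=0.5$ the share simplifies to $1/(1+e^{-rT/2})$, which exceeds one half and rises with $rT$. This is the sense in which a larger interest rate or longer horizon makes the repaid share more concave in $x$.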

Marginal Revolution metadata

Fri, 07 Apr 2023 00:00:00 +0000

Today I released the R package MRposts. It contains data on Marginal Revolution blog posts: their authors, titles, publication times, categories, and comment counts. I describe these data below. They cover all 34,189 posts published between August 2003 and March 2023.

Authors

Marginal Revolution is run by Tyler Cowen and Alex Tabarrok. They wrote 86% and 13% of the posts in MRposts. The rest were written by several guest bloggers. I count posts by author in the table below.

Author	Posts
Tyler Cowen	29,373
Alex Tabarrok	4,564
Fabio Rojas	63
Justin Wolfers	24
Steven Landsburg	19
Robin Hanson	17
Tim Harford	15
Craig Newmark	14
Ed Lopez	12
Bryan Caplan	11
Eric Helland	11
Angus Grier	10
12 others, each with fewer than ten posts	56

Tyler wrote the first Marginal Revolution post on August 21, 2003, and posted every day thereafter. His monthly output grew during the late 2000s and early 2010s. Alex’s monthly output was lower but relatively constant:

Titles

My next chart compares the words used in Tyler and Alex’s posts’ titles. Their posts often contained “assorted links” or “facts of the day,” or explained how there are “markets in everything.” Tyler also had many posts on “sentences to ponder” and “what [he’d] been reading.”

The longest title contained 21 words (“The Icelandic Stock Exchange fell by 76% in early trading as it re-opened after closing for two days last week.”). Tyler’s titles had a median of five words while Alex’s had a median of four.¹

Publication times

Marginal Revolution posts tended to appear in early mornings and afternoons. Tyler posted at all hours of the day, albeit seldom at night.² Alex’s posting schedule was more regular. His posts usually appeared between 7am and 9am:

Comments

The median post in MRposts had 26 comments. Tyler’s median post had 27 comments while Alex’s had 25. About 11% of posts had more than 100 comments, while 26% had fewer than ten and 11% had none. I list the most-commented-on posts in the table below.

Post	Year	Comments
Sarah Palin	2008	947
The Case for Getting Rid of Borders-Completely	2015	711
If you wish to debate SCOTUS on Roe v. Wade…	2022	577
Classical liberalism vs. The New Right	2022	567
What the hell is going on?	2016	562
Upward Mobility and Discrimination: Asians and African Americans	2016	548
CWT bleg	2022	534
Ferguson and the Modern Debtor’s Prison	2014	525
Trump winning: who rises and falls in status?	2016	520
What is neo-reaction?	2016	519

Three of the ten most-commented-on posts were published in the last year. Indeed, the mean number of comments per post grew over time:

Post engagement grew slowly during the late 2000s. It increased sharply in early 2011, when Tyler was listed among the most influential economists.

Content

I could update MRposts to include data on posts’ content. This would allow users to mine the text of Tyler and Alex’s posts. For example, many commenters have decried Tyler’s recent focus on ChatGPT and other large language models. I document that focus in the chart below. It shows the share of Tyler’s posts containing the string “chat”, “GPT”, “LLM”, or “language model” in each of the past 24 months. The majority of his posts contained none of those strings!

Mark Nagelberg compares the mean lengths of all authors’ titles. ↩︎
Hamilton Noel looks closer at Tyler’s blogging habits. ↩︎

Five years of blogging

Wed, 01 Mar 2023 00:00:00 +0000

Today marks five years since my first blog post. This post is my 100th. It summarizes the words I’ve used and traffic I’ve received.

Words used

My first 99 posts contained more than 56 thousand words:

I wrote 11 posts in March and April 2020, when the pandemic forced me to “work” from home. I’ve written 56 posts—about once every 16 days—since starting my PhD in September 2020.

My longest post had 2,128 words and my shortest had 123. The most common (non-stop) word was “network,” used 269 times across 34 distinct posts. The chart below shows the six most common words overall and among posts on my most common topics. It includes “datum” rather than “data” because I lemmatize words before counting them.

So far I’ve written 34 posts on economics and 31 on networks. Most posts had multiple topics. The most commonly paired topics were networks and research (eight posts), research and software (six posts), and networks and statistics (six posts).

Traffic

Since March 2020 I’ve used GoatCounter to count page views and visitors. I had lots in late 2022, when I shared my reflections on graduate school and people started applying to economics PhD programs:

My most popular three posts benefit from being in the top few Google search results. They account for about half of my (non-bot) page views:

Post	Views
Applying to economics PhD programs	4,174
Accessing the Strava API with R	1,831
Greedy Pig strategies	1,189
Reflections on grad school: Years 1 and 2	411
Stanford	387
DeGroot learning in social networks	341
Ordinary and total least squares	336
Female representation and collaboration at the NBER	318
What’s it like living in America?	296
How central is Grand Central Terminal?	280
Other	4,747
Total	14,310

Most of my visitors were from the USA (usually California or Massachusetts):

Country	Visitors
United States	6,216
United Kingdom	770
Australia	642
New Zealand	561
India	294
Other/unknown	3,856
Total	12,339

Selection bias and fixed effects

Wed, 25 Jan 2023 00:00:00 +0000

Economists often use fixed effects to correct for selection bias. Intuitively, these effects “partial out” the reasons why our data include some observations but not others. But this intuition relies on the selection criteria being linear functions of the dependent variable.

For example, suppose I have panel data on 100 individuals $i$ at ten dates $t$. These data include pairs $(y_{it},x_{it})$ generated by the process $$y_{it}=x_{it}+u_i+\epsilon_{it},$$ where $u_i$ is a fixed effect and $\epsilon_{it}$ is an error term. The $x_{it}$, $u_i$, and $\epsilon_{it}$ are iid normal with zero mean and unit variance. They all vary across individuals. The $x_{it}$ and $\epsilon_{it}$ also vary over time, but the $u_i$ do not.

The chart below plots $y_{it}$ against $x_{it}$ overall and within two subsets of my data:

Observations for the 50 individuals $i$ whose outcomes $y_{it}$ have the largest mean;
Observations for the 50 individuals $i$ whose squared outcomes $y_{it}^2$ have the largest mean.

It also shows the OLS regression line fitted to my data and its subsets. The intercept and slope of this line depend on the selection criterion. Individuals with larger mean outcomes tend to have larger fixed effects and narrower error distributions. This leads OLS to estimate a higher intercept but shallower slope than in the full data. In contrast, individuals with larger mean squared outcomes have similar fixed effects to other individuals but wider error distributions. This leads OLS to estimate the same intercept but steeper slope than in the full data.

What if I include fixed effects in my regression? The box plots below summarize the slopes I estimate when I simulate my data 100 times and apply my selection criteria. Including fixed effects removes the bias from selecting on mean outcomes. This is because the fixed effects are the variables I select on. Partialing them out removes the selection bias by definition. In contrast, including fixed effects does not remove the bias from selecting on mean squared outcomes. This is because the fixed effects are uncorrelated with the variables I select on. Partialing them out removes noise but not bias.
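The simulation described above can be sketched in base R as follows. This is an illustrative reconstruction, not the code behind the charts: the helper names (`simulate`, `slope`, `select_top`) are hypothetical, and I assume the fixed-effects estimate comes from demeaning $x_{it}$ and $y_{it}$ within individuals.

```r
set.seed(0)
n = 100  # Individuals
m = 10   # Dates

# Simulate y = x + u + e with iid standard normal x, u, and e
simulate = function() {
  i = rep(1:n, each = m)
  x = rnorm(n * m)
  u = rnorm(n)[i]  # Fixed effects are constant within individuals
  y = x + u + rnorm(n * m)
  data.frame(i, x, y)
}

# OLS slope with or without individual fixed effects
slope = function(df, fixed_effects) {
  if (fixed_effects) {
    # Demean within individuals (the "within" transformation)
    df$x = df$x - ave(df$x, df$i)
    df$y = df$y - ave(df$y, df$i)
  }
  unname(coef(lm(y ~ x, data = df))[2])
}

# Keep the n / 2 individuals with the largest mean of f(y)
select_top = function(df, f) {
  means = sapply(split(f(df$y), df$i), mean)
  keep = as.integer(names(means)[order(means, decreasing = TRUE)[1:(n / 2)]])
  df[df$i %in% keep, ]
}

# Repeat the simulation 100 times and store the estimated slopes
estimates = replicate(100, {
  df = simulate()
  by_mean = select_top(df, identity)
  by_mean_sq = select_top(df, function(y) y ^ 2)
  c(full = slope(df, FALSE),
    mean_ols = slope(by_mean, FALSE),
    mean_fe = slope(by_mean, TRUE),
    mean_sq_ols = slope(by_mean_sq, FALSE),
    mean_sq_fe = slope(by_mean_sq, TRUE))
})

round(rowMeans(estimates), 2)
```

If the argument above holds, the fixed-effects slope should be close to the true value of one when I select on mean outcomes, but should stay above one when I select on mean squared outcomes.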

Learning and persuasion

Sun, 08 Jan 2023 00:00:00 +0000

People talk for many reasons. One is to learn: to collect information that helps us make better choices. Another is to persuade: to convince others to make choices we think are best.

Meng (2021) shows how wanting to learn and persuade can lead to homophily. He presents a model in which people choose conversation partners before taking actions. Everyone wants these actions to match an unknown binary state. But people have different prior beliefs about the state. They update their beliefs after receiving (i) a noisy signal from nature and (ii) a message from their partner. Priors are public, signals are private, and messages are designed to be persuasive.

Meng studies the matchings that arise in this setting. A matching is “stable” if it has no “blocking pairs:” people who want to be partners but aren’t. It is “assortative” if all partners are “like-minded:” their priors are both close to zero or both close to one.

Every assortative matching is stable. To see why, suppose Alice and Bob are not like-minded. Alice will only partner with Bob if it’s easier to persuade him than be persuaded by him. But Bob will only partner with Alice if it’s easier to persuade her than be persuaded by her. These two conditions can’t hold at the same time, so Alice and Bob can’t form a blocking pair.

Likewise, in Meng’s model, every stable matching is assortative. This is especially true if people care more about learning than persuading. Like-minded partners send truthful messages because they don’t need to persuade each other. But non-like-minded partners send distorted messages hoping to persuade each other. These distortions make at least one person worse off than they would be if they had a like-minded partner who told them the truth.

Meng then considers a social planner who can choose matchings but not messages. This planner wants to maximize the sum of everyone’s expected payoffs under their priors. They choose an assortative matching only when the distribution of priors is symmetric. Otherwise, they choose a matching in which people with extreme priors have non-like-minded partners with moderate priors. This is because extremists gain more than moderates lose. It suggests that sorting is socially bad.

Finally, Meng extends his model to allow stable matchings that are not assortative. This can happen when signals or actions are not binary. He leaves open the extension to settings in which people have more than one partner. Jann and Schottmüller (2021) consider a version of that setting. But they reach a different normative conclusion than Meng: sorting can be good because it stops people from sending distorted messages.

Protecting Planet Xiddler

Sat, 07 Jan 2023 00:00:00 +0000

This week’s Riddler Classic asks us to fend off an alien invasion:

The astronomers of Planet Xiddler are back in action! Unfortunately, this time they have used their telescopes to spot an armada of hostile alien warships on a direct course for Xiddler. The armada will be arriving in exactly 100 days. (Recall that, like Earth, there are 24 hours in a Xiddler day.)

Fortunately, Xiddler’s engineers have just completed construction of the planet’s first assembler, which is capable of producing any object. An assembler can be used to build a space fighter to defend the planet, which takes one hour to produce. An assembler can also be used to build another assembler (which, in turn, can build other space fighters or assemblers). However, building an assembler is more time-consuming, requiring six whole days. Also, you cannot use multiple assemblers to build one space fighter or assembler in a shorter period of time.

What is the greatest number of space fighters the Xiddlerian fleet can have when the alien armada arrives?

We can solve this problem via dynamic programming. First, let $N(t)$ be the maximum number of fighters an assembler can make in $t$ days. The aliens invade in 100 days, so our goal is to compute $N(100)$.

An assembler can either

spend a day building fighters, or
spend six days duplicating itself.

The first option gives us 24 fighters plus however many an assembler can make in $(t-1)$ days. The second option gives us however many two assemblers can make in $(t-6)$ days. Thus $N(t)$ satisfies the Bellman equation $$N(t)=\max\{24+N(t-1),2N(t-6)\},$$ where $N(t)=0$ for all $t\le0$. Solving this equation recursively gives $$N(100)=7,\!864,\!320.$$ The chart below shows how $N(t)$ grows with $t$:

Fighter production begins with a 90-day “duplicate” phase in which the number of assemblers doubles 15 times: once every six days. This gives us $$2^{15}=32,\!768$$ assemblers to use during a 10-day “build” phase in which each builds 24 fighters per day, giving us $$2^{15}\times10\times24=7,\!864,\!320$$ fighters in total.

The length of the build phase depends on how quickly an assembler can build fighters or duplicate itself. For example, if it takes only three days to duplicate then the build phase lasts only four days. This is because the opportunity cost of duplicating (not building now) falls relative to the benefit of duplicating (building twice as fast later). The opposite is true if it takes more than six days to duplicate or if assemblers can build more than 24 fighters per day.
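The Bellman equation above can be solved with a short loop. Here is a minimal sketch in base R; the function name `max_fighters` and its arguments are hypothetical, and it works in whole days as the equation does.

```r
# Maximum number of fighters one assembler can make in `horizon` days,
# found by iterating on N(t) = max{b + N(t - 1), 2 N(t - d)}, where
# b = fighters_per_day and d = duplicate_days
max_fighters = function(horizon = 100, duplicate_days = 6, fighters_per_day = 24) {
  N = numeric(horizon + 1)  # N[t + 1] stores N(t), with N(0) = 0
  for (t in 1:horizon) {
    build = fighters_per_day + N[t]
    # Duplicating is only feasible if at least duplicate_days remain
    duplicate = if (t >= duplicate_days) 2 * N[t - duplicate_days + 1] else 0
    N[t + 1] = max(build, duplicate)
  }
  N[horizon + 1]
}

max_fighters()                    # 7864320, matching the value above
max_fighters(duplicate_days = 3)  # Faster duplication
```

The vector `N` is offset by one because R indexes from one. Storing every $N(t)$ means each value is computed once, so the loop runs in time proportional to the horizon.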

Learning from opinions

Fri, 06 Jan 2023 00:00:00 +0000

We often use others’ opinions to guide our choices. For example, we use movie and Yelp reviews to decide what to watch and where to eat. But opinions can be hard to interpret because they depend on objective facts (e.g., movie/food quality) and subjective perspectives (e.g., reviewers’ tastes). So, when seeking opinions, we face a trade-off between

“well-informed” sources who know a lot and
“well-understood” sources with known perspectives.

Sethi and Yildiz (2016) study this trade-off and its consequences. They consider a group of people who receive noisy signals about a sequence of states. These people form posterior beliefs (“opinions”) about each state based on their signal precisions (“expertise”) and prior beliefs (“perspectives”). Expertise is public, and varies across people and states. Perspectives are private, and vary across people but not states. Everyone observes their own opinion and the opinion of a chosen “target.” They always choose the target whose opinion reveals the most information about the current state.

Initially, no-one knows anyone else’s perspective, so everyone chooses the target with the most expertise (i.e., the most precise signal). But people learn others’ perspectives over time by comparing the signals they receive to the opinions they observe. Eventually, everyone attaches to a set of “long-run experts” and never considers opinions outside that set, even if those opinions are better informed.

This set of long-run experts can vary across people. To see why, suppose Alice and Bob observe Charlie’s opinion about a given state. Alice and Charlie receive precise signals about that state, but Bob doesn’t. Alice knows that her opinion can only differ from Charlie’s if they have different perspectives. In contrast, Bob can’t tell if his opinion differs from Charlie’s because they have different perspectives or because Bob’s signal is imprecise. So Alice learns more about Charlie’s perspective than Bob does. She’s more likely to include Charlie in her set of long-run experts.

Sethi and Yildiz’s model explains why people gravitate to like-minded opinion sources. We learn more about the perspectives of people who know about the same things, making us more likely to attach to them. This contrasts with the trust- and persuasion-based explanations discussed in previous posts. It leads people to ask experts for opinions on topics beyond their expertise. It may also lead people to befriend fellow ideologues who see the world the same (possibly incorrect) way.

stravadata demo

Sun, 01 Jan 2023 00:00:00 +0000

stravadata is an R package I use to organize and analyze my Strava activity data. This post offers some example analyses.

My examples use data on my running activities from the last five years:

library(dplyr)
library(lubridate)
library(stravadata)

runs = activities %>%
  filter(type == 'Run') %>%
  mutate(year = year(start_time),
         date = date(start_time)) %>%
  filter(year %in% 2018:2022)

Computing annual totals

runs contains activity-level features like distance traveled and time spent moving. I sum these features by year, then use knitr::kable to display these sums in a table:

library(knitr)

runs %>%
  group_by(Year = year) %>%
  summarise(Runs = n(),
            `Distance (km)` = sum(distance) / 1e3,
            `Time (hours)` = sum(time_moving) / 3600) %>%
  mutate_at(3:4, ~format(round(.), big.mark = ',')) %>%
  kable(align = 'crrr')

Year	Runs	Distance (km)	Time (hours)
2018	68	544	52
2019	152	1,085	92
2020	224	2,026	172
2021	207	2,149	173
2022	145	1,517	120

Making activity heat maps

I record my runs with a watch that tracks my GPS coordinates. stravadata stores these coordinates in streams. For example, here’s the course for last year’s Moonlight Run in Palo Alto:

library(ggplot2)

p = runs %>%
  filter(name == 'Moonlight Run' & year == 2022) %>%
  select(id) %>%
  left_join(streams, by = 'id') %>%
  ggplot(aes(lon, lat)) +
  geom_path()

plot_nicely(p)  # Add text and formatting

Combining the GPS coordinates from many runs yields a local map. For example, suppose I want to map my runs near Stanford. I first make a table of GPS paths near a local landmark:

coords = c(-122.16, 37.44)  # Trader Joe's
tol = 0.08

stanford_paths = streams %>%
  semi_join(runs, by = 'id') %>%
  mutate(step = row_number()) %>%
  filter(sqrt((lon - coords[1]) ^ 2 + (lat - coords[2]) ^ 2) < tol) %>%
  filter(lon != lag(lon) | lat != lag(lat)) %>%  # Remove pauses
  mutate(new_path = row_number() == 1 | id != lag(id) | step != lag(step) + 1) %>%
  mutate(path = cumsum(new_path)) %>%
  select(path, lat, lon)

I increment path every time I start a new run, unpause a previous run, or re-enter the area defined by coords and tol. I use path as a grouping variable so that ggplot2::ggplot knows to draw each path separately. I then use the alpha argument of ggplot2::geom_path to create a “heat map” of paths I run most often:

p = stanford_paths %>%
  ggplot(aes(lon, lat, group = path)) +
  geom_path(alpha = 0.1)

plot_nicely(p)

Counting efforts

best_efforts stores my fastest times running a range of distances (that Strava calls “efforts”) within each activity:

head(best_efforts)

## # A tibble: 6 × 4
##           id effort   start_index end_index
##        <dbl> <chr>          <int>     <int>
## 1 1253004287 1 mile            15       447
## 2 1253004287 1/2 mile          11       232
## 3 1253004287 1k                12       284
## 4 1253004287 2 mile            11       876
## 5 1253004287 400m              11       120
## 6 1253004287 5k                11      1342

The id column stores activity IDs and the effort column stores effort descriptions. I focus on 5k, 10k, and half marathon efforts:

focal_efforts = c('5k', '10k', 'Half-Marathon')

efforts = runs %>%
  left_join(best_efforts, by = 'id') %>%
  filter(effort %in% focal_efforts) %>%
  mutate(effort = factor(effort, focal_efforts)) %>%
  select(year, date, id, effort, start_index, end_index)

efforts inherits the year variable from runs. I use this variable to count efforts within each year. I then use tidyr::spread and knitr::kable to display these counts in a table:

library(tidyr)

efforts %>%
  count(Year = year, effort) %>%
  spread(effort, n, fill = 0) %>%
  kable(align = 'c')

Year	5k	10k	Half-Marathon
2018	64	24	0
2019	136	34	2
2020	191	88	21
2021	200	90	25
2022	131	85	9

Making training calendars

efforts also inherits the date variable from runs. I use this variable to create GitHub-esque training calendars. For example, here’s my running calendar for 2021:

p = efforts %>%
  filter(year == 2021) %>%
  group_by(date) %>%
  slice_max(effort) %>%
  distinct(effort) %>%  # I ran twice on some days
  mutate(Week = floor_date(date, 'weeks', week_start = 1),
         Weekday = wday(date, label = T, week_start = 1)) %>%
  ggplot(aes(Week, Weekday)) +
  geom_tile(aes(alpha = effort), col = 'white', linewidth = 0.5)

plot_nicely(p)

I use lubridate::floor_date to identify weeks and lubridate::wday to identify weekdays. The col and linewidth arguments of ggplot2::geom_tile add space between tiles.

Tracking personal records

I combine runs, streams, and efforts to track my record running paces over time. I follow a three-step process:

First, I compute the mean pace for each effort. I do this using the start_index and end_index columns that efforts inherits from best_efforts. These columns tell me where each effort occurs in the corresponding activity’s stream:

effort_paces = streams %>%
  filter(id %in% runs$id) %>%
  # Create indices
  group_by(id) %>%
  mutate(index = row_number()) %>%
  ungroup() %>%
  # Extract stream segment for each effort
  inner_join(efforts, by = 'id') %>%
  filter(index >= start_index & index <= end_index) %>%
  # Compute mean paces
  group_by(id, date, effort) %>%
  summarise(distance = max(distance) - min(distance),
            time = max(time) - min(time)) %>%
  ungroup() %>%
  mutate(pace = (time / 60) / (distance / 1e3))

head(effort_paces)

## # A tibble: 6 × 6
##           id date       effort distance  time  pace
##        <dbl> <date>     <fct>     <dbl> <dbl> <dbl>
## 1 1335437333 2018-01-01 5k        5002.  1442  4.81
## 2 1338123783 2018-01-03 5k        5000.  1605  5.35
## 3 1344338907 2018-01-07 5k        5000.  1455  4.85
## 4 1347622521 2018-01-09 5k        5000   1493  4.98
## 5 1353889714 2018-01-13 5k        5001.  1622  5.41
## 6 1353889714 2018-01-13 10k      10001.  3380  5.63

The values in the distance column differ slightly from the descriptions in the effort column. This is because the stream segment doesn’t always cover the described distance exactly. But the multiplicative errors in distance and time should be equal on average, making pace an unbiased estimate of my true mean pace. I measure this pace in minutes per kilometer.

Next, I extract my record paces by deleting efforts slower than my previous best:

record_paces = effort_paces %>%
  group_by(effort) %>%
  arrange(date) %>%
  filter(pace == cummin(pace)) %>%
  ungroup()

Finally, I “fill in the gaps” by adding days on which I don’t set a new record. I do this using tidyr::crossing and tidyr::fill:

date_range = seq(date('2018-01-01'), date('2022-12-31'), by = 'day')

record_paces_filled = crossing(date = date_range, effort = focal_efforts) %>%
  left_join(record_paces, by = c('date', 'effort')) %>%
  group_by(effort) %>%
  fill(pace) %>%
  filter(!is.na(pace))

record_paces and record_paces_filled differ in that the latter includes date-effort pairs with no new records. This makes record_paces_filled produce horizontal lines when I plot its data:

p = record_paces_filled %>%
  ggplot(aes(date, pace, group = effort)) +
  geom_line()

plot_nicely(p)

Social networks in rural India

Sat, 24 Dec 2022 00:00:00 +0000

IndianVillages is a new R package containing data on social networks in rural India. I derived these data from Banerjee et al.’s (2013) surveys of households across 75 Karnatakan villages. This post describes the derived data and the networks they define. I also show that the networks are assortatively mixed with respect to caste.

Data description

IndianVillages provides two tables. The first, households, links each household to its village and caste:

library(dplyr)
library(IndianVillages)

head(households)

## # A tibble: 6 × 3
##    hhid village caste
##   <dbl>   <dbl> <chr>
## 1  1001       1 <NA> 
## 2  1002       1 <NA> 
## 3  1003       1 <NA> 
## 4  1004       1 <NA> 
## 5  1005       1 <NA> 
## 6  1006       1 <NA>

The hhid and village columns store household and village IDs. The caste column stores caste memberships:

count(households, caste, sort = TRUE)

## # A tibble: 6 × 2
##   caste               n
##   <chr>           <int>
## 1 OBC              5517
## 2 <NA>             4455
## 3 Scheduled Caste  2584
## 4 General          1371
## 5 Scheduled Tribe   618
## 6 Minority          359

Some caste values are missing because the surveys were changed during their collection. About 53% of the households with known castes are in the Other Backward Class (“OBC”). This exceeds the (disputed) share of OBCs in India’s general population during the survey period.

The second table, household_relationships, contains information on inter-household relationships:

head(household_relationships)

## # A tibble: 6 × 4
##   hhid.x hhid.y village type                        
##    <dbl>  <dbl>   <dbl> <fct>                       
## 1   1001   1002       1 Help with a decision        
## 2   1001   1002       1 Borrow kerosene or rice from
## 3   1001   1002       1 Lend kerosene or rice to    
## 4   1001   1002       1 Are related to              
## 5   1001   1002       1 Invite to one's home        
## 6   1001   1002       1 Visit in another's home

The hhid.x and hhid.y columns store ego and alter household IDs. The type column stores relationship types:

count(household_relationships, type, sort = TRUE)

## # A tibble: 12 × 2
##    type                             n
##    <fct>                        <int>
##  1 Visit in another's home      33629
##  2 Invite to one's home         32652
##  3 Engage socially with         30939
##  4 Borrow money from            25514
##  5 Lend kerosene or rice to     23993
##  6 Borrow kerosene or rice from 23743
##  7 Lend money to                23558
##  8 Obtain medical advice from   22310
##  9 Help with a decision         17228
## 10 Are related to               16037
## 11 Give advice to               15613
## 12 Go to temple with             2700

These types correspond to questions asked in Banerjee et al.’s surveys.

Inter-household networks

We can use households and household_relationships to define social networks among the households in each village. First, use the graph_from_data_frame function from igraph to create the network among all households:

library(igraph)

net = graph_from_data_frame(
  distinct(household_relationships, hhid.x, hhid.y),
  directed = FALSE,
  vertices = households
)

net contains 66,862 edges: one for each pair of households with at least one social relationship. There are no between-village relationships in the data, so we can partition net into village-specific networks without deleting any edges:

library(purrr)

villages = sort(unique(households$village))

village_nets = map(villages, ~subgraph(net, V(net)$village == .))

sum(map_dbl(village_nets, gsize))  # Same as gsize(net)

## [1] 66862

The networks in village_nets are too large to describe visually. Instead, let’s compute some of their properties:

village_nets_properties = map_df(village_nets, ~{
  comp = components(.)
  giant = subgraph(., comp$membership == which.max(comp$csize))
  tibble(
    Households = gorder(.),
    `Mean degree` = mean(degree(.)),
    `% of households in giant` = 100 * gorder(giant) / gorder(.),
    `Mean distance in giant` = mean_distance(giant)
  )
})

I summarize these properties in the table below. The number of households in each village ranges from 77 to 356. The mean degree of the households in each village ranges from 6.11 to 13.44. Most households are in the giant component for their village, and are connected to others in that component via paths of length two or three.

Property	Mean	Std. dev.	Min.	Median	Max.
Households	198.72	59.29	77.00	190.00	356.00
Mean degree	8.90	1.61	6.11	8.72	13.44
% of households in giant	95.10	2.71	84.62	95.54	99.42
Mean distance in giant	2.75	0.21	2.30	2.72	3.32

Inter-caste mixing

We can use net to study the extent of assortative mixing with respect to caste membership. First, delete the 4,455 households with missing caste values:

subnet = subgraph(net, !is.na(V(net)$caste))

subnet contains 10,449 households with a mean degree of 9.08. This is similar to the mean degree in net. The two networks also have similar mean distances between connected households: 2.85 in subnet, versus 2.81 in net.

Next, compute subnet's mixing matrix:

library(bldr)  # https://github.com/bldavies/bldr

mix_mat = get_mixing_matrix(subnet, 'caste')

I define get_mixing_matrix here. It returns a matrix in which rows and columns correspond to castes, and entries equal the share of edges joining households in each caste pair. Multiplying these entries by the sum of degrees—which, by the degree sum formula, equals twice the number of edges—yields a table of inter-caste edge counts:
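
The sketch below shows one way such a matrix could be built from an edge list in base R. It is not the bldr implementation: the helper name mixing_matrix_sketch, its arguments, and the toy data are all hypothetical. It assumes each undirected edge is counted once in each direction, which is what makes the entries sum to one after dividing by twice the number of edges:

```r
# Hypothetical helper (NOT the bldr implementation)
# edges: two-column matrix of node indices, one row per undirected edge
# groups: vector giving the group label of each node
mixing_matrix_sketch = function(edges, groups) {
  g1 = groups[edges[, 1]]
  g2 = groups[edges[, 2]]
  lev = sort(unique(groups))
  # Count each edge in both directions so the matrix is symmetric
  counts = table(
    factor(c(g1, g2), levels = lev),
    factor(c(g2, g1), levels = lev)
  )
  unclass(counts) / sum(counts)
}

# Toy path network 1-2-3-4 with two groups
toy_edges = rbind(c(1, 2), c(2, 3), c(3, 4))
toy_groups = c('a', 'a', 'b', 'b')

toy_mix = mixing_matrix_sketch(toy_edges, toy_groups)
toy_mix * (2 * nrow(toy_edges))  # Back to counts
```

Under this convention, an off-diagonal count equals the number of edges between two groups, while a diagonal count equals twice the number of within-group edges.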

mix_mat * (2 * gsize(subnet))

##             
##              General Minority   OBC Sch. Caste Sch. Tribe
##   General       8680       79  3118        932        521
##   Minority        79     1860   381        156         84
##   OBC           3118      381 40058       4325       2241
##   Sch. Caste     932      156  4325      16074        910
##   Sch. Tribe     521       84  2241        910       2722

For example, subnet contains 3,118 edges between households in general castes and households in OBC castes.

We can measure the extent of assortative mixing by comparing mix_mat to the matrix we’d expect if edges were independent of caste. This matrix equals the outer product of the row and column sums of mix_mat:

mix_mat_indep = rowSums(mix_mat) %*% t(colSums(mix_mat))

Comparing the traces of mix_mat and mix_mat_indep allows us to measure mixing overall:

tr = function(m) sum(diag(m))

c(tr(mix_mat), tr(mix_mat_indep))

## [1] 0.7313254 0.3598672

So subnet contains about twice as many within-caste edges as we’d expect if edges were independent of caste.
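
These traces can be reproduced without the package by starting from the table of inter-caste counts printed above. The sketch below types those counts in by hand. The names counts, shares, and shares_indep are mine, and `%o%` is base R's outer product operator:

```r
castes = c('General', 'Minority', 'OBC', 'Sch. Caste', 'Sch. Tribe')

# Counts copied from the table printed above
counts = matrix(c(
  8680,   79,  3118,   932,  521,
    79, 1860,   381,   156,   84,
  3118,  381, 40058,  4325, 2241,
   932,  156,  4325, 16074,  910,
   521,   84,  2241,   910, 2722
), nrow = 5, byrow = TRUE, dimnames = list(castes, castes))

# Shares of edge ends in each caste pair
shares = counts / sum(counts)

# Shares expected if edges were independent of caste
shares_indep = rowSums(shares) %o% colSums(shares)

tr = function(m) sum(diag(m))

c(tr(shares), tr(shares_indep))
```

Dividing shares by shares_indep element-wise gives the ratios used to judge which caste pairs are over-represented.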

We can also compare mix_mat and mix_mat_indep element-wise to assess which inter-caste relationships are most over-represented:

round(mix_mat / mix_mat_indep, 2)

##             
##              General Minority   OBC Sch. Caste Sch. Tribe
##   General       4.64     0.22  0.44       0.30       0.57
##   Minority      0.22    26.93  0.28       0.26       0.48
##   OBC           0.44     0.28  1.51       0.37       0.65
##   Sch. Caste    0.30     0.26  0.37       3.04       0.60
##   Sch. Tribe    0.57     0.48  0.65       0.60       6.15

So, for example, there are about 51% more OBC-OBC edges than we’d expect if edges were independent of caste, but fewer than half as many general-OBC edges.

Optimal pacing with random energy costs

Mon, 12 Dec 2022 00:00:00 +0000

My previous post discussed how I should pace myself in a running race. I allowed the cost of running fast to vary during the race, and I showed how the costs I faced determined my optimal speeds and finish time. I assumed that I knew the costs in advance—for example, that I knew which parts of the race had the steepest hills and the strongest headwinds.

But sometimes I don’t know the costs of running fast in advance. For example, I might not know the terrain or how the weather will turn out. This uncertainty prevents me from committing to a pacing strategy before the race begins. Instead, I must adapt my strategy to the costs I encounter during the race.

This post discusses my optimal pacing strategy when I face random energy costs. I assume these costs follow a Markov chain. This allows me to solve for my optimal speeds and finish times numerically. I show that my ex ante expected time falls when my costs become more variable and less persistent. I also show that my realized time depends on the number and timing of high cost realizations.

Allowing for random costs

The setup is similar to my previous post: I have $k>0$ units of energy to allocate across $N$ laps $n\in\{1,2,\ldots,N\}$. It costs $c_ns_n$ units to run at speed $s_n$ in lap $n$, where the per-unit cost $c_n>0$ varies with $n$. I want to minimize my total time $$\DeclareMathOperator{\E}{E} \DeclareMathOperator{\Var}{Var} \renewcommand{\epsilon}{\varepsilon} T\equiv\sum_{n=1}^N\frac{1}{s_n}$$ subject to the dynamic energy constraint $$k_{n+1}=k_n-c_ns_n,$$ boundary conditions $k_1=k$ and $k_{N+1}=0$, and non-negativity constraint $s_n\ge0$. But now the costs $c_n$ are random. So I can’t choose the entire speed sequence $(s_n)_{n=1}^N$ before the race begins. Instead, I choose each term $s_n$ after observing the cost history $c_1,c_2,\ldots,c_n$.

For simplicity, I assume $c_n\in\{1-\epsilon,1+\epsilon\}$ for some $\epsilon\in[0,1)$, and that $\Pr(c_1=1+\epsilon)=0.5$ and $\Pr(c_{n+1}=c_n)=p$ for each $n$. The probability $p$ controls costs’ persistence: if $p=1$ then they never change, whereas if $p=0$ then they change every lap. The cost in lap $n+1\le N$ has conditional mean $$\begin{align} \E_n[c_{n+1}] &\equiv \E[c_{n+1}\mid c_1,c_2,\ldots,c_n] \\ &= \begin{cases} 1+(2p-1)\epsilon & \text{if}\ c_n=1+\epsilon \\ 1-(2p-1)\epsilon & \text{if}\ c_n=1-\epsilon \end{cases} \end{align}$$ and variance $$\begin{align} \Var_n(c_{n+1}) &= \E_n[c_{n+1}^2]-\E_n[c_{n+1}]^2 \\ &= 4p(1-p)\epsilon^2, \end{align}$$ where $\E_n$ takes expectations given the first $n$ cost realizations, and where $\epsilon$ controls the variance of $c_{n+1}$. For example, if $p=0.5$ then costs are independent across laps, and so $\E_n[c_{n+1}]=1$ and $\Var_n(c_{n+1})=\epsilon^2$ for each $n$. But as $p$ moves away from 0.5, knowing the cost history $c_1,c_2,\ldots,c_n$ gives me more information about $c_{n+1}$, thereby decreasing $\Var_n(c_{n+1})$.
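
The conditional mean and variance formulas can be checked by direct enumeration, since the next cost takes only two values. The sketch below is a check under hypothetical parameter values ($p=0.8$ and $\epsilon=0.5$), and the function name cond_moments is mine:

```r
# Conditional mean and variance of the next cost, given the current cost c_n
# and persistence p. The two cost values sum to 2, so the "other" value
# is 2 - c_n.
cond_moments = function(c_n, p) {
  other = 2 - c_n
  m = p * c_n + (1 - p) * other
  v = p * c_n^2 + (1 - p) * other^2 - m^2
  c(mean = m, var = v)
}

p = 0.8    # Hypothetical persistence
eps = 0.5  # Hypothetical cost spread

high = cond_moments(1 + eps, p)
low = cond_moments(1 - eps, p)

# Compare with the closed forms in the text
rbind(
  enumerated = c(high['mean'], low['mean'], high['var']),
  closed_form = c(1 + (2 * p - 1) * eps, 1 - (2 * p - 1) * eps, 4 * p * (1 - p) * eps^2)
)
```

Both rows should agree, and the variance is the same in the high- and low-cost states.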

Solving the problem

Facing random costs forces me to solve my pacing problem sequentially: to choose each speed $s_n$ based on the observed cost history $c_1,c_2,\ldots,c_n$ and distribution of future costs $c_{n+1},c_{n+2},\ldots,c_N$. This is equivalent to choosing the amount of energy $k_{n+1}$ to carry into the next lap. I make this choice via the Bellman equation $$V_n=\min_{k_{n+1}}\left\{\frac{c_n}{k_n-k_{n+1}}+\E_n[V_{n+1}]\right\},$$ where $$V_n\equiv\sum_{m=n}^N\frac{1}{s_m}$$ is the time taken to run laps $n$ through $N$. It turns out that $$V_n=\frac{a_n}{k_n}$$ for each $n\in\{1,2,\ldots,N+1\}$, where the coefficients $a_1,a_2,\ldots,a_{N+1}$ are defined recursively by $$\begin{align} a_{N+1} &= 0 \\ a_n &= \left(\sqrt{c_n}+\sqrt{\E_n[a_{n+1}]}\right)^2. \end{align}$$ If the costs $c_n$ are non-random then the coefficients $a_n$ are also non-random, and we obtain the solution described in my previous post. But if the costs are random then so are the $a_n$, and calculating them involves a case-wise analysis that grows exponentially with $N$. Instead, I proceed numerically: by computing $a_n$ in each cost state $c_n$ given the implied distribution of future states. This is possible because the cost sequence is a Markov chain, which means that $c_n$ is a sufficient statistic for the future costs $c_{n+1},c_{n+2},\ldots,c_N$. I use this property to compute the optimal speeds $$s_n=\frac{k_n}{c_n+\sqrt{c_n\E_n[a_{n+1}]}}$$ and finish time $T$ associated with each cost sequence realization.

Ex ante expected times

Consider the case with full cost persistence: $p=1$. Then $c_n=c_1$ for each $n$, from which it follows that $T=N^2c_1/k$. But $c_1$ has mean $\E[c_1]=1$, so my ex ante expected time with $p=1$ equals $$\E[T\mid p=1]=\frac{N^2}{k}.$$ This is the finish time I expect if I know costs are constant but don’t know if they’re high or low. Conversely, if I know costs always alternate (i.e., that $p=0$), then my ex ante expected time equals $$\E[T\mid p=0]=\frac{N^2\E[\sqrt{c_1}]^2}{k}+\begin{cases} 0 & \text{if}\ N\ \text{is even} \\ \Var(\sqrt{c_1})/k & \text{if}\ N\ \text{is odd}. \end{cases}$$ The additional $\Var(\sqrt{c_1})$ term when $N$ is odd comes from the cost sequence being imbalanced: it has $(N-1)/2+1$ copies of $c_1$ but only $(N-1)/2$ copies of the other cost value. This imbalance becomes inconsequential as $N$ becomes large. Thus $$\begin{align} \E[T\mid p=1]-\E[T\mid p=0] &\approx \frac{N^2}{k}\left(1-\E[\sqrt{c_1}]^2\right) \\ &= \frac{N^2}{2k}\left(1-\sqrt{1-\epsilon^2}\right), \end{align}$$ which grows with $\epsilon$. We can understand this growth via the following chart. It shows how my expected time increases from $\E[T\mid p=0]$ to $\E[T\mid p=1]$ as $p$ increases from zero to one. Intuitively, if $p$ is large then I could face persistently high costs that slow me down. But if $p$ is small then high costs are likely to be “cancelled out” by low costs, improving my optimal time. The benefit of this canceling grows as the difference $2\epsilon$ between high and low costs grows.
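
The two limiting cases give a handy check on a numerical solution. The sketch below is one possible implementation of the backward recursion for the coefficients $a_n$, tracking one coefficient per cost state. It is not necessarily the code behind the chart, and the function name expected_time is mine. It assumes the ex ante expected time equals $\E[a_1]/k$, with both initial states equally likely:

```r
# Ex ante expected time under the two-state Markov cost chain
expected_time = function(N, k, eps, p) {
  costs = c(1 - eps, 1 + eps)
  # P[i, j] = Pr(next cost is costs[j] | current cost is costs[i])
  P = matrix(c(p, 1 - p, 1 - p, p), nrow = 2)
  a = c(0, 0)  # a_{N+1} = 0 in both states
  for (n in N:1) {
    # a_n = (sqrt(c_n) + sqrt(E_n[a_{n+1}]))^2, state by state
    a = (sqrt(costs) + sqrt(as.vector(P %*% a)))^2
  }
  mean(a) / k  # Average over the two equally likely initial states
}

N = 100; k = 100; eps = 0.5

# Full persistence, independence, and perfect alternation
sapply(c(1, 0.5, 0), function(p) expected_time(N, k, eps, p))
```

With these parameters, the $p=1$ case returns $N^2/k=100$ and the $p=0$ case returns $N^2\E[\sqrt{c_1}]^2/k\approx93.30$, matching the closed forms above.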

Realized times

The relationship between my actual and expected times depends on the realized cost sequence. I demonstrate this dependence in the table below. It shows the mean (standard deviation) of my actual and expected times across 100 simulated 100-lap races with 25, 50, and 75 high-cost laps. It also shows the “oracle” time I would obtain if I knew the cost sequence in advance. This time depends only on the number of high-cost laps, whereas my actual time depends on both the number and order of such laps. Likewise, my expected time depends only on my parameter choices: $N=100$, $k=100$, $\epsilon=0.5$, and $p=0.5$. These parameters are constant across simulated races, so my expected time is also constant.

High-cost laps	Actual time	Expected time	Oracle time
25	71.38 (0.37)	93.65	69.98
50	93.50 (0.14)	93.65	93.30
75	122.00 (0.58)	93.65	119.98

I finish faster than expected when I face 25 high-cost laps, which is unexpectedly few. Whereas I finish slower than expected when I face 75 high-cost laps, which is unexpectedly many. I always finish slower than the oracle time because I have to optimize my speeds sequentially, whereas the oracle has the option to optimize them all at once.
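
The oracle times follow from the deterministic solution in my previous post, in which the optimal time equals the squared sum of the costs' square roots divided by $k$. A minimal sketch, with a function name (oracle_time) of my choosing:

```r
# Oracle time with n_high high-cost laps out of N
oracle_time = function(n_high, N = 100, k = 100, eps = 0.5) {
  root_sum = n_high * sqrt(1 + eps) + (N - n_high) * sqrt(1 - eps)
  root_sum^2 / k
}

round(oracle_time(c(25, 50, 75)), 2)
```

The rounded values should match the last column of the table, and they don't depend on the order of the laps.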

The difference between my actual and oracle times depends on the order in which I encounter high- and low-cost laps. For example, consider the following four orderings:

50 high-cost laps followed by 50 low-cost laps;
50 low-cost laps followed by 50 high-cost laps;
25 low-cost laps followed by 25 high-cost laps, repeated twice;
10 low-cost laps followed by 10 high-cost laps, repeated five times.

I assume the same parameters $(N,k,\epsilon,p)$ as in the simulations above, so my ex ante expected time equals 93.65 in all four orderings. Likewise, my oracle time equals 93.30 in all orderings because they all contain 50 high-cost laps and 50 low-cost laps. This oracle time comes from choosing speeds before each race begins, whereas my actual time comes from choosing speeds when I start each lap. I compare these choices in the chart below.

Consider the first ordering, with 50 high-cost laps followed by 50 low-cost laps. I start at about the same speed as the oracle. But I slow down in laps two through 50 to preserve my energy, which is unexpectedly expensive. I speed up in lap 51 when energy becomes cheap, then keep speeding up as energy keeps being unexpectedly cheap. I sprint the last few laps to use the excess energy I saved from running slow earlier. Whereas the oracle never has excess energy: it always uses the optimal amount, maintaining a constant speed in each block of constant-cost laps. This makes the oracle time 3.6% faster than my actual time.

Now consider the second ordering, with 50 low-cost laps followed by 50 high-cost laps. Again, I start at about the same speed as the oracle. But now I speed up in laps two through 50 because energy is unexpectedly cheap. I slow down in lap 51 when energy becomes expensive, then keep slowing down as energy keeps being unexpectedly expensive. I bonk in the last few laps, having used too much energy by running too fast in the first half. Whereas the oracle never bonks. It finishes 4.3% faster than me.

Having shorter blocks of constant-cost laps narrows the gap between my actual and oracle times. This is because short blocks prevent me from straying too far from the oracle’s energy consumption path. Intuitively, the more frequently I encounter different costs, the more these costs meet my expectations, and so the less I respond to costs being unexpectedly expensive or cheap. Indeed, my actual time approaches the oracle time as the blocks of constant-cost laps approach one-lap lengths. This echoes my earlier discussion of ex ante expected times: I finish faster when costs are less persistent.

Optimal pacing with varying energy costs

Sun, 11 Dec 2022 00:00:00 +0000

Suppose I’m running a race. I have a fixed amount of energy to “spend” on running fast. But the energy cost of running fast varies during the race (e.g., it’s high on hills and low on flats). How should I pace myself to minimize my race time?

This post discusses my optimal pacing problem. I describe it mathematically, derive its solution in simple and general settings, and analyze these solutions’ properties. I assume energy costs are deterministic, whereas my next post allows them to be random.

The optimal pacing problem

My race consists of $N$ “laps” $n\in\{1,2,\ldots,N\}$ with equal lengths. I start with $k_1=k>0$ units of energy and finish the race with none. Running lap $n$ at speed $s_n$ costs $c_ns_n$ units of energy, where $c_n>0$ varies with $n$.

My goal is to find the speed sequence $(s_n)_{n=1}^N$ that minimizes my total time¹ $$\DeclareMathOperator{\E}{E} \DeclareMathOperator{\Var}{Var} \newcommand{\der}{\mathrm{d}} \newcommand{\parfrac}[2]{\frac{\partial #1}{\partial #2}} T\equiv\sum_{n=1}^N\frac{1}{s_n}$$ subject to the dynamic energy constraint $$k_{n+1}=k_n-c_ns_n,$$ boundary conditions $k_1=k$ and $k_{N+1}=0$, and non-negativity constraint $s_n\ge0$.

Solving the two-lap case

We can build intuition by solving the case with $N=2$. Then the dynamic constraint and boundary conditions imply $$T=\frac{c_1}{k-k_2}+\frac{c_2}{k_2},$$ where $k_2$ is the energy I choose to leave for the second lap. It satisfies the first-order condition $\partial T/\partial k_2=0$, which we can write as $$\frac{c_1}{(k-k_2)^2}=\frac{c_2}{k_2^2}.$$ The left-hand side is the marginal cost (in units of total time) of using less energy in the first lap. The right-hand side is the marginal benefit of using more energy in the second lap. The first-order condition balances this marginal cost and benefit. It determines how I should smooth my energy consumption across laps.

Rearranging the first-order condition for $k_2$ gives $$k_2=\frac{\sqrt{c_2}}{\sqrt{c_1}+\sqrt{c_2}}k.$$ So I should spend my energy proportionally to the square roots of the costs I face. For example, if $c_1=4c_2$ then I should spend two thirds of my energy on the first lap and a third on the second. This leads me to run twice as fast on the second lap and makes my total time equal $9c_2/k$. In contrast, if I spent energy proportionally to costs then I would spend four fifths on the first lap and a fifth on the second. I would run at a constant speed and my total time would equal $10c_2/k$. That strategy would be optimal if the costs were constant at their mean $5c_2/2$. But they aren’t constant: they vary by a factor of four. Square-root scaling takes advantage of this variation. It makes me run slow when it’s expensive to run fast.
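
The two-lap solution is easy to confirm numerically. The sketch below minimizes $T$ over $k_2$ using base R's optimize function, under the hypothetical normalization $c_2=1$, $c_1=4$, and $k=1$:

```r
# Hypothetical values: c1 = 4 * c2, with c2 = 1 and k = 1
c1 = 4
c2 = 1
k = 1

# Total time as a function of the energy k2 left for the second lap
total_time = function(k2) c1 / (k - k2) + c2 / k2

# Search the interior of (0, k) to avoid dividing by zero
opt = optimize(total_time, interval = c(1e-6, k - 1e-6))

c(k2 = opt$minimum, time = opt$objective)

# Leaving a fifth of my energy for the second lap equalizes my speeds
total_time(k / 5)
```

The minimizer should land near $k_2=k/3$ with a total time near $9c_2/k$, whereas the constant-speed allocation takes $10c_2/k$.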

Solving the general case

The results and intuitions from the case with $N=2$ generalize to cases with $N>2$. But those cases require more powerful solution methods. I explain two: using the Hamiltonian and using the Bellman equation. The first is faster, but the second is more intuitive and extends naturally to a setting with random costs.

Using the Hamiltonian

The Hamiltonian for my optimal pacing problem is $$H\equiv-\frac{1}{s_n}-\lambda_{n+1}c_ns_n,$$ where $\lambda_{n+1}$ is a costate that satisfies $$\lambda_{n+1}-\lambda_n=-\parfrac{H}{k_n}$$ for each $n$. But $\partial H/\partial k_n=0$ and so $\lambda_{n+1}=\lambda$ is constant. Substituting it into the first-order condition $\partial H/\partial s_n=0$ gives $$s_n=\frac{1}{\sqrt{\lambda c_n}}.$$ Now the dynamic constraint and boundary conditions imply $$\sum_{n=1}^Nc_ns_n=k,$$ from which it follows that $$\sqrt\lambda=\frac{1}{k}\sum_{n=1}^N\sqrt{c_n}$$ and therefore $$s_n=\frac{k}{\sqrt{c_n}\sum_{m=1}^N\sqrt{c_m}}$$ for each $n$. Then my total time equals $$T=\frac{1}{k}\left(\sum_{n=1}^N\sqrt{c_n}\right)^2.$$ For example, letting $N=2$ and $c_1=4c_2$ yields the optimal time $T=9c_2/k$ described above.
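
Here is a minimal sketch of the closed-form solution as an R function. The name optimal_pacing and the example costs are hypothetical. The function returns the optimal speeds and total time, and the example confirms that the speeds use exactly $k$ units of energy:

```r
# Optimal speeds and total time for a given cost sequence and energy budget
optimal_pacing = function(costs, k) {
  root_sum = sum(sqrt(costs))
  speeds = k / (sqrt(costs) * root_sum)
  list(speeds = speeds, time = sum(1 / speeds))
}

# Two-lap example with c1 = 4 * c2, normalizing c2 = 1 and k = 1
sol = optimal_pacing(costs = c(4, 1), k = 1)

sol$speeds                # Second lap is twice as fast as the first
sol$time                  # Equals 9 * c2 / k
sum(c(4, 1) * sol$speeds) # Energy used equals k
```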

Using the Bellman equation

The dynamic constraint implies $$s_n=\frac{k_n-k_{n+1}}{c_n}$$ for each $n$. Consequently, the cost sequence $(c_n)_{n=1}^N$ and “remaining energy” sequence $(k_{n+1})_{n=0}^N$ uniquely determine the speed sequence $(s_n)_{n=1}^N$. So if $$V_n\equiv\sum_{m=n}^N\frac{1}{s_m}$$ denotes the time spent running laps $n$ through $N$ when I pace myself optimally, then $V_n$ must satisfy the Bellman equation $$V_n=\min_{k_{n+1}}\left\{\frac{c_n}{k_n-k_{n+1}}+V_{n+1}\right\}.$$ This equation echoes my objective in the two-lap case. Intuitively, my optimal speeds in the $N$-lap case solve a sequence of two-lap problems, where the second “lap” is the remainder of my race.

We can solve the Bellman equation using the method of undetermined coefficients. Suppose $V_{n+1}=a_{n+1}/k_{n+1}$ for some $a_{n+1}\ge0$. Then, under optimal pacing, we have $$\begin{align} 0 &= \parfrac{}{k_{n+1}}\left(\frac{c_n}{k_n-k_{n+1}}+V_{n+1}\right) \\ &= \frac{c_n}{(k_n-k_{n+1})^2}-\frac{a_{n+1}}{k_{n+1}^2} \end{align}$$ and therefore $$k_{n+1}=\frac{\sqrt{a_{n+1}}}{\sqrt{a_{n+1}}+\sqrt{c_n}}k_n.$$ Substituting this recurrence into the Bellman equation gives $V_n=a_n/k_n$, where $$a_n\equiv\left(\sqrt{a_{n+1}}+\sqrt{c_n}\right)^2$$ and $a_{N+1}=0$. Solving recursively gives $$\sqrt{a_n}=\sum_{m=n}^N\sqrt{c_m}$$ for each $n$, from which it follows that $$k_{n+1}=\frac{\sum_{m=n+1}^N\sqrt{c_m}}{\sum_{m=1}^N\sqrt{c_m}}k$$ and $$s_n=\frac{k}{\sqrt{c_n}\sum_{m=1}^N\sqrt{c_m}}.$$ So we get the same optimal speed sequence and total time as obtained using the Hamiltonian. We also see the square-root scaling from the two-lap case generalize to the $N$-lap case. For example, if the costs I face in the first half of the race are four times the costs I face in the second, then I should run half as fast in the first half as I run in the second.
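
The recursion can also be run numerically as a backward pass for the coefficients $a_n$ followed by a forward pass for the energy path. The sketch below does this for a hypothetical four-lap cost sequence and compares the implied speeds with the closed form:

```r
costs = c(4, 1, 9, 1)  # Hypothetical per-lap costs
k = 10                 # Hypothetical initial energy
N = length(costs)

# Backward pass: a_{N+1} = 0 and a_n = (sqrt(a_{n+1}) + sqrt(c_n))^2
a = numeric(N + 1)
for (n in N:1) {
  a[n] = (sqrt(a[n + 1]) + sqrt(costs[n]))^2
}

# Forward pass: energy carried into each lap, starting from k_1 = k
energy = numeric(N + 1)
energy[1] = k
for (n in 1:N) {
  energy[n + 1] = sqrt(a[n + 1]) / (sqrt(a[n + 1]) + sqrt(costs[n])) * energy[n]
}

# Speeds implied by the dynamic constraint, versus the closed form
speeds = -diff(energy) / costs
closed_form = k / (sqrt(costs) * sum(sqrt(costs)))

rbind(speeds, closed_form)
```

The two rows should agree, the last element of energy should be zero, and a[1] / k should equal the total time sum(1 / speeds).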

Solution properties

As explained above, each speed term $s_n$ scales with the inverse square-root of the corresponding cost term $c_n$. This scaling takes advantage of the variation in costs faced during my race. But scaling all of the cost terms has a linear effect: doubling each $c_n$ halves each $s_n$ and so doubles my total time $T$. Likewise, doubling my initial energy $k$ doubles each $s_n$ and so halves $T$. These linearities come from the linearity of the dynamic constraint $k_{n+1}=k_n-c_ns_n$.

Rearranging the cost sequence $(c_n)_{n=1}^N$ leads to the same rearrangement of the optimal speed sequence $(s_n)_{n=1}^N$. This is because the sequences satisfy $$\sqrt{c_n}s_n=\frac{k}{\sum_{m=1}^N\sqrt{c_m}},$$ the right-hand side of which doesn’t change if I rearrange the $c_n$. Nor does my minimized time $T$ change. Intuitively, swapping the laps on which I run slow and fast doesn’t change my average pace.

Whereas variation in costs improves my average pace. To see how, let $$\E[c_n]\equiv\frac{1}{N}\sum_{n=1}^Nc_n$$ be the empirical mean cost of energy during my race and let $$\overline{T}\equiv\frac{\E[c_n]N^2}{k}$$ be my optimal time when $c_n=\E[c_n]$ for each $n$. Then $$\begin{align} \overline{T}-T &= \frac{N^2}{k}\left(\E[c_n]-\frac{1}{N^2}\left(\sum_{n=1}^N\sqrt{c_n}\right)^2\right) \\ &= \frac{N^2}{k}\left(\E[\sqrt{c_n}^2]-\E[\sqrt{c_n}]^2\right) \end{align}$$ and therefore $$T=\overline{T}-\frac{N^2}{k}\Var(\sqrt{c_n}),$$ where $\Var(\sqrt{c_n})=\E[\sqrt{c_n}^2]-\E[\sqrt{c_n}]^2$ is the empirical variance of the $\sqrt{c_n}$. So applying a mean-preserving spread to the distribution of $\sqrt{c_n}$ values lowers my optimal time $T$. But this is not the same as increasing the variance in $c_n$. For example, consider the cost sequences $(c_n)_{n=1}^{100}$ and $(c_n')_{n=1}^{100}$ defined by $$c_n=\begin{cases} 145 & \text{if}\ n\le 50 \\ 55 & \text{otherwise} \end{cases}$$ and $$c_n'=\begin{cases} 200 & \text{if}\ n\le 20 \\ 75 & \text{otherwise}. \end{cases}$$ Then $\E[c_n]=\E[c_n']=100$, while $\Var(c_n)=2025$ and $\Var(c_n')=2500$. So the $c_n$ have lower variance than the $c_n'$. But $\Var(\sqrt{c_n})\approx5.3$ is larger than $\Var(\sqrt{c_n'})\approx4.8$, which means my optimal time is smaller under $(c_n)_{n=1}^{100}$ than under $(c_n')_{n=1}^{100}$. Intuitively, I prefer cost sequences with a mix of highs and lows to sequences with a few sharp highs and lots of mild lows.
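
The sketch below reproduces the moments in this example and compares the optimal times directly. It uses population (divide-by-$N$) variances to match the empirical definitions above, and the energy budget $k=100$ is a hypothetical choice:

```r
costs_a = rep(c(145, 55), times = c(50, 50))  # The sequence c_n
costs_b = rep(c(200, 75), times = c(20, 80))  # The sequence c_n'

# Population variance, matching the empirical definition in the text
pop_var = function(x) mean(x^2) - mean(x)^2

# Optimal time from the closed-form solution
optimal_time = function(costs, k) sum(sqrt(costs))^2 / k

k = 100  # Hypothetical energy budget

data.frame(
  sequence = c('c', "c'"),
  mean = c(mean(costs_a), mean(costs_b)),
  var = c(pop_var(costs_a), pop_var(costs_b)),
  var_sqrt = c(pop_var(sqrt(costs_a)), pop_var(sqrt(costs_b))),
  time = c(optimal_time(costs_a, k), optimal_time(costs_b, k))
)
```

The first sequence has the smaller variance in costs but the larger variance in square-root costs, and so the smaller optimal time.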

Replacing $T$ with $T/N$, and letting $x=n/N$ and $N\to\infty$, yields a special (linear) case of the problem discussed in my post on negative splits. ↩︎

Correlation and concatenation

Thu, 17 Nov 2022 00:00:00 +0000

Suppose I have data $(a_i,b_i)_{i=1}^n$ on two random variables $A$ and $B$. I store my data as vectors a and b, and compute their correlation using the cor function in R:

cor(a, b)

## [1] 0.4326075

Now suppose I append a mirrored version of my data by defining the vectors

alpha = c(a, b)
beta = c(b, a)

so that alpha is a concatenation of the $a_i$ and $b_i$ values, and beta is a concatenation of the $b_i$ and $a_i$ values. I compute the correlation of alpha and beta as before:

cor(alpha, beta)

## [1] 0.4288428

Notice that cor(a, b) and cor(alpha, beta) are not equal. This surprised me. How can appending a copy of the same data change the correlation within those data?

The answer is that the concatenated data $(\alpha_i,\beta_i)_{i=1}^{2n}$ have different marginal distributions than the original data $(a_i,b_i)_{i=1}^n$. Indeed one can show that $$\DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\E}{E} \DeclareMathOperator{\Var}{Var} \begin{align} \E[\alpha]=\E[\beta]=\frac{\E[a]+\E[b]}{2} \end{align}$$ and $$\begin{align} \E[\alpha^2]=\E[\beta^2]=\frac{\E[a^2]+\E[b^2]}{2}, \end{align}$$ where $$\E[\alpha]\equiv\frac{1}{2n}\sum_{i=1}^{2n}\alpha_i$$ is the empirical mean of the $\alpha_i$ values, and where $\E[\beta]$, $\E[a]$, and $\E[b]$ are defined similarly. It turns out that $\E[\alpha\beta]=\E[ab]$, but since the marginal distributions are different the empirical correlations are different. In fact $$\Cor(\alpha,\beta)=\frac{\Cov(a,b)-0.25\left(\E[a]-\E[b]\right)^2}{0.5\Var(a)+0.5\Var(b)+0.25\left(\E[a]-\E[b]\right)^2},$$ where $\Cor$, $\Cov$, and $\Var$ are the empirical correlation, covariance, and variance operators. This expression implies that cor(alpha, beta) and cor(a, b) will be equal if the $a_i$ and $b_i$ values have the same means and variances. We can achieve this by scaling a and b before computing their correlation:
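
The moment identities can be checked on simulated data. The sketch below uses hypothetical data (not the vectors a and b from above) and rebuilds cor(alpha, beta) from empirical means alone. It relies on alpha and beta containing the same values in a different order, so they share a mean and a variance:

```r
set.seed(1)  # Hypothetical simulated data
n = 100
a = rnorm(n, mean = 1, sd = 1)
b = 0.5 * a + rnorm(n, mean = 3, sd = 2)

alpha = c(a, b)
beta = c(b, a)

# Empirical moments of the concatenated data
mean_alpha = (mean(a) + mean(b)) / 2
mean_alpha_sq = (mean(a^2) + mean(b^2)) / 2
mean_alpha_beta = mean(a * b)

# Correlation = covariance / variance, since alpha and beta share a variance
cor_from_moments = (mean_alpha_beta - mean_alpha^2) / (mean_alpha_sq - mean_alpha^2)

c(cor(a, b), cor(alpha, beta), cor_from_moments)
```

The last two values should agree with each other but not with the first, because these simulated $a_i$ and $b_i$ values have different means and variances.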

cor(scale(a), scale(b))

## [1] 0.4326075

The scale function de-means its argument and scales it to have unit variance. These operations don’t change the correlation of a and b. But they do change the correlation of alpha and beta:

alpha = c(scale(a), scale(b))
beta = c(scale(b), scale(a))

cor(alpha, beta)

## [1] 0.4326075

Now the two correlations agree!

I came across this phenomenon while writing my previous post, in which I discuss the degree assortativity among nodes in Zachary’s (1977) karate club network. One way to measure this assortativity is to use the assortativity_degree function in igraph:

library(igraph)

G = make_graph('Zachary')

assortativity_degree(G)

## [1] -0.4756131

This function returns the correlation of the degrees of adjacent nodes in G. Another way to compute this correlation is to

construct a matrix el in which rows correspond to edges and columns list incident nodes;
define the vectors d1 and d2 of degrees among the nodes listed in el;
compute the correlation of d1 and d2 using cor.

Here’s what I get when I take those three steps:

el = as_edgelist(G)

d = degree(G)
d1 = d[el[, 1]]  # Ego degrees
d2 = d[el[, 2]]  # Alter degrees

cor(d1, d2)

## [1] -0.4769563

Notice that cor(d1, d2) disagrees with the value of assortativity_degree(G) computed above. This is because the vectors d1 and d2 have different means and variances:

c(mean(d1), mean(d2))

## [1] 7.487179 8.051282

c(var(d1), var(d2))

## [1] 25.94139 32.23110

These differences come from el listing each edge only once: it includes a row c(i, j) for the edge between nodes $i$ and $j\not=i$, but not a row c(j, i). Whereas assortativity_degree accounts for edges being undirected by adding the row c(j, i) before computing the correlation. This is analogous to the “append the mirrored data” step I took to create $(\alpha_i,\beta_i)_{i=1}^{2n}$ above. Appending the mirror of el to itself before computing cor(d1, d2) returns the same value as assortativity_degree(G):

el = rbind(
  el,
  matrix(c(el[, 2], el[, 1]), ncol = 2)  # el's mirror
)

d1 = d[el[, 1]]
d2 = d[el[, 2]]

c(assortativity_degree(G), cor(d1, d2))

## [1] -0.4756131 -0.4756131

The friendship paradox

Wed, 16 Nov 2022 00:00:00 +0000

People tend to be less popular than their friends. This paradox, first observed by Feld (1991), is due to popular people appearing on many friend lists.

For example, consider the social network among members of a karate club studied by Zachary (1977):

The network contains $n=34$ nodes with mean degree $$\DeclareMathOperator{\Corr}{Corr} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\E}{E} \DeclareMathOperator{\Var}{Var} \E[d_i]\equiv\frac{1}{n}\sum_{i=1}^nd_i=4.59,$$ where $\E$ takes expected values across nodes and $d_i$ is the degree of node $i$. If $N_i$ denotes the set of $i$'s neighbors, then the mean degree among those neighbors equals $$f_i\equiv \frac{1}{d_i}\sum_{j\in N_i}d_j.$$ The friendship paradox states that $\E[d_i]\le\E[f_i]$ in any network. In Zachary’s network we have $\E[f_i]=9.61$, about twice the mean degree.

We can approximate $\E[f_i]$ using the following procedure:

Choose a stub (i.e., the endpoint of an edge) uniformly at random.
Record the degree of the node to which the chosen stub belongs.

Repeating these steps many times yields a degree distribution that over-samples from high-degree nodes. The mean of this distribution answers the following question: “How many friends does a typical friend have?” The probability of choosing node $i$ in the first step equals $$p_i\equiv \frac{d_i}{\sum_{j=1}^nd_j},$$ the proportion of stubs that $i$ adds to the network. Using the probabilities $p_i$ to compute the expected value of the degrees $d_i$ yields an approximation $$\begin{align} \widehat{\E[f_i]} &= \sum_{i=1}^np_id_i \\ &= \sum_{i=1}^n\left(\frac{d_i}{\sum_{j=1}^nd_j}\right)d_i \\ &= \frac{\sum_{i=1}^nd_i^2}{\sum_{j=1}^nd_j} \\ &= \frac{\E[d_i^2]}{\E[d_i]} \\ &= \E[d_i]+\frac{\Var(d_i)}{\E[d_i]} \end{align}$$ of $\E[f_i]$. Notice that if $\Var(d_i)=0$ then $\widehat{\E[f_i]}=\E[d_i]$; in that case, everyone has the same degree as their friends, and so there is no friendship paradox. The difference between the mean degree $\E[d_i]$ and the typical friend’s degree $\widehat{\E[f_i]}$ grows as the variance in degrees grows.
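
The sketch below applies the stub-sampling procedure to a hypothetical degree sequence (not Zachary's). It compares the simulated mean with the exact value $\E[d_i^2]/\E[d_i]$ and with the mean-plus-variance decomposition, using the population variance:

```r
set.seed(1)

d = c(1, 1, 2, 2, 3, 5)  # Hypothetical degree sequence

# Probability that a randomly chosen stub belongs to each node
p = d / sum(d)

# Exact degree-weighted mean degree
approx_exact = sum(p * d)

# Same quantity via the mean and (population) variance of degrees
pop_var = mean(d^2) - mean(d)^2
approx_decomp = mean(d) + pop_var / mean(d)

# Simulated version: sample nodes in proportion to their stubs
draws = sample(d, size = 1e5, replace = TRUE, prob = p)

c(exact = approx_exact, decomposition = approx_decomp, simulated = mean(draws))
```

The first two values are equal by construction, and the simulated mean should be close to both.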

The approximation $\widehat{\E[f_i]}$ is closest to $\E[f_i]$ when there is no assortative mixing with respect to degree. Then the $d_i$ are uncorrelated with the $f_i$. But this isn’t true in Zachary’s network:

Indeed, in Zachary’s network we have $\widehat{\E[f_i]}=7.77$, which is smaller than the true value $\E[f_i]=9.61$. To see why, notice that $$\begin{align} \E[d_if_i] &= \frac{1}{n}\sum_{i=1}^nd_if_i \\ &= \frac{1}{n}\sum_{i=1}^n\sum_{j\in N_i}d_j \\ &\overset{\star}{=}\frac{1}{n}\sum_{j=1}^nd_j^2 \\ &= \E[d_j^2], \end{align}$$ where $\star$ holds because $j$ appears in $d_j$ neighborhoods $N_i$. But $$\E[d_if_i]=\E[d_i]\,\E[f_i]+\Cov(d_i,f_i)$$ by the definition of covariance. Dividing both sides by $\E[d_i]$ and recalling that $\widehat{\E[f_i]}=\E[d_i^2]/\E[d_i]$ gives $$\widehat{\E[f_i]}=\E[f_i]+\frac{\Cov(d_i,f_i)}{\E[d_i]}.$$ Thus $\widehat{\E[f_i]}$ under-estimates $\E[f_i]$ in Zachary’s network because $\Cov(d_i,f_i)=-8.45$ is negative.
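The relationship between the approximation, the true mean of friends' degrees, and the covariance can be checked numerically. The sketch below does so on a small hypothetical network (the edge list is an assumption for illustration, not Zachary's data):

```python
# Hypothetical undirected network (not Zachary's data).
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]

# Build each node's neighbor set N_i.
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

nodes = sorted(neighbors)
d = {i: len(neighbors[i]) for i in nodes}                       # degrees d_i
f = {i: sum(d[j] for j in neighbors[i]) / d[i] for i in nodes}  # friends' mean degrees f_i


def mean(values):
    values = list(values)
    return sum(values) / len(values)


mean_d = mean(d.values())
mean_f = mean(f.values())
mean_df = mean(d[i] * f[i] for i in nodes)
mean_d2 = mean(d[i] ** 2 for i in nodes)
cov_df = mean_df - mean_d * mean_f

approx = mean_d2 / mean_d            # stub-based approximation of E[f]
identity = mean_f + cov_df / mean_d  # E[f] + Cov(d, f) / E[d]

print(mean_f, approx, identity, cov_df)
```

Here the hub's neighbors have low degrees, so the covariance is negative and the approximation (2.0) falls below the true mean of friends' degrees (about 2.27), mirroring the pattern in Zachary's network.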

The value of $\widehat{\E[f_i]}$ depends only on the mean and variance of degrees, and not the correlation of degrees across adjacent nodes. Thus $\widehat{\E[f_i]}$ is invariant to degree-preserving randomizations (DPRs). But $\E[f_i]$ can vary under DPRs because they can change the correlation of adjacent nodes’ degrees. For example, consider the three networks shown below:

The networks $G_1$, $G_2$, and $G_3$ have the same degree distributions. As a result, they have the same mean degrees $\E[d_i]$ and approximations $\widehat{\E[f_i]}$ of $\E[f_i]$. But the true values of $\E[f_i]$ differ because the correlations $\Corr(d_i,f_i)$ differ:

Network	`$\E[d_i]$`	`$\widehat{\E[f_i]}$`	`$\E[f_i]$`	`$\Corr(d_i,f_i)$`
`$G_1$`	1.43	1.6	1.43	1.00
`$G_2$`	1.43	1.6	1.57	0.20
`$G_3$`	1.43	1.6	1.71	-0.91

The network $G_1$ is perfectly assortative with respect to degree, so $\widehat{\E[f_i]}$ over-estimates $\E[f_i]$. Whereas $G_3$ is dis-assortative with respect to degree, so $\widehat{\E[f_i]}$ under-estimates $\E[f_i]$. The network $G_2$ is relatively unsorted, so $\widehat{\E[f_i]}$ is close to $\E[f_i]$.

Binary distributions and risky gambles

Sun, 13 Nov 2022 00:00:00 +0000

This post shows how binary random variables can be defined by their mean, variance, and skewness. I use this fact to explain why variance does not (always) measure “riskiness.”

Suppose I’m defining a random variable $X$. It takes value $H$ or $L<H$, with $\Pr(X=H)=p$. I want $X$ to have mean $\mu$, variance $\sigma^2$, and skewness coefficient $$\DeclareMathOperator{\E}{E} s\equiv\E\left[\left(\frac{X-\mu}{\sigma}\right)^3\right].$$ The target parameters $(\mu,\sigma,s)$ uniquely determine $(H,L,p)$ via $$\begin{align} H &= \mu+\frac{s+\sqrt{s^2+4}}{2}\sigma \\ L &= \mu+\frac{s-\sqrt{s^2+4}}{2}\sigma \\ p &= \frac{2}{4+s\left(s+\sqrt{s^2+4}\right)}. \end{align}$$
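These formulas translate directly into code. The sketch below implements the mapping and then recomputes the mean, standard deviation, and skewness from the resulting $(H,L,p)$ as a consistency check. The function names and the illustrative targets are my own choices:

```python
import math


def binary_from_moments(mu, sigma, s):
    """Return (H, L, p) for a binary variable with the given mean, std. dev., and skewness."""
    root = math.sqrt(s ** 2 + 4)
    high = mu + (s + root) / 2 * sigma
    low = mu + (s - root) / 2 * sigma
    p = 2 / (4 + s * (s + root))
    return high, low, p


def moments(high, low, p):
    """Recover (mean, std. dev., skewness) from (H, L, p)."""
    mu = p * high + (1 - p) * low
    var = p * (high - mu) ** 2 + (1 - p) * (low - mu) ** 2
    sigma = math.sqrt(var)
    skew = (p * (high - mu) ** 3 + (1 - p) * (low - mu) ** 3) / sigma ** 3
    return mu, sigma, skew


# Illustrative targets, chosen only to exercise the formulas.
H, L, p = binary_from_moments(mu=0.0, sigma=1.0, s=1.5)
print(H, L, p)
print(moments(H, L, p))
```

With these targets the mapping gives $H=2$, $L=-0.5$, and $p=0.2$, and the recovered moments match the targets up to floating-point error.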

For example, if I want $X$ to be symmetric (i.e., to have $s=0$) then I have to choose $(H,L,p)=(\mu+\sigma,\mu-\sigma,0.5)$. Increasing the target skewness $s$ makes the upside $(H-\mu)$ larger but less likely, and the downside $(\mu-L)$ smaller but more likely:

This mapping between $(\mu,\sigma,s)$ and $(H,L,p)$ is useful for generating examples of “risky” gambles. Intuition suggests that a gamble is less risky if its payoffs have lower variance. But Rothschild and Stiglitz (1970) define a gamble $A$ to be less risky than gamble $B$ if every risk averse decision-maker (DM) prefers $A$ to $B$. These two definitions of “risky” agree when

payoffs are normally distributed, or
DMs have quadratic utility functions.

Under those conditions, DMs’ expected utility depends only on the payoffs’ mean and variance. But if neither condition holds then DMs also care about payoffs’ skewness. We can demonstrate this using binary gambles. Consider these three:

Gamble $A$'s payoffs have mean $\mu_A=10$, variance $\sigma_A^2=36$, and skewness $s_A=0$;
Gamble $B$'s payoffs have mean $\mu_B=10$, variance $\sigma_B^2=144$, and skewness $s_B=5$;
Gamble $C$'s payoffs have mean $\mu_C=10$, variance $\sigma_C^2=9$, and skewness $s_C=-3$.

The means are the same but the distributions are different. Gamble $i\in\{A,B,C\}$ gives me a random payoff $X_i$, which equals $H_i$ with probability $p_i$ and $L_i$ otherwise. We can compute the $(H_i,L_i,p_i)$ using the target parameters $(\mu_i,\sigma_i,s_i)$ and the formulas above:

Gamble `$i$`	`$H_i$`	`$L_i$`	`$p_i$`
`$A$`	16.00	4.00	0.50
`$B$`	72.31	7.69	0.04
`$C$`	10.91	0.09	0.92

Gamble $A$ offers a symmetric payoff: its upside $(H_A-\mu_A)$ and downside $(\mu_A-L_A)$ are equally large and equally likely. Gamble $B$ offers a positively skewed payoff: a large but unlikely upside, and a small but likely downside. Gamble $C$ offers a negatively skewed payoff: a small but likely upside, and a large but unlikely downside.

These upsides and downsides affect my preferences over gambles. Suppose I get utility $u(x)\equiv\log(x)$ from receiving payoff $x$. Then gamble $A$ gives me expected utility $$\begin{align} \E[u(X_A)] &\equiv p_Au(H_A)+(1-p_A)u(L_A) \\ &= 0.5\log(16)+(1-0.5)\log(4) \\ &= 2.08, \end{align}$$ while $B$ gives me $\E[u(X_B)]=2.12$ and $C$ gives me $\E[u(X_C)]=1.99$. So I prefer gamble $B$ to $A$, even though $B$'s payoffs have four times the variance of $A$'s. I also prefer $B$ to $C$, even though $B$'s payoffs have sixteen times the variance of $C$'s. How can I be risk averse—that is, have a concave utility function—but prefer gambles with higher variance? The answer is that I also care about skewness: I prefer gambles with large upsides and small downsides. These “sides” of risk are not captured by variance.

So is gamble $C$ “riskier” than gambles $A$ and $B$? Rothschild and Stiglitz wouldn’t say so. To see why, suppose my friend has utility function $v(x)=\sqrt{x}$. Then gamble $A$ gives him expected utility $\E[v(X_A)]=3$, while $B$ gives him $\E[v(X_B)]=2.98$ and $C$ gives him $\E[v(X_C)]=3.05$. My friend and I have opposite preferences: he prefers $C$ to $A$ to $B$, whereas I prefer $B$ to $A$ to $C$. But we’re both risk averse: our utility functions are both concave! Thus, it isn’t true that every risk-averse decision-maker prefers $A$ or $B$ to $C$. Different risk-averse DMs have different preference rankings. This makes the three gambles incomparable under Rothschild and Stiglitz’s definition of “risky.”
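The expected utilities above can be reproduced with a short script. It is a sketch that assumes natural logarithms for my utility function and uses the target parameters of gambles $A$, $B$, and $C$ listed earlier. The helper names are hypothetical:

```python
import math


def binary_from_moments(mu, sigma, s):
    """Return (H, L, p) for a binary variable with the given mean, std. dev., and skewness."""
    root = math.sqrt(s ** 2 + 4)
    high = mu + (s + root) / 2 * sigma
    low = mu + (s - root) / 2 * sigma
    p = 2 / (4 + s * (s + root))
    return high, low, p


def expected_utility(u, mu, sigma, s):
    """Expected utility of the binary gamble with the given target parameters."""
    high, low, p = binary_from_moments(mu, sigma, s)
    return p * u(high) + (1 - p) * u(low)


# (mean, std. dev., skewness) for each gamble, taken from the text.
gambles = {"A": (10, 6, 0), "B": (10, 12, 5), "C": (10, 3, -3)}

log_eu = {name: expected_utility(math.log, *params) for name, params in gambles.items()}
sqrt_eu = {name: expected_utility(math.sqrt, *params) for name, params in gambles.items()}

# Rank gambles from most to least preferred under each utility function.
my_ranking = sorted(gambles, key=log_eu.get, reverse=True)
friend_ranking = sorted(gambles, key=sqrt_eu.get, reverse=True)

print(log_eu, my_ranking)
print(sqrt_eu, friend_ranking)
```

The two rankings come out reversed: $B$, $A$, $C$ under log utility and $C$, $A$, $B$ under square-root utility.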

Estimating treatment effects with OLS

Sat, 12 Nov 2022 00:00:00 +0000

A crop farmer wonders if he should use a new fertilizer. He asks his peers what fertilizer they use and what their annual yields are. He notices that some have different soil. “That’s annoying,” the farmer thinks. “If we all had the same soil, then I could estimate the benefit of using the new fertilizer by comparing the mean yields among farmers who do and don’t use it. But now I have to control for soil too!”

Thankfully the farmer learned about ordinary least squares in his youth. He remembers that he can control for variables by including them in a regression equation. He posits a linear model $$\text{yield}=\beta_1\text{fert}+\beta_2\text{soil}+\epsilon,$$ where

$\text{fert}$ indicates using the new fertilizer,
$\text{soil}$ indicates having a different soil,
$\beta_1$ and $\beta_2$ are the average marginal effects of changing fertilizers and soils, and
$\epsilon$ is an iid random error.

The farmer estimates $\beta_1$ and $\beta_2$ using OLS, and gets the following results:

Coefficient	Estimate	Std. error
`$\beta_1$`	0.787	0.210
`$\beta_2$`	1.013	0.211

The farmer’s daughter enters his office. She looks at his estimates and asks, “Why don’t you just compare the mean yields among farmers with the same soil as you? That seems less complicated than OLS.” The farmer agrees. He computes the conditional means $$\mu_{10}\equiv\mathrm{E}[\text{yield}\mid\text{fert}=1\ \text{and}\ \text{soil}=0]$$ and $$\mu_{00}\equiv\mathrm{E}[\text{yield}\mid\text{fert}=0\ \text{and}\ \text{soil}=0]$$ in his data, and finds that $\mu_{10}-\mu_{00}=0.965$. This surprises the farmer: “I thought OLS controlled for variation in soil. I expected it to give me the same result as computing the difference in conditional means. But it doesn’t. Why not?”

The farmer has an idea: “What if I include an interaction term?” He posits an extended model $$\text{yield}=\gamma_1\text{fert}+\gamma_2\text{soil}+\gamma_3(\text{fert}\cdot\text{soil})+\epsilon,$$ estimates it via OLS, and gets the following results:

Coefficient	Estimate	Std. error
`$\gamma_1$`	0.965	0.290
`$\gamma_2$`	1.208	0.303
`$\gamma_3$`	-0.377	0.422

“Interesting,” he thinks. “OLS gives me the difference in conditional means if I include an interaction term, but not if I don’t. I wonder what’s going on?”

What’s going on is that $\beta_1$ and $\gamma_1$ measure different things. The latter measures the average effect of using the new fertilizer without changing soils. Thus $\gamma_1=\mu_{10}-\mu_{00}$ by definition. Whereas $\beta_1$ measures the average effect of using the new fertilizer across all soils. Thus $$\beta_1=(1-p)\left(\mu_{10}-\mu_{00}\right)+p\left(\mu_{11}-\mu_{01}\right),$$ where $p=\Pr(\text{soil}=1)$ is the share of the farmer’s peers who have a different soil, and $$\mu_{fs}\equiv\mathrm{E}[\text{yield}\mid\text{fert}=f\ \text{and}\ \text{soil}=s]$$ is the mean yield among peers with $\text{fert}=f\in\{0,1\}$ and $\text{soil}=s\in\{0,1\}$. The farmer’s data has $p=0.47$ and $\mu_{11}-\mu_{01}=0.587$, giving $$\beta_1=(1-0.47)\times0.965+0.47\times0.587=0.787$$ as in the first table above.
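The weighted average is easy to verify with the numbers reported above. The sketch below also computes the gap between the two soil-specific effects. If the interaction model is saturated (an assumption on my part, since that requires an intercept that the displayed equation does not show), this gap is what $\gamma_3$ measures:

```python
# Values reported in the text.
p = 0.47                  # share of peers with a different soil
effect_same_soil = 0.965  # mu_10 - mu_00
effect_diff_soil = 0.587  # mu_11 - mu_01

# beta_1 as the soil-share-weighted average of the two effects.
beta_1 = (1 - p) * effect_same_soil + p * effect_diff_soil

# Gap between the two effects (the interaction term in a saturated model).
effect_gap = effect_diff_soil - effect_same_soil

print(round(beta_1, 3), round(effect_gap, 3))
```

The weighted average rounds to 0.787, matching the first table. The gap is about -0.378, which agrees with the reported estimate of $\gamma_3$ up to rounding of the inputs.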

The OLS estimates of $\beta_1$ and $\gamma_1$ differ whenever the effect of using the new fertilizer varies across soils; that is, whenever $\gamma_3\not=0$ in the true model. But they can also differ when $\gamma_3=0$ due to sampling variation. For example, suppose the true model is $$\text{yield}=\text{fert}+\text{soil}+\epsilon,$$ where $\text{fert}$ and $\text{soil}$ are independent, and where $\epsilon$ is iid normally distributed. The differences $(\mu_{10}-\mu_{00})$ and $(\mu_{11}-\mu_{01})$ in conditional means can differ in small samples because $\text{soil}$ and $\epsilon$ can be correlated by chance. But this spurious correlation disappears as the sample grows, making $\beta_1$ and $\gamma_1$ converge. I demonstrate this convergence in the table below. It shows the mean absolute difference between $\beta_1$ and $\gamma_1$ across many samples of increasing size $n$:

`$n$`	`$\mathrm{E}\left[\lvert\beta_1-\gamma_1\rvert\right]$`
100	0.160
1,000	0.050
10,000	0.014

Thanks to Anirudh Sankar for reading a draft version of this post.

Why do experts give simple advice?

Sun, 25 Sep 2022 00:00:00 +0000

One of the requirements for my PhD program is to write a “second-year paper.” You can read mine here. It discusses how career concerns impact the type of advice that experts provide. I consider two types of advice:

“Simple” advice of the form “take this action;”
“Complex” advice of the form “take this action under these conditions.”

Including conditions makes the expert seem more confident his advice is correct. This hurts his reputation if his advice turns out to be incorrect. Then the advisee infers that the expert is incompetent. She says, “most wrong experts are incompetent. You’re wrong, so you’re probably incompetent. You’re fired!”

The expert can avoid this fate by “simplifying” his advice: by excluding relevant conditions. This makes the advice worse but prevents the advisee from learning about the expert’s competence. It insures him against the risk of losing his job.

The paper formalizes this argument. It explores how the expert’s choice between simple and complex advice depends on his incentives. It explains my answer to the titular question: experts give simple advice to avoid being “confidently wrong.”

Dollar cost averaging

Sat, 17 Sep 2022 00:00:00 +0000

Dollar cost averaging (DCA) is a way to split a lump sum investment into many smaller investments. It involves regular purchases of a fixed value (rather than quantity) of shares. This leads to buying more shares when their price is low and fewer when their price is high. DCA is less risky than investing the lump sum because:

it reduces the chance of buying lots of shares before their price rises or falls;
it reduces the time that invested cash spends earning capital gains and losses.

But DCA is also less rewarding if prices trend upward because uninvested cash does not earn capital gains. In that case, choosing between DCA and lump sum investment requires trading off risks and rewards.

For example, suppose I have some cash to invest in a market index: the S&P 500. Here’s how that index evolved over the past five years (based on week-closing values from FRED):

The index grew overall, with a sharp drop at the start of the pandemic and a slower drop at the start of this year. The weekly return fluctuated around a mean of 0.2%:

Let’s assume future weekly returns will follow this distribution. Should I invest all my cash now (the “lump sum” strategy) or split it into equal weekly investments (the “weekly DCA” strategy)? How about equal monthly investments (the “monthly DCA” strategy)?

We can answer these questions via simulation:¹

Sample 52 values from the S&P 500’s weekly return distribution.
Take the cumulative product of the gross returns (one plus each sampled return) to get a simulated price path.
Divide the cash invested each week by the simulated price for that week to get the number of shares bought that week.
Multiply the total number of shares bought by the price in the 52nd week to get the investments’ final value.
Divide the final value by the amount of cash invested to get the annual return.

Repeating these five steps many times yields a distribution of annual returns offered by each strategy. I compare those distributions in the table below, based on 1,000 simulated price paths.

Strategy	Mean	Std. dev.	Min.	Max.
Lump sum	11.8%	22.2%	-46.1%	95.0%
Weekly DCA	5.6%	11.9%	-26.1%	56.2%
Monthly DCA	6.0%	12.5%	-27.0%	57.3%

The return on the lump sum strategy has the highest mean and variance. Investing all my cash in the first week gives me more time “in the market” earning capital gains, but exposes me to lots of random gains and losses. Investing in smaller chunks limits my exposure to gains and losses, narrowing the distribution of annual returns.
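The five simulation steps can be sketched in Python. This is not the code behind the table: it draws weekly returns from a normal distribution with mean 0.2% and standard deviation 2.8% instead of resampling historical returns, and it treats "monthly" as every fourth week. Both are assumptions, as are the function and variable names:

```python
import random
import statistics


def simulate_annual_returns(schedule, n_paths=2000, weeks=52,
                            mean=0.002, sd=0.028, seed=0):
    """Simulate annual returns for a strategy that invests schedule[t] in week t.

    Assumption: weekly returns are normally distributed. The post samples
    from the S&P 500's historical weekly returns instead.
    """
    rng = random.Random(seed)
    total_cash = sum(schedule)
    annual_returns = []
    for _ in range(n_paths):
        # Steps 1-2: sample weekly returns and compound them into a price path.
        price, prices = 1.0, []
        for _ in range(weeks):
            price *= 1 + rng.gauss(mean, sd)
            prices.append(price)
        # Step 3: shares bought each week = cash invested / price that week.
        shares = sum(cash / p for cash, p in zip(schedule, prices))
        # Steps 4-5: value shares at the final price, relative to cash invested.
        annual_returns.append(shares * prices[-1] / total_cash - 1)
    return annual_returns


weeks = 52
strategies = {
    "Lump sum": [1.0] + [0.0] * (weeks - 1),
    "Weekly DCA": [1.0 / weeks] * weeks,
    # Assumption: "monthly" means every fourth week (13 purchases).
    "Monthly DCA": [1.0 / 13 if t % 4 == 0 else 0.0 for t in range(weeks)],
}

summary = {}
for name, schedule in strategies.items():
    returns = simulate_annual_returns(schedule)
    summary[name] = (statistics.mean(returns), statistics.stdev(returns))
    print(f"{name}: mean {summary[name][0]:.1%}, std. dev. {summary[name][1]:.1%}")
```

Reusing the same seed means every strategy faces the same simulated price paths, so differences in the summary reflect the strategies rather than sampling noise.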

So, should I dollar cost average or not? The answer depends on my risk tolerance. If I don’t care about risk then I should choose the strategy with the highest mean return. But if I’m risk averse then I need to be paid a risk premium. The more risk averse I am and the riskier the strategy, the higher the risk premium. I should choose the strategy with the highest return net of its risk premium. This net, “certainty-equivalent” (CE) return equals the return on a riskless strategy that makes me indifferent between using it and using the risky strategy.

For example, the chart below plots the CE return on each strategy when I have constant relative risk aversion. When my risk aversion is low, I prefer investing the lump sum. But when my risk aversion is high, I prefer investing in smaller chunks. Weekly and monthly chunks appear to deliver similar CE returns in my simulations.
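One way to compute a CE return under constant relative risk aversion is sketched below. It assumes utility is defined over the gross return on one unit of wealth, which is my normalization rather than something stated above. The sample returns are hypothetical and are not the simulated S&P 500 returns:

```python
import math


def certainty_equivalent(returns, risk_aversion):
    """CE return for equally likely net returns under CRRA utility over gross returns."""
    gross = [1 + r for r in returns]
    if risk_aversion == 1:
        # Log utility is the limiting case of CRRA as risk aversion approaches one.
        return math.exp(sum(math.log(g) for g in gross) / len(gross)) - 1
    power = 1 - risk_aversion
    mean_power = sum(g ** power for g in gross) / len(gross)
    return mean_power ** (1 / power) - 1


# Hypothetical return samples for illustration only.
risky = [-0.20, 0.05, 0.50]  # higher mean, wider spread
safe = [0.04, 0.06, 0.08]    # lower mean, narrower spread

for rho in [0, 1, 2, 5, 10]:
    print(rho, certainty_equivalent(risky, rho), certainty_equivalent(safe, rho))
```

With zero risk aversion the CE return equals the mean return, so the risky sample wins. As risk aversion rises, the risky sample's CE return falls faster and eventually drops below the safe sample's.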

The risk aversion level that makes me prefer DCA depends on the asset I invest in. For example, suppose I’d rather invest in bitcoin. Its recent prices were much more volatile than the S&P 500 (according to week-closing values from Yahoo Finance):

Investing in bitcoin offered a mean weekly return of 1.3% in the past five years, six times that of the S&P 500. But bitcoin’s returns were riskier: they had a standard deviation of 11.0%, whereas the S&P 500’s returns had a standard deviation of 2.8%.

The chart below compares the lump-sum, weekly DCA, and monthly DCA strategies for investing in bitcoin. It shows the certainty-equivalent return on each strategy, based on 1,000 price paths simulated using the five steps described above. My decision rule is the same as when investing in the S&P 500: use DCA if I’m sufficiently risk averse. But the “sufficient” level of risk aversion for bitcoin is lower than for the S&P 500. This is because bitcoin is riskier: its risk premium is a larger share of its mean return.

One benefit of DCA that my simulations don’t capture is its simplicity: I don’t have to think about when to invest the lump sum. Indeed DCA removes the temptation to time the market that leads many investors astray.

Disclaimer: I am not a financial advisor and this post is not financial advice. Do your own research on the investments that feel right to you. Don’t invest money you can’t afford to lose.

I assume my uninvested cash earns interest at the inflation rate. This means I can treat the simulated prices as real. I also assume there are no transaction costs or brokerage fees. ↩︎

Homophily and the strength of moderate ties

Fri, 16 Sep 2022 00:00:00 +0000

Yesterday Science published a study on social networks and job mobility. It suggests that there’s a causal, “inverted U-shaped” relationship between

the number of mutual friends you share with someone and
the probability that befriending them leads you to change jobs.

The authors call for a theory to explain this relationship. In fact it has a simple explanation: homophily.

People tend to have friends with similar interests. Those interests influence the jobs we want and hear about. If you and I have lots of mutual friends, then I probably hear about lots of job opportunities that interest you. But you probably hear about those opportunities too because you talk to the same people and follow the same news sources. So befriending me is unlikely to impact your job mobility because the information I could give you is redundant.

The opposite is true if we have few mutual friends. Then we probably hear about different job opportunities because we talk to different people and follow different news sources. But few of the opportunities I hear about will interest you. So befriending me is unlikely to impact your job mobility because the information I could give you is irrelevant.

Thus, homophily creates a trade-off between relevance and redundancy. Befriending “strong ties” (i.e., people with lots of mutual connections) provides information that is relevant but redundant. Befriending “weak ties” (i.e., people with few or no mutual connections) provides information that is irrelevant but novel. Befriending “moderate” ties balances relevance and redundancy. It lets you hear about opportunities you find interesting and wouldn’t hear otherwise.

If everyone acts on those opportunities, then we should see the relationship suggested by the Science study.

Reflections on grad school: Years 1 and 2

Tue, 06 Sep 2022 00:00:00 +0000

This post reflects on the first two years of my economics PhD at Stanford University. I discuss my first- and second-year coursework, and my quality of life as a grad student.

First-year courses

I spent the first year taking the “core” micro, macro, and econometrics courses. Most of their content was familiar from undergrad. Some of my classmates waived out of courses they’d taken before. I didn’t because I hadn’t. I also wanted to be “in sync” with my classmates: working on the same problems, facing the same stresses, and celebrating the same milestones.

I found macro the most rewarding and metrics the least. In macro I learned how to solve dynamic optimization problems. I used that skill in a blog post and term paper on non-macro topics. In contrast, I’m about as good at econometrics as I was before starting my PhD. I know more ways to compute standard errors, but I’ve still never run a diff-in-diff.

Most of our assessment was via problem sets. They tended to focus on technical minutiae rather than fundamental insights. I seldom found them educational. Working in groups made them less educational. In theory, group-work involved discussions that helped everyone learn. In practice, it involved “dividing and conquering:” splitting problems among group members to work on alone.

We had no qualifying or in-person exams. Some courses had final assignments, but we did them at home. So I saw no reason to study. Instead I waited until we got our assignments and learned only what I needed. I didn’t want to waste time studying topics I didn’t care about. And I never forgot what the department chair said on our first day:

“Grades don’t matter. What matters is whether you do good research.”

One consequence of not studying was ending the year feeling like I knew less economics. I gained more awareness than knowledge, so my ratio of “known knowns” to “known unknowns” fell. But awareness is still useful: I know what keywords to search if I need to learn something in the future.

Second-year courses

I kept taking courses in my second year. But I got to choose my courses based on the fields I chose to specialize in. I chose micro theory and behavioral economics. Some of my reasons were:

I like studying simple models of how people behave and interact;
I’d rather argue about modeling assumptions than external validity;
Theory and (especially) behavioral courses had the fewest assessments.

I was scared about my career: the job market for theorists, especially behavioral theorists, is notoriously awful. But I was more scared of doing research I didn’t enjoy. That said, I viewed field choices as administrative only. They didn’t have to confine my research. Indeed I published a paper in my second year that was neither theoretical nor behavioral.

I also had to take “distribution” courses outside my chosen fields. Mine were on market design, political economy, and economic history. I attended some sociology classes because I wanted to meet non-economists who shared my interests. I met some non-economists (including some who were anti-economists), but none shared my interests. They also had very different definitions of “theory.” But I enjoyed hearing their perspectives.

The purpose of the second year was to help us transition from being research consumers to producers. Our assessments reflected that purpose. They included referee reports, research proposals, and term papers. Proposals were helpful for organizing and clarifying my ideas. They weren’t helpful for prompting feedback: I submitted six proposals and got comments once. Instead I got feedback from discussions with professors and classmates. Those discussions made going to class worthwhile.

The best discussions were with people who challenged me to think harder. For example, some professors were known to ask hard questions when students shared their ideas. At first those professors seemed “scary.” Eventually I realized that what made them scary was that they assumed I was intelligent. They wouldn’t let me make hand-wavy arguments or think lazily. I learned to admire those professors and gravitated to them. Sometimes they told me my ideas were shallow or wrong. But I’d rather be wrong in class than in print.

Quality of life

People warned me that grad students have no free time. That has not been my experience. I’ve had plenty of time to exercise, blog, and be unproductive. I had that time because I chose to minimize my coursework. I made that choice because (i) grades don’t matter (see above), and (ii) I saw coursework as a barrier to doing research and enjoying my life.

People also warned me that grad students live in poverty. Again, that has not been my experience. Stanford pays enough that I can dine out occasionally (even at Palo Alto prices), and can eat more than beans and rice at home. I can replace my running shoes and socks when they wear out. I don’t have to worry about hospital bills. I feel privileged rather than poor. Campus housing is expensive, but Stanford deducts rent from my stipend so I don’t notice they’re ripping me off.

In hindsight, I under-appreciated local amenities when I applied to PhD programs. My other options were in Boston, Chicago, and New York. Stanford definitely wins on the weather front: it’s always warm and dry here. We don’t have Chicago’s bitter winters or the east coast’s humid summers. I can go outside to unwind whenever I like. If I couldn’t then I’d go insane.

But Stanford loses on the “fun place to live” front. Palo Alto is small and suburban. It lacks the energy and excitement found in big cities. San Francisco is an hour away by train, which is fine for days out but a hassle for nights out. I prefer running to drinking, so I’m willing to sacrifice bars for sun. But that preference is endogenous.

Paying for the truth

Thu, 01 Sep 2022 00:00:00 +0000

In a previous post, I showed that if the truth doesn’t matter then I’m better off being an ideologue with ideological friends. I discussed the trade-off between (i) experiencing reality and (ii) experiencing what my friends experience. Truth-seeking made sense only when the benefit of (i) exceeded the cost of forgoing (ii). This post discusses another cost of truth-seeking: having to pay—financially, cognitively, or emotionally—for information.

One way to model that cost is as follows.¹ Suppose the truth is determined by a random variable $\theta\in\{0,1\}$. I learn about $\theta$ by observing a signal $s(x)\in\{0,1\}$ with precision $$\Pr(s(x)=\theta)=\frac{1+x}{2}.$$ The parameter $x\in[0,1]$ determines the signal’s quality. If $x=1$ then the signal is fully informative; if $x=0$ then it is uninformative.

My prior estimate $\theta_0\in[0.5,1]$ of $\theta$ is based on no information; it reflects my ideology. I use the realization of $s(x)$ and my prior $\theta_0$ to form a posterior estimate $$\hat\theta(s(x))=\Pr\left(\theta=1\,\vert\,s(x)\right)$$ via Bayes’ rule. I care about the mean squared error $$\newcommand{\E}{\mathrm{E}} \newcommand{\MSE}{\mathrm{MSE}} \MSE(x)=\E\left[\left(\theta-\hat\theta(s(x))\right)^2\right]$$ of my posterior estimate, where $\E$ is the expectation operator taken with respect to the joint distribution of $\theta$ and $s(x)$ given my prior $\theta_0$. But I also care about the cost $cx$ I endure from observing a signal of quality $x$. This cost reflects the resources I use to seek the information and process it (e.g., money, time, and mental energy). I choose the quality $x^*$ that minimizes $$f(x)=\MSE(x)+cx.$$ The chart below plots my objective $f(x)$ against $x$ when I have prior $\theta_0\in\{0.5,0.7,0.9\}$ and face marginal cost $c\in\{0,0.1,0.2,0.3\}$. Since $f$ is concave in $x$, it has (constrained) local minima at $x=0$ and $x=1$. My choice between these minima depends on the value of $c$. If it’s small then information is cheap and I “buy” as much as I can. If it’s large then information is expensive and I don’t buy any. But there’s no middle ground: I seek all the truth or none of it.
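The objective is simple enough to evaluate directly. The sketch below computes $\MSE(x)$ from Bayes' rule and finds the minimizing quality by grid search. The function names and grid size are my own choices:

```python
def mse(x, prior):
    """Expected squared error of the Bayesian posterior given signal quality x."""
    q = (1 + x) / 2  # Pr(signal = theta)
    total = 0.0
    for signal in (0, 1):
        like_1 = q if signal == 1 else 1 - q  # Pr(signal | theta = 1)
        like_0 = 1 - q if signal == 1 else q  # Pr(signal | theta = 0)
        p_signal = prior * like_1 + (1 - prior) * like_0
        if p_signal > 0:
            posterior = prior * like_1 / p_signal
            # Given the signal, theta is Bernoulli(posterior), so the expected
            # squared error of the posterior equals posterior * (1 - posterior).
            total += p_signal * posterior * (1 - posterior)
    return total


def objective(x, prior, c):
    """Mean squared error plus the linear cost of signal quality."""
    return mse(x, prior) + c * x


def best_quality(prior, c, steps=100):
    """Grid search for the signal quality that minimizes the objective."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda x: objective(x, prior, c))


for c in [0.0, 0.1, 0.2, 0.3]:
    print(c, [best_quality(prior, c) for prior in (0.5, 0.7, 0.9)])
```

Every minimizer the grid search returns is 0 or 1. Since $f(0)=\theta_0(1-\theta_0)$ and $f(1)=c$, the switch from buying all the information to buying none happens where $c$ equals $\theta_0(1-\theta_0)$.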

Let $c^*$ be the threshold value of $c$ at which I stop paying for information: the “choke price” of truth. How does $c^*$ depend on my prior $\theta_0$? Intuitively, increasing $\theta_0$ has two competing effects:

it increases the error in my posterior estimate when $\theta=0$;
it increases my confidence that $\theta=1$.

The first effect makes me want more information, increasing $c^*$. The second effect makes me think I need less information, decreasing $c^*$. The chart below shows that the second effect dominates. The more ideological I am about the value of $\theta$, the cheaper the truth must be for me to seek it. If I’m a pure ideologue (i.e., $\theta_0=1$) then I won’t seek the truth even if it’s free.

One reason the first effect might dominate is if I care about errors when $\theta=0$ more than when $\theta=1$. For example, if $\theta$ indicates whether it will be sunny then I’d rather bring an umbrella I don’t use than be caught wearing flip-flops in the rain. I can capture that asymmetry by replacing the MSE component of my objective with a weighted version $$\newcommand{\WMSE}{\mathrm{WMSE}} \WMSE(x)=\E\left[W(\theta)\cdot\left(\theta-\hat\theta(s(x))\right)^2\right],$$ where the weighting function $$W(\theta)=\begin{cases} 1 & \text{if}\ \theta=1 \\ w & \text{if}\ \theta=0 \end{cases}$$ has $w\ge1$. Increasing $w$ nudges my optimal posterior estimate towards zero because I want to avoid being “confidently wrong” when $\theta=0$. Since $\WMSE(x)$ is concave in $x$, I still optimally pay for all the truth or none of it. But now the choke price $c^*$ at which I stop paying for the truth depends on my prior $\theta_0$ and the error weight $w$.

The chart below shows that $c^*$ is non-monotonic in $\theta_0$ when $w$ is large. This is due to the two competing effects described above. The first effect dominates when $w$ is large and my prior is low. In that case, it’s really bad to be wrong and I’m not confident I’ll be right. Whereas the second effect dominates when $w$ is large and my prior is high. In that case, I’m so confident I’ll be right that I don’t care what happens if I’m wrong.
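The non-monotonicity can be reproduced with a few lines of Python. This sketch rests on two assumptions of mine: that the estimate is chosen to minimize the weighted error, and that the optimum is at a corner so the choke price equals the weighted error with no signal (a fully informative signal leaves no error). The function names are hypothetical:

```python
def weighted_error_without_signal(prior, w):
    """Smallest weighted squared error achievable with no information (x = 0)."""
    # Errors when theta = 0 carry weight w, so the best estimate shrinks towards zero.
    estimate = prior / (prior + w * (1 - prior))
    return prior * (1 - estimate) ** 2 + w * (1 - prior) * estimate ** 2


def choke_price(prior, w):
    """Highest marginal cost at which a fully informative signal is worth buying."""
    # With x = 1 the weighted error is zero, so the signal is worth its cost c
    # whenever c is below the error it removes.
    return weighted_error_without_signal(prior, w)


priors = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0]
for w in (1, 10):
    print(w, [round(choke_price(prior, w), 3) for prior in priors])
```

With $w=1$ the choke price reduces to $\theta_0(1-\theta_0)$ and falls as the prior rises. With $w=10$ it first rises and then falls, consistent with the two competing effects.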

This example raises a philosophical question: what does it mean for the estimate to be “wrong?” For example, suppose I thought there was a 30% chance of rain. If it rained, was I wrong? What if I thought there was a 5% chance? A 95% chance? Where should I draw the line? On those questions, I recommend Michael Lewis’ discussion with Nate Silver about 17 minutes into this podcast episode.

See here for my discussion of the case when $\theta$ and $s$ are normally distributed. ↩︎

Why should academics blog?

Sun, 31 Jul 2022 00:00:00 +0000

I published my first blog post in March 2018. Since then I’ve spent countless hours planning, drafting, and editing other posts. Academics might think those hours were wasted: “Why write blog posts when you could write research papers? Blogging won’t get you citations or tenure!” But I disagree with that criticism. Blogging complements my research rather than substitutes for it. Here are seven reasons:

Blogging can lead to papers. My post on policymaking under uncertainty inspired Arthur Grimes and my paper on COVID-19 lockdowns. Blogging about nberwp meant I understood the data and context enough to write my paper on gender sorting among economists. Discussing the idea for a post with Adam Jaffe led to our paper on research funding and collaboration.
Blogging increases my idea turnover. I have lots of research ideas. Some are worth pursuing and some are dead ends. I sort ideas by “testing” them: writing down toy models or exploring relevant data. Blogging lets me run those tests quickly and casually. It also lets me share my tests with readers. They can avoid dead ends I’ve reached, or salvage ideas I’ve abandoned if they see opportunities I don’t.
Blogging promotes a creator mindset. When I encounter a new idea, one of my first thoughts is “how could I make that a blog post?” Blogging nudges me to think like a creator; to view ideas as opportunities to write something valuable. It also nudges me to focus on output as the source of value. No matter how long I spend writing posts, no one can read and benefit from them if they’re still on my computer. The goal is to publish. Academics have a similar goal.
Blogging improves my writing. It gives me practice refining my ideas, (re)structuring arguments, and thinking about my audience. Writing papers gives me similar practice, but blogging yields the benefits faster because I can write blog posts faster.
Blogging helps me learn. Most of my posts come from wanting to understand something. Sometimes it’s a problem encountered in my research (e.g., dyadic dependence or selection bias). Sometimes it’s a result from others’ research (e.g., on information gerrymandering or modeling human predictions). Sometimes it’s a technical paper (e.g., on communicating science or research incentives). Writing blog posts makes me engage with ideas and explain them in my own words.
Blogging helps me connect ideas. Many of my posts build on previous posts. Sometimes this is clear in advance (as with, e.g., my posts on stable matchings with noisy and correlated preferences). Sometimes I realize the connection between posts while writing them. I love discovering how ideas are connected—indeed I’ve blogged about that here and here—and view it as an essential research skill. Blogging helps me practice that skill.
Blogging is fun. (Yes, academics can have fun!) I enjoy thinking and writing. Blogging is a way to think and write. Most important, I can think and write about whatever I like—I don’t have to focus on topics that academics care about. I can blog about birds, gift exchanges, and running negative splits. I can even blog about Pokémon! And I get the benefits of thinking and writing without the pressure of academic evaluation.

What's it like living in America?

Tue, 26 Jul 2022 00:00:00 +0000

Last month I visited New Zealand for the first time since moving to the USA. Lots of people asked what living in the USA is like. Here’s what I told them:

It’s always sunny in Palo Alto

I live in Palo Alto, California. It’s near San Francisco and part of Silicon Valley. Palo Alto is officially a “city,” but it feels suburban: the streets are clean, there are trees everywhere, and most buildings are one or two stories.

Palo Alto has two main attractions. One is the weather: the air is warm and dry, it seldom rains, and it never snows. I don’t feel guilty about spending a nice day inside because every day is a nice day. (In fact I have the opposite problem: I spend too much time outside running and cycling, and too little inside being productive.)

The other attraction is Stanford University, where I study. Most of my interactions are with students and professors. Yet Palo Alto doesn’t feel like a university town: it’s easier to find ice cream than beer, and everything is expensive. The rent on my studio apartment here is more than what I paid for a two-bedroom apartment in the middle of Wellington.

Palo Alto is culturally diverse. I hear foreign languages daily. Most of my friends here are from South America or Europe. Despite being used to hearing different accents, people have trouble with mine: they often think I’m named “Bin.” (We all feel embarrassed when I have to spell my name, one of the simplest in the English language.)

In contrast, Palo Altoans seem politically homogeneous. The Americans I know all vote Democrat; the non-Americans would if they could. But people signal their politics in different ways. Some decorate their lawns with “climate change is real” and “black lives matter” signs. Others wear masks while walking alone outside or cycling without a helmet.

People here seem aware of, and concerned about, social issues plaguing the USA. But they’re also insulated from such issues. There are no riots. There are few homeless and no (visible) guns. No one looks obese.

Clearly Palo Alto is not representative of the USA. Other areas have skyscrapers, snow, cheap drinks, Republicans, climate change deniers, and openly carried guns. I’ve visited some cities on the east and west coasts, but nowhere in the south and almost nowhere rural. So my perspective on living here is biased because my experience is biased.

But I don’t think anywhere is representative of the USA. There’s so much variety in where people live, how they behave, and what they believe. I didn’t appreciate that variety until moving here. I thought of the USA similarly to New Zealand: I thought everyone was basically the same, with minor differences in wealth and lifestyle. I thought wrong.

How’s it different?

New Zealand and the USA differ in many ways. Here are some of my observations:

Dining out

When I read a restaurant menu in New Zealand, the price I see is the price I pay. When I read one here, the price I see is about 80% of what I pay. The last 20% comprises taxes and tips. Taxes vary by product, store, and state. Tips vary by (perceived) service quality and social norms.

The norm in Palo Alto is to tip 18% of the pre-tax price. I’m not sure why the fee for shipping items from kitchen to table depends on the price of the cargo. One local menu reads:

We are a no tipping establishment. 20% service charge will be added to your bill to ensure a better living wage to our staff.

They could just raise their pre-tax prices by 20%, but then they wouldn’t get to virtue signal. At least they prompt people to multiply by 1.2 before choosing what to order.

Paying taxes

I pay income tax to the state and federal governments. I use third-party software to avoid the risk of committing fraud by mistake. That risk exists because both governments already know my taxable income. They could, like New Zealand’s tax office, just fill out my return and have me spend two minutes confirming it. But then I wouldn’t be intimidated into paying an intermediary to organize my financial data. I’m fortunate in that Stanford pays on my behalf. Others in the USA are less fortunate.

Healthcare

Stanford also pays for my health insurance. I’d hate to be uninsured: I broke my wrist last year, and my hospital and surgery fees totalled just under 100,000 USD (currently about 160,000 NZD). But I had surgery just three days after my accident; in New Zealand I’d have paid almost nothing but waited weeks. Sometimes you get what you pay for.

(As an aside: My surgeon prescribed oxycodone, a pain-relieving opiate. I paid 1.25 USD for 20 days’ worth. That payment helped me understand why the USA has an opioid epidemic.)

Talking to strangers

In New Zealand it is (mostly) socially acceptable to talk to strangers. People trust each other. If you’re approached by someone you don’t know, they probably don’t want anything from you (other than, say, directions). They usually just want to chat.

In the USA, it seems (mostly) socially unacceptable to talk to strangers. People don’t trust each other. If you’re approached by someone you don’t know, they probably want something from you. They might want to chat, but only to build rapport before advancing their agenda.

Talking generally

Americans (and other non-New Zealanders) use “how are you” as a greeting rather than a question. They don’t actually want to know; they just want you to say “good” or “fine,” and move on. Replying “bad” would make the greeter’s day worse. I hear “how are you” most often when walking past people I know. They usually don’t stop and wait for an answer. I’m still learning not to take offense.

Likewise I’m still adjusting to how Americans receive thanks. New Zealanders always reply “you’re welcome.” Americans always reply “sure” or “of course.” I find those responses dismissive and rude. They suggest my thanks were unnecessary and I’ve wasted my time offering them. But I think Americans mean something different: they want to avoid a sense of reliance, and don’t want me to think I owe them anything in return.

Scenery

Most people I’ve met know New Zealand is scenic and beautiful. Scenery in the USA—at least, the scenery I’ve seen—is not as beautiful. But the USA has something New Zealand doesn’t: scale. Redwoods are huge. Big Sur is huge. New York City’s skyline is huge. New Zealand’s main scenic attractions—glaciers, lakes, and national parks—are not as huge.

Research incentives and the evolution of knowledge

Fri, 22 Jul 2022 00:00:00 +0000

Research is a cumulative process. New discoveries build on previous discoveries: researchers “stand on the shoulders of giants.” Carnehl and Schneider (2022) embed this idea in a model of how knowledge evolves. In their model, knowledge is the set of questions with known answers and research is the process of finding answers. The model has three main features:

Existing knowledge determines the benefits and costs of research.
Answering a question sheds light on related questions.
Researchers are free to choose which questions to ask and how intensely to seek answers.

The authors first discuss the social benefit of research. They think of society as an agent who makes policy choices. These choices appear as questions: How much should we tax companies? How much should we subsidize healthcare? Society knows the answer to some questions but is uncertain about the answer to others. This uncertainty means society has to guess which policies are best. Research is beneficial insofar as it leads to better guesses. It does so through two channels:

It reveals the answer to the researched questions.
It lowers the uncertainty around answers to other questions.

Society is more certain about answers to questions that are “closer” to existing knowledge. Intuitively, knowing how much to tax companies tells you more about taxing households than about building rockets. Research removes more uncertainty for questions closer to those researched. Carnehl and Schneider measure the benefit of research as the total amount of uncertainty it removes.

Next, the authors compare the benefits of research that “deepens” and “expands” knowledge. They model questions as points on the real line and the “frontier” as the extremal points of existing knowledge. Research on questions between these extremal points deepens knowledge; research on questions beyond the frontier expands knowledge. The relative benefits of deepening and expanding depend on the gaps in existing knowledge. Deepening is more beneficial when gaps are large. This is because larger gaps leave more uncertainty to remove. Splitting a large gap into smaller gaps removes more uncertainty than creating a new gap at the frontier.

Carnehl and Schneider then consider researchers’ choices: What questions do they ask? How intensely do they seek answers? These choices depend on the private benefits and costs of research. The authors assume private benefits equal social benefits. They also assume private costs rise with search intensity and existing uncertainty. More intense searches are more likely to succeed. But, for a given likelihood, more uncertain answers need wider searches. Carnehl and Schneider characterize researchers’ optimal choices in two dimensions:

“Novelty:” how far is the chosen question from existing knowledge?
“Output:” how likely is the research to succeed?

The relationship between novelty and output depends on whether the research expands or deepens knowledge. If it expands knowledge, then novelty and output are substitutes: more novel research is always riskier. If it deepens knowledge, then whether novelty and output are substitutes depends on the size of the gap being filled. This dependence is intricate—see the paper for details.

Finally, the authors use their model to study how researchers’ choices affect how knowledge evolves. Carnehl and Schneider’s key insight is that short- and long-run choices differ. Short-lived researchers choose questions that maximize private benefits less private costs. But they don’t consider the impact their choices have on future researchers’ choices. This impact arises from lowering the uncertainty for some questions but not others. Long-lived researchers internalize the impact their choices today have on choices tomorrow. The authors show that rewarding “moonshots”—research on questions more novel than myopically optimal—can raise the present value of future knowledge.

Overall, the paper is impressive. Its introduction gives a clear summary of the main results. The model is creative and crisp. Like all good models, it focuses on one issue—the cumulative nature of research—and abstracts from others—e.g., the priority system and career concerns. The paper is also a rare theoretical contribution to the (mostly empirical) literature on the economics of science.

Carnehl and Schneider’s model could be extended to acknowledge the replication crisis. Their model assumes all research findings are certain and true. But the crisis exists because some findings are false. We discover false findings via replication studies. These studies have zero novelty, but can still be beneficial: they remove uncertainty around findings we think are true.

Allowing for uncertain findings would then help us think about replication incentives. Some economists argue they need to be stronger—see, e.g., Zimmerman (2015). But whether to incentivize replication studies depends on the benefits they offer relative to original research. If society is confident a finding is true, then replicating it may be less beneficial than producing novel findings.

Truth-seekers and ideologues

Mon, 18 Jul 2022 00:00:00 +0000

People learn socially: they get information from their friends. Research on social learning takes as given that people want to learn the truth.¹ This assumption motivates worries about online misinformation: if your friends see something wrong and share it with you, then you might believe it and be wrong too.

But people share for more reasons than learning. Sometimes we share to feel connected: to let each other know we’re not alone in what we see. We enjoy having like-minded friends who have relatable experiences and validate ours. But if we only talk to like-minded friends then it’s hard to learn the truth because no one challenges our subjective experiences of objective reality.

Thus, when forming social networks, we face a trade-off. We want friends with similar experiences because they help us feel connected. But we also want friends with different experiences because they help us learn the truth. How we resolve this trade-off depends on how much we care about the truth. If we care a lot then we should choose friends with unbiased experiences; if we don’t care at all then we should choose friends who share our biases.

Here’s a basic model to illustrate. Imagine reality is chosen by a coin toss: Heads or Tails, each with probability 0.5. There are two types of people:

“Truth-seekers” try to see the world for what it is. But they do so noisily: their experience matches reality with probability $a>0.5$.
“Ideologues” always see the world the same way: they always experience Heads.

These types represent two extremes: truth-seekers have unbiased but noisy experiences, whereas ideologues have biased but precise experiences. I choose a friend to help me win one of two games:

In the “learning” game, I win if my friend’s experience matches reality.
In the “connecting” game, I win if my friend shares my experience.

I want to maximize my chance of winning the game we play. But I don’t know which we’ll play until I’ve chosen my friend. Which type should I choose?

If I’m a truth-seeker then I’m better off choosing a truth-seeking friend.² They’re better in the learning game because they’re more likely than ideologues to experience reality. They’re also better in the connecting game because we both tend to experience reality. Our pursuit of truth makes our experiences correlated. In contrast, ideologues’ indifference to the truth makes their experience uncorrelated with mine.

Things are different if I’m an ideologue. Then my best choice depends on how likely I am to play each game. Let $p$ be the probability I play the learning game. I’m better off choosing a truth-seeking friend if and only if $p$ exceeds³ $$\overline{p}\equiv \frac{1}{2a}.$$ Intuitively, I face a trade-off: Truth-seekers are better in the learning game for the same reason as above. But now ideologues are better in the connecting game because they always share my ideological experience. This trade-off tilts in favor of truth-seekers as their accuracy $a$ rises, lowering the threshold probability $\overline{p}$.
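
The threshold can be checked numerically. The Python sketch below is an illustration rather than part of the original analysis: the function names, the type labels, and the accuracy value $a=0.75$ are assumptions chosen for the example. It computes an ideologue’s ex ante chance of winning with each type of friend and checks that the two chances coincide at $p=1/(2a)$.

```python
def ideologue_payoff(p, a, friend):
    """Ex ante chance that an ideologue wins, given their friend's type.

    p: probability of playing the learning game
    a: truth-seekers' accuracy (assumed to exceed 0.5)
    friend: "truth-seeker" or "ideologue" (illustrative labels)
    """
    if friend == "truth-seeker":
        # Learning game: the friend's experience matches reality w.p. a.
        # Connecting game: the friend experiences Heads w.p. 0.5.
        return p * a + (1 - p) * 0.5
    # Learning game: Heads matches reality w.p. 0.5.
    # Connecting game: another ideologue always shares my experience.
    return p * 0.5 + (1 - p) * 1.0


def friend_threshold(a):
    """Value of p above which a truth-seeking friend is the better choice."""
    return 1 / (2 * a)


a = 0.75  # hypothetical accuracy, for illustration only
p_bar = friend_threshold(a)
gap_at_threshold = (
    ideologue_payoff(p_bar, a, "truth-seeker")
    - ideologue_payoff(p_bar, a, "ideologue")
)
```

With this value of $a$, the gap is zero (up to floating-point error) at the threshold, positive for larger $p$, and negative for smaller $p$.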

Now suppose I can choose my own type. Should I be a truth-seeker or an ideologue? Again, my choice depends on the probability $p$ that I play the learning game. It turns out I’m better off seeking truth if and only if $p$ exceeds another threshold $\underline{p}$ that depends on $a$.⁴ This threshold has two interesting properties:

It’s positive, so if $p$ is small enough then I’m better off being an ideologue.
It’s smaller than $\overline{p}$, so if I’m better off being an ideologue then I’m also better off choosing an ideologue as my friend.

Intuitively, if the truth doesn’t matter then there’s no point seeking it. I might as well be an ideologue and choose ideological friends who always share my experience.
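
The two properties above can be checked for a particular accuracy. The sketch below is an illustration under stated assumptions: the function names and the value $a=0.75$ are mine, and each type is assumed to choose their best friend (so the ideologue’s value is the larger of their two options). It uses the threshold formula reported in the footnotes.

```python
def truth_seeker_value(p, a):
    """Truth-seeker's chance of winning with a truth-seeking friend."""
    return p * a + (1 - p) * (a**2 + (1 - a) ** 2)


def ideologue_value(p, a):
    """Ideologue's chance of winning with their better type of friend."""
    with_truth_seeker = p * a + 0.5 * (1 - p)
    with_ideologue = 0.5 * p + (1 - p)
    return max(with_truth_seeker, with_ideologue)


def type_threshold(a):
    """Value of p above which being a truth-seeker is better."""
    return 4 * a * (1 - a) / (2 * a - 1 + 4 * a * (1 - a))


a = 0.75  # hypothetical accuracy, for illustration only
p_low = type_threshold(a)   # threshold for choosing my own type
p_high = 1 / (2 * a)        # threshold for an ideologue choosing a friend
gap_at_p_low = truth_seeker_value(p_low, a) - ideologue_value(p_low, a)
```

For this $a$, the two values coincide at `p_low`, which lies strictly between zero and `p_high`.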

One can extend this model to choosing many friends with a range of accuracies and biases. Some people might be more truth-seeking than others. Some people might have correlated experiences because they get information from the same like-minded sources. These correlations determine the “experience portfolio” my friends can provide. But the goal of this portfolio—whether I want it to provide truth or connection—still depends on how much I care about learning the truth.

Indeed this assumption motivates the extensive literature on social learning “failures.” These failures arise from, e.g., unequal influence (Acemoglu et al., 2011; Golub and Jackson, 2010), network structure (Chandrasekhar et al., 2020; Dasaratha and He, 2021), herding (Banerjee, 1992; Bikhchandani et al., 1992; Smith and Sørensen, 2000), conformity (Mohseni and Williams, 2021), misinformation (Mostagir and Siderius, 2022), and misinterpretation (Frick et al., 2020). ↩︎
Choosing another truth-seeker makes me win the learning game with probability $a$ and the connecting game with probability $a^2+(1-a)^2$. Both of these probabilities exceed 0.5, the probability of winning either game if I choose an ideologue. ↩︎
If I’m an ideologue, then my ex ante chance of winning is $pa+0.5(1-p)$ if I choose a truth-seeking friend and $0.5p+(1-p)$ if I choose another ideologue. ↩︎
The exact probability is $$\underline{p}\equiv \frac{4a(1-a)}{2a-1+4a(1-a)}.$$ It comes from comparing the truth-seeker’s indirect objective $$pa+(1-p)(a^2+(1-a)^2)$$ and the ideologue’s indirect objective $$\begin{cases}pa+0.5(1-p)&\text{if}\ p\ge\overline{p}\\0.5p+(1-p)&\text{otherwise}.\end{cases}$$ These functions coincide when $p\in\{\underline{p},1\}$. ↩︎

Gender sorting among economists

Fri, 03 Jun 2022 00:00:00 +0000

I have a new paper on gender sorting in economic research teams. Here’s the abstract:

I compare the co-authorship patterns of male and female economists, using historical data on National Bureau of Economic Research working papers. Men tended to work in smaller teams than women, but co-authored more papers and so had more co-authors overall. Both men and women had more same-gender co-authors than we would expect if co-authorships were random. This was especially true for men in Macro/Finance.

I show that the NBER co-authorship network is assortatively mixed with respect to gender, and has been since the late 1980s. This could reflect explicit choices to work in same-gender teams. But it could also be a consequence of other choices (e.g., which topics to research) that lead to gender sorting. I leave this distinction open for future research.

The paper uses data from nberwp, an R package I’ve been working on since 2019. I’ve described and used the package in several blog posts:

The paper is in Economics Letters, which publishes concise papers at most 2,000 words long. This seemed appropriate for my paper: it’s longer than a blog post but shorter than an AER epic. The few words mask the many hours spent collecting and cleaning the data (e.g., manually identifying about 2,500 authors’ genders). Such is the nature of publishing empirical work.

Gender differences in publication rates within NBER programs

Sat, 28 May 2022 00:00:00 +0000

My previous post showed that NBER research programs with higher female representation tend to have fewer papers published in the “Top Five” economics journals. A reader suggested comparing Top Five publication rates among men and women within each program. This comparison reveals whether men and women publish at different rates despite writing about similar topics. Here’s the chart:

Most points lie below the dashed diagonal line. Such points represent programs in which male-authored papers are more likely to be in Top Fives than female-authored papers. This “male premium” in Top Five publication rates doesn’t appear to differ between programs in the “Micro” and “Macro/Finance” subfields defined in Davies (2022). The premium is largest for the Corporate Finance (CF) program and most negative for the Development of the American Economy (DAE) program.

How do these patterns compare to publication rates across all journals? Here’s the corresponding chart:

Looking at all journals, rather than only Top Fives, lowers the “male premium” in publication rates. It also reveals differences between subfields: some Micro programs have negative premia, but all Macro/Finance programs have positive premia.

What explains these patterns? Here are two theories:

Women submit papers to Top Fives less often. This would be consistent with evidence that women shy away from competition relative to equally competent men (see, e.g., Niederle and Vesterlund, 2011).
Top Five referees and editors discriminate against women. This would be consistent with evidence that women are held to higher editorial standards (Card et al., 2020; Hengel, 2017).

Unfortunately I can’t test these theories with my data. I observe publication outcomes, but not journal submissions or referee/editor biases. And the two theories aren’t mutually exclusive: women may submit less often because they anticipate discrimination.

Publication outcomes of NBER working papers

Tue, 17 May 2022 00:00:00 +0000

The latest version of nberwp (1.2.0) contains information on where NBER working papers are published:

Outlet	Papers	Share (%)
Top Five journals	3,832	12.7
Other journals	14,792	49.2
Book/chapters	3,096	10.3
Unpublished	8,363	27.8

About 62% of working papers are published or forthcoming in peer-reviewed journals. One in five of these papers are in the “Top Five:” the American Economic Review, Econometrica, the Journal of Political Economy, the Quarterly Journal of Economics, and the Review of Economic Studies. These journals are the tallest peaks in the world of economic research. Publishing in them can be vital for career progression.
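
The shares quoted above follow directly from the counts in the table. The Python sketch below reproduces that arithmetic; the variable names are illustrative and the counts are copied from the table rather than loaded from nberwp.

```python
# Paper counts by publication outlet, copied from the table above.
counts = {
    "Top Five journals": 3832,
    "Other journals": 14792,
    "Book/chapters": 3096,
    "Unpublished": 8363,
}

total = sum(counts.values())

# Share of all working papers in each outlet, in percent.
shares = {outlet: 100 * n / total for outlet, n in counts.items()}

# Papers published or forthcoming in peer-reviewed journals.
journal_papers = counts["Top Five journals"] + counts["Other journals"]
journal_share = 100 * journal_papers / total

# Fraction of journal papers that are in the Top Five.
top_five_among_journals = counts["Top Five journals"] / journal_papers
```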

The chart below counts papers by decade and publication outcome. As the number of NBER working papers grew, so did the number appearing in journals and the Top Five. Yet the space available in Top Fives was relatively constant between the 1970s and 2010s (Card and DellaVigna, 2013). NBER working papers occupied an increasing share of that space.

Why are so many NBER working papers in the Top Five? Here are four possible reasons:

The NBER working paper series is among the most read series in economics. More readers means more feedback, which helps authors improve their papers and make them Top Five-worthy.
Each paper has an NBER-affiliated author. “Affiliates are selected through a rigorous and competitive process” (see here). This process may select authors more willing and able to pursue Top Five publications.
NBER working papers tend to apply cutting-edge methods to policy-relevant issues. This makes papers attractive to Top Five editors, who want to publish frontier, impactful research.
Top Five editors tend to be NBER affiliates. Club co-membership might help authors during peer-review.

Gender differences

nberwp contains information on author genders, so we can compare the representation of women among papers with different publication outcomes. Here’s one approach:

Compute the fraction of authors on each paper who were women.
Sum these fractions across all papers.
Divide by the number of papers.

These three steps deliver an estimate of the share of papers written by women. This estimate equals 16.5% across all NBER working papers. The chart below separates by decade and publication outcome. Female representation grew over time, both overall and among papers published in journals. But the growth was slower among papers published in the Top Five. Women were consistently less represented among papers published in the Top Five than among other papers. Overall, only 12.5% of NBER working papers in the Top Five were written by women.
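
The three-step estimator described above can be sketched in a few lines of Python. This is an illustration on made-up data, not the code used for the post: the function name, the "F"/"M" labels, and the toy papers are all hypothetical, and the real data live in the nberwp R package.

```python
def female_share(papers):
    """Average, across papers, of the fraction of each paper's authors who are women.

    papers: list of papers, each a non-empty list of author genders ("F" or "M").
    """
    # Step 1: fraction of authors on each paper who were women.
    fractions = [
        sum(1 for gender in authors if gender == "F") / len(authors)
        for authors in papers
    ]
    # Steps 2 and 3: sum the fractions and divide by the number of papers.
    return sum(fractions) / len(fractions)


# Hypothetical papers: two co-authored, two solo-authored.
toy_papers = [["F", "M"], ["M"], ["M", "M", "F", "M"], ["F"]]
toy_share = female_share(toy_papers)  # (0.5 + 0 + 0.25 + 1) / 4
```

Weighting each paper equally means a woman on a two-author paper contributes half a paper, whereas a woman on a four-author paper contributes a quarter.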

What explains the relative gender gap for papers in the Top Five? Perhaps it reflects what men and women write about. One way to explore this is to compare female representation and Top Five publication rates across the NBER’s research programs, which “correspond loosely to traditional field[s] of study within economics.” I present that comparison in the chart below.¹ The horizontal axis measures female representation using the estimator defined above; the vertical axis measures the share of papers in each program published in the Top Five.

Programs with lower female representation tend to have proportionally more papers in the Top Five. The Monetary Economics (ME) program, which has the lowest female representation, has proportionally more papers in the Top Five than the program on Children (CH), which has the highest female representation. Papers in the Economic Fluctuations and Growth (EFG) program tend to focus on “big picture” questions and often land in Top Fives. Papers in the Health Economics (HE) program tend to focus on more specific questions, and often land in field or medical journals. But papers in the HE program are about three times as likely to be written by women as are papers in the EFG program. This difference in likelihoods contributes to lower female representation among NBER working papers published in the Top Five.

But why are the likelihoods different? Why do proportionally fewer women write papers on growth than on children? Perhaps this reflects what men and women enjoy researching. But, again, publishing in the Top Five can be vital for career progression. So, at the margin, I’d expect researchers to choose topics more likely to land in Top Five journals. These choices do not appear in my data. I’m interested to learn more—reach out if you are too.

I compare publication rates among men and women within each program here. ↩︎

Judging economic models

Mon, 16 May 2022 00:00:00 +0000

Lots of people criticize economic models for being unrealistic. “Humans are irrational,” they cry; “financial markets are inefficient.” These criticisms are valid, but they miss the point. Models aren’t meant to be realistic. They’re meant to simplify reality: to focus our attention on what’s relevant and abstract from what isn’t. All models are wrong—we shouldn’t judge them for their realism.

How should we judge a model? Here are two criteria:

it makes predictions that agree with data;
it helps us think clearly about how the world works.

Economists use models to generate predictions, such as “people buy less when prices rise.” We test these predictions using data from the real world. When the predictions and data disagree, we reject the model and search for something better. This search leads to new models with new predictions. Under the first criterion, “better” models make more true predictions.

Model predictions come in different forms. “Within-sample” predictions tell us about data we’ve seen; “out-of-sample” predictions tell us what to expect in data we haven’t seen.

We test within-sample predictions by asking if the model “fits” the data it was designed to explain. Bad models fail this test. But useless models can pass it. For example, suppose I have a list of quantity-price pairs. I use the list as my “model” of demand. My model fits the data because the data fit the data. But my model says nothing about why people buy a given quantity at a given price. It also says nothing about how much people buy at other prices.

Hence, we also test out-of-sample predictions. We ask if the model fits relevant data it wasn’t designed to explain. This helps us learn whether the model captures general principles rather than contextual quirks. It also helps us be logically consistent. For example, suppose I want to explain some pattern Y. I write down a model in which I assume behavior X, which implies Y. But X also implies pattern Z. Do I think Z is reasonable? Do I observe it empirically? If not, then I should revise my model and not assume X. Writing down the model makes my assumptions explicit and easier to correct.

The second criterion makes room for some models with false predictions. The efficient market hypothesis is a good example. It predicts that you can’t use public information to “beat the market.” This prediction is false—RenTech offers one counter-example. But the EMH helps us organize our thoughts about how, when, and why prices incorporate information. It guides our intuitions. It also provides a benchmark against which to compare models of inefficient markets.

Another benchmark model is that of DeGroot learning. Its main prediction—“under some conditions, society reaches a consensus eventually”—is hard to test because “eventually” never arrives. But the model offers a tractable (and surprisingly realistic) way to study how people learn. We can enrich the model by adding homophily or misinformation. These additions make the model more realistic but more complex. Having a benchmark helps us assess whether the extra realism is “worth” the extra complexity (e.g., by adding explanatory power).

Sometimes the realism is worth the complexity. This is especially true when we use models to help us design new systems. As Jackson (2019) notes,

One would never design a large airliner without carefully modeling its aeronautic properties, and testing it thoroughly via simulations and test flights of prototypes, before loading it with passengers. Why should designing a market for health insurance be any different? Models have the virtue of offering us insight in to what should we expect in scenarios that have never been tried before.

Models give us prototypes to test. They let us run theoretical experiments when “real” experiments are expensive or infeasible. They guide our search for better designs. The “best” design might be complicated because reality is complicated. Ignoring some complications in the model may lead us astray. But we don’t need all the complications: health insurance markets don’t depend critically on whether I buy blue jeans or black.

Moreover, when designing new systems, the object of interest is the design rather than the model of it. The model is just a tool. We use it to focus on relevant factors and abstract from irrelevant factors. Different models arise from making different choices about which factors are relevant. Our job as economic modelers is to make those choices well.

Echo chambers can be useful

Fri, 08 Apr 2022 00:00:00 +0000

Talking to lots of people who know different things helps us learn. Yet many of us sort into echo chambers, only talking to a few like-minded people. Doesn’t this hinder learning?

Jann and Schottmüller (2021) answer: “not always.” Different people know and want different things. These differences give us persuasion temptations: we tell selective stories to influence others’ behavior. We don’t share everything we know. Sorting into echo chambers removes our persuasion temptations. This leads to more sharing and learning.

The authors formalize this idea as follows: Each agent has a binary bit of information. Summing these bits gives the “state.” Agents take actions based on (i) their individual biases and (ii) their beliefs about the state. They want other agents to take similar actions. Biases are common knowledge.

Agents learn about the state by talking to each other. But before they talk, agents sort into “rooms.” Agents only talk to people in their room. They choose what to say based on how it influences their roommates’ actions. They either

share their bit (i.e., tell the truth),
share one minus their bit (i.e., lie), or
share a zero or one randomly.

Agents can “babble” by sharing a zero or one independently of their bit. For example, they could flip a coin and share a one if it lands on heads. Babbling is uninformative.
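
To see why babbling is uninformative, consider a listener who knows the sender’s strategy and updates by Bayes’ rule. The Python sketch below is an illustration, not the authors’ model: the function name, the strategy labels, and the prior of 1/3 are assumptions, and it covers only truth-telling and babbling with a fair coin.

```python
from fractions import Fraction


def posterior_bit_is_one(message, strategy, prior):
    """P(bit = 1 | message) when the listener knows the sender's strategy.

    message: the shared value (0 or 1)
    strategy: "truth" or "babble" (illustrative labels)
    prior: the listener's prior probability that the bit is 1
    """
    def prob_message_given_bit(bit):
        if strategy == "truth":
            # A truthful sender shares exactly their bit.
            return Fraction(int(message == bit))
        if strategy == "babble":
            # A babbler shares 0 or 1 with equal chance, whatever their bit.
            return Fraction(1, 2)
        raise ValueError("unknown strategy")

    numerator = prior * prob_message_given_bit(1)
    denominator = numerator + (1 - prior) * prob_message_given_bit(0)
    return numerator / denominator


prior = Fraction(1, 3)  # hypothetical prior, for illustration only
after_truth = posterior_bit_is_one(1, "truth", prior)
after_babble = posterior_bit_is_one(1, "babble", prior)
```

A truthful message moves the listener’s belief all the way to certainty, whereas a babbled message leaves it at the prior.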

Jann and Schottmüller first study the most informative equilibrium of the bit-sharing game played in each room. In this equilibrium, everyone tells the truth or babbles. Agents close to the mean bias among their roommates tell the truth. Agents far from that mean babble.

The authors then study how agents choose rooms. These choices anticipate how agents talk in each room. In equilibrium, no agent wants to change rooms on their own. This equilibrium is “welfare-optimal” if it leads to more truth-telling than any other equilibrium.

Whether the equilibrium room choices are welfare-optimal depends on agents’ biases. For example, suppose agents are polarized: their biases take one of two values with equal probability. The difference between these values captures the level of polarization. The authors show that

full segregation is welfare-optimal if polarization is high enough, and
full integration is welfare-optimal if polarization is low enough.

If polarization is high enough then having opposite-minded agents in the same room creates persuasion temptations. These temptations lead to babbling in the most informative equilibrium. Full segregation prevents babbling. On the other hand, if polarization is low enough then no-one has persuasion temptations and so no-one babbles. Then having everyone in the same room is welfare-optimal.

More polarization leads to less bit sharing in equilibrium. This is because polarization creates persuasion temptations. Segregation actually removes some of these temptations. But it can’t restore communication between agents who moved to different rooms. Thus, polarization lowers welfare despite segregation, rather than because of it. The authors summarize this point nicely:

“One could think of echo chambers as society’s (decentralized) defense mechanism against polarization. Like fever in a human body, segregation occurs as the effect of an underlying problem, and its presence hence indicates that polarization is at problematic levels. Echo chambers, and segregation more generally, are hence a symptom of polarization. And just like artificially lowering fever, treating the symptom without addressing the cause can in fact exacerbate the situation. Reducing polarization will weakly improve welfare; reducing segregation may not.”

The authors go on to study extensions of their model. For example, they show that adding public information can crowd out incentives to tell the truth. They also show that their model agrees with data from Twitter. The authors suggest that social media platforms do more than connect people: they provide infrastructure for efficient segregation.

Jann and Schottmüller close by calling for more nuanced discussion of echo chambers: Yes, they limit the diversity of whom we meet and talk to. But

“there is simply no use in meeting people with a very diverse set of opinions and very useful information, if there is no way to get that information out of them.”

Tolerance and compromise in social networks

Fri, 01 Apr 2022 00:00:00 +0000

Everyone has beliefs about how they should behave. But people differ in their beliefs. They also differ in their tolerance of others’ beliefs. These differences affect who become friends. Some people “stick to their guns” and befriend only those who agree. Others are more tolerant and befriend others who disagree. Such people are more willing to compromise, changing their behavior to accommodate friends’ beliefs.

Genicot (2022) studies tolerance and compromise in social networks. She describes a finite population of agents with different “ideal” actions. Agents prefer taking their ideal actions. They also prefer friends who take their ideal actions. An agent’s “tolerance” is the largest deviation from their ideal they can accept in a friend’s action.

Agents take actions before making friends. An agent “compromises” if they take an action different from their ideal. Compromise is costly but may lead to beneficial friendships. Agents weigh these costs and benefits when taking actions. Genicot studies the equilibrium in which no-one wants to change their action.

If everyone has the same tolerance then no-one compromises. The reason is as follows: No-one wants to compromise more than is necessary for their friends’ acceptance. Thus, anyone who compromises must do so “minimally” for at least one friend. This friend must also compromise because tolerances are equal. In fact, they must compromise more to make the friendship net beneficial. But then they must have another friend who compromises even more. We can keep applying this argument to find agents who compromise more and more, which is impossible because the population is finite.

Compromise thus depends on differences in tolerance. Agents compromise by deviating from their ideals towards the ideals of relatively intolerant friends. Some compromises are one-sided, where the intolerant friend stands their ground. Other compromises are two-sided. Two-sided compromises rely on intolerant “bridge” agents, who bring their tolerant friends’ actions close enough together to be mutually acceptable.

Compromise also depends on how tolerances and ideals covary. If agents with “extreme” ideals are less tolerant then two-sided compromise is impossible. This is because agents compromise towards intolerant extremists. Consequently, actions tend to be more polarized than ideals. In contrast, if extremists are more tolerant then agents compromise towards the median. This makes the population more connected.

Genicot interprets these results in light of recent political trends. She cites evidence of intolerance among liberals and conservatives, and of rising polarization in the United States. These patterns are consistent with Genicot’s model. If people want to make friends, but making friends requires compromise towards extremes, then people will behave more extremely.

Genicot closes with guidance on finding tolerant people:

Looking at the identity of the members of a person’s social network may overestimate the tolerance exhibited by the person. The distance between a person’s identity and her friends’ behaviors would likely tell us more about her tolerance.

Tolerance isn’t about having diverse friends; it’s about not forcing friends to accommodate your beliefs.

Persuading with anecdotes

Fri, 18 Mar 2022 00:00:00 +0000

My previous post explained why rational people can prefer like-minded information sources. This preference leads media outlets to compete by targeting biased audiences. Such targeting can take (at least) two forms:

presenting content in a way some people like and others don’t;
only sharing content that some people like and others don’t.

Haghtalab et al. (2021) study the second form. They consider a pair of Bayesian agents called Sender (he) and Receiver (she). Both agents take actions (e.g., get vaccinated) based on their beliefs about an unknown state (e.g., whether vaccines are effective) and their “moral stance.” Sender observes some noisy signals about the state before taking his action. He sends one of those signals to Receiver before she takes her action. Sender’s “communication scheme” determines which signal he sends. He chooses this scheme knowing his and Receiver’s moral stances, but before observing any signals.

Sender wants Receiver to take the same action as him. He chooses the scheme that minimizes the mean distance (across signal realizations) between his and Receiver’s actions. This distance has three components:

A “signalling loss” from sending one signal rather than many;
A “persuasion temptation” from wanting Receiver to take the same action;
An unavoidable loss from differences in moral stances.

If Receiver knows the communication scheme then Sender just minimizes the signalling loss. This is because Receiver can “undo” any bias in the scheme, so persuasion is impossible. But if Receiver doesn’t know the scheme then Sender trades off the signalling loss and the persuasion temptation. This makes both agents worse off because the signal sent is less informative.

Suppose the signal distribution is “well-behaved” (e.g., single-peaked) and Receiver knows the communication scheme. If Sender observes enough signals then he always prefers unbiased schemes. Intuitively, Sender wants to send all the signals he observes but can only send one. He sends the “most representative” signal: the one closest to the mean. But this logic breaks down when Sender observes too few signals. In that case, the mean signal is noisy and extreme signals can be more informative. This can make Receiver prefer biased schemes.

Now suppose Receiver doesn’t know the communication scheme. Then Sender chooses more biased schemes when he observes more signals. He does so because of his persuasion temptation. Again, this makes both agents worse off. So Receiver prefers when Sender shares her moral stance because then their incentives are aligned.

This preference for like-mindedness also depends on the number and distribution of signals Sender observes. For example, suppose Receiver chooses between

an expert with many signals but a different moral stance, and
a layperson with one signal but the same moral stance.

Receiver prefers the layperson when the signal distribution is thick-tailed. This is because the expert observes more signals “in the tail,” so they send a more extreme (and, thus, less informative) signal due to their persuasion temptation.

It may seem restrictive that Sender shares a raw signal rather than, say, his posterior estimate of the state. But such sharing reflects how real people talk to each other. Real people don’t trade summary statistics on vaccination outcomes. Instead, they trade anecdotes like “I felt tired and achy after my booster shot.” News outlets do the same: they typically report on individual events rather than aggregate patterns. (When was the last time you saw a base rate in the news?) These anecdotes in conversation and events in the news correspond to signals in the authors’ model.

Ideological bias and trust in information sources

Wed, 09 Mar 2022 00:00:00 +0000

If people were Bayesian, then giving them more information would help them learn the truth and reach consensus. But most people aren’t Bayesian. They can have, e.g., confirmation bias or limited memory. These cognitive “errors” can lead people with access to lots of information to disagree.

Gentzkow, Wong and Zhang (2021) show that such errors are not necessary for disagreement. The authors consider Bayesian agents with access to some information sources. Agents don’t know which sources they can trust. They learn to trust sources that are consistent with their personal experiences. Variation in experiences can lead agents to disagree, even as the number of sources grows.

In Gentzkow et al.’s model, sources send noisy signals about many “states.” States represent objective facts about different issues, such as mask effectiveness or the extent of global warming. States vary in their “ideological valence:” how favorable they are to liberals or conservatives. Sources vary in their accuracy (i.e., signals’ correlation with states) and ideological bias (i.e., signals’ correlation with ideological valences). Agents want to learn sources’ accuracies and biases, which are constant across issues.

Agents learn by comparing signals to their personal experience, such as friends’ disease outcomes or local weather events. Experiences, like sources, vary in their accuracy and ideological bias (due to, e.g., choosing like-minded friends). However, agents believe their experience is unbiased. This belief gives each agent a baseline against which to compare signals. Different agents have different baselines, leading to different inferences from the signals they receive.

The authors show that biased agents prefer like-minded sources. When comparing sources with the same accuracy but opposite biases, agents think the source sharing their bias is more accurate. Agents also under-estimate the bias of like-minded sources and think unbiased sources are opposite-minded. These patterns stem from agents’ dogmatic beliefs that their experiences are unbiased. Agents learn the truth if and only if their experiences really are unbiased.

The authors also show that biases in experiences can lead to disagreements about states. Suppose the biases in two agents’ experiences have equal magnitudes but opposite signs. As the magnitude grows, the agents become more likely to disagree. Having more sources doesn’t always help. It can actually lead to more disagreement because agents can combine sources to construct a maximally like-minded composite.

This demand for like-minded sources affects media market outcomes. People devote their attention to media outlets they trust. Outlets profit from capturing people’s attention. The authors show that monopolists maximize profit by offering accurate and unbiased information, whereas competing outlets differentiate by targeting biased audiences.

All of these results rely on some technical assumptions. For example, agents only see normally distributed data. This makes the math (relatively) easy but limits generality. I don’t mind those assumptions because they lead to clear, testable hypotheses about why people disagree. What remains is to test them.

Pre-screening evidence

Wed, 02 Mar 2022 00:00:00 +0000

Cheng and Hsiaw (2022) study an agent who wants to learn about a binary state (e.g., whether a vaccine is safe). An information source (e.g., Fauci or a Twitter thread) sends noisy signals about the state. But the agent doesn’t know whether the source is “credible.” They have to infer credibility from the signals received.

The authors compare two types of agents: Bayesians and “pre-screeners.” Both types respond to new evidence in two steps:

Update beliefs about whether the source is credible.
Update beliefs about the state, weighing the evidence by its credibility.

The two types differ in the second step. Whereas Bayesians use their prior beliefs about credibility, pre-screeners use their updated beliefs. Pre-screeners “double-dip” the evidence: once to evaluate its credibility, and again to evaluate its likelihood given its credibility. Bayesians never double-dip: they evaluate credibility and likelihood independently.

Bayesians and pre-screeners can have different responses to the same evidence. For example, suppose an agent thinks (i) they have COVID-19 and (ii) their testing procedure is accurate (i.e., credible). The procedure is accurate, but they actually don’t have the virus. They take a test; it returns “negative.” Surprised, they take another test; “negative.” They keep taking tests; the tests keep returning “negative.”

Suppose the agent is Bayesian. The first result makes them think the testing procedure is inaccurate. But they evaluate the first result using their prior belief about accuracy, which is that the procedure is accurate. Consequently, they weaken their belief in having the virus. This makes the second result less surprising than the first. That result weakens the agent’s belief further. Eventually, the agent stops being surprised: they gradually learn the procedure is accurate and they don’t have the virus.

Now suppose the agent pre-screens. The first result makes them think the testing procedure is inaccurate. They evaluate the first result using their updated belief about accuracy. Consequently, they strengthen their belief in having the virus. This makes the second result less surprising than the first: the agent expects an inaccurate result and, from their perspective, gets one. They strengthen their belief further. Eventually, the agent wonders, “if the procedure was inaccurate then it wouldn’t keep returning the same result. Perhaps it is accurate after all?” The agent then evaluates all the results as though the procedure is accurate, weakening their belief in having the virus sharply. Suddenly, the agent learns the procedure is accurate and they don’t have the virus.

The Bayesian and pre-screener reach the same conclusion in different ways. The Bayesian learns gradually because they evaluate each result independently. The pre-screener learns suddenly because they evaluate the entire history of results as though they knew the testing procedure was accurate all along. Cheng and Hsiaw show generally that, so long as signals tend to agree with the state, pre-screeners learn the truth eventually if Bayesians do too.

But “eventually” can mean “after an unhelpfully long time.” In the meantime, pre-screeners and Bayesians can disagree about the state. They do so because they disagree about credibility. Pre-screeners either “over-trust” or “under-trust” the source relative to Bayesians. Over-trust leads pre-screeners to think the state favored by the evidence is more likely than Bayesians think. Under-trust has the opposite effect.

Cheng and Hsiaw call this pattern “correlated disagreement:” pre-screeners’ beliefs tend to agree with whether they think sources supporting those beliefs are credible. For example, imagine collecting people’s opinions on (i) whether vaccines are safe and (ii) the credibility of sources saying vaccines are safe. If people pre-screen then their opinions on (i) and (ii) should be positively correlated.

Correlated disagreement is one testable prediction of Cheng and Hsiaw’s model. Another prediction is “first impression bias:” pre-screeners are more likely to think a source is credible if its first few signals agree with each other. Bayesians have no such bias because their final beliefs don’t depend on which signals they see first.

A third prediction concerns how pre-screeners react to new evidence. They over-react if the evidence confirms their priors and they think the source is credible. They under-react if the evidence contradicts their priors or they think the source is not credible.

Cheng and Hsiaw also discuss how such reactions influence asset prices. Disagreements over credibility (e.g., of financial reports) lead to disagreements over fundamental values. These disagreements lead to speculation: people buy assets hoping to cash in on others’ over-confidence. Speculation raises asset prices. Eventually, disagreements over credibility disappear, investors wise up, and prices come crashing down.

Cheng and Hsiaw argue that disagreement comes from people not being Bayesian. In contrast, Gentzkow, Wong and Zhang (2021) argue that disagreement can arise even if people are Bayesian. I summarize their argument here.

Communicating science

Wed, 23 Feb 2022 00:00:00 +0000

Science is hard, and communicating it to a broad audience is even harder. I don’t envy Anthony Fauci or his colleagues, who must summarize the science on vaccines to a range of parties with a range of prior beliefs.

What does it mean to communicate science “optimally?” Andrews and Shapiro (2021) offer some guidance. They consider an analyst who sends an audience a report about some data. Audience members vary in their beliefs and objectives, and so vary in their reactions to a given report. The analyst chooses a report that maximizes audience members’ welfare given their reactions.

Andrews and Shapiro compare two models:

In the “communication model,” the analyst provides information and lets audience members take their preferred decision given that information.
In the “decision model,” the analyst takes a decision on audience members’ behalf.

These two models generally have different optimal reporting rules. For example, suppose the analyst has experimental data on a new drug. Their audience is a range of governments, who want to subsidize the drug if its effect is positive and tax it if its effect is negative. Everyone knows the true effect is non-negative, so taxing is never optimal. But the analyst may estimate a negative effect due to sampling error in the experiment. Under the decision model, the analyst optimally censors negative estimates because imposing a tax is worse than doing nothing. Conversely, under the communication model, censoring is never optimal because it throws away information about effect size.

In this example, the analyst optimally reports a sufficient statistic for the effect size (e.g., the mean outcomes within the experiment’s treatment and control groups). In fact, reporting a sufficient statistic is always optimal under the communication model.

The communication and decision models can even have different admissible reporting rules. For example, suppose the analyst has data on (true) treatment effects for many drugs. Their audience is a range of physicians, who want to give the best drug to their patients. Every physician believes that any drug is better than none (e.g., because patients can’t self-prescribe). The analyst considers two reporting rules:

Choose randomly among the drugs with the largest effect.
If all drugs have the same effect then do nothing; otherwise, use rule 1.

Every physician prefers some drug to none, so doing nothing is never optimal. Consequently, the first rule always dominates the second under the decision model. But physicians can reconstruct the first rule from the second, so the second rule is (weakly) more informative. Consequently, it always dominates the first rule under the communication model.

Andrews and Shapiro discuss more features of the two models, such as what happens when the analyst puts different weights on different audience members’ welfare. The authors also discuss implications of their analysis for research practice, such as for reporting estimates of structural economic models.

One thing Andrews and Shapiro don’t discuss is what happens when the audience is boundedly rational. Audience members may find it hard to process information—hence getting the analyst to process it for them—due to cognitive or emotional costs. Such costs make the audience rationally inattentive. Bloedel and Segal (2021) study optimal communication to a rationally inattentive audience, but use the language of Bayesian persuasion (Kamenica and Gentzkow, 2011) rather than statistical decision theory.

Another missing discussion is what happens when the audience doesn’t trust the analyst. Suppose some audience members believe the analyst lies or suppresses truths for conspiratorial reasons. How should the analyst respond to this belief? How should they trade off the cognitive costs induced by providing information with the conspiracy theories induced by suppressing it? This trade-off is both deliciously complicated and faced by real-world science communicators. Again, I do not envy them!

Assortativity and correlation coefficients

Thu, 17 Feb 2022 00:00:00 +0000

This is a technical follow-up to a previous post on assortative mixing in networks. In a footnote, I claimed that Newman’s (2003) assortativity coefficient equals the Pearson correlation coefficient when there are two possible node types. This post proves that claim.

Notation

Consider an undirected network $N$ in which each node has a type belonging to a (finite) set $T$. The assortativity coefficient is defined as $$r=\frac{\sum_{t\in T}x_{tt}-\sum_{t\in T}y_t^2}{1-\sum_{t\in T}y_t^2},$$ where $x_{st}$ is the proportion of edges joining nodes of type $s$ to nodes of type $t$, and where $$y_t=\sum_{s\in T}x_{st}$$ is the proportion of edges incident with nodes of type $t$. The Pearson correlation of adjacent nodes’ types is given by $$\DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Var}{Var} \rho=\frac{\Cov(t_i,t_j)}{\sqrt{\Var(t_i)\Var(t_j)}},$$ where $t_i\in T$ and $t_j\in T$ are the types of nodes $i$ and $j$, and where (co)variances are computed with respect to the frequency at which nodes of type $t_i$ and $t_j$ are adjacent in $N$.

Proof

Let $T=\{a,b\}\subset\mathbb{R}$ with $a\not=b$. I show that the correlation coefficient $\rho$ and assortativity coefficient $r$ can be expressed as the same function of $y_a$ and $x_{ab}$, implying $\rho=r$.

Consider $\rho$. It can be understood by presenting the mixing matrix $X=(x_{st})$ in tabular form:

$t_i$	$t_j$	$x_{t_it_j}$
$a$	$a$	$x_{aa}$
$a$	$b$	$x_{ab}$
$b$	$a$	$x_{ba}$
$b$	$b$	$x_{bb}$

The first two columns enumerate the possible type pairs $(t_i,t_j)$ and the third column stores the proportion of adjacent node pairs $(i,j)$ with each type pair. This third column defines the joint distribution of types across adjacent nodes. Thus $\rho$ equals the correlation of the first two columns, weighted by the third column. (Here $x_{ab}=x_{ba}$ since $N$ is undirected.) Now $t_i$ has mean $$\DeclareMathOperator{\E}{E} \begin{aligned} \E[t_i] &= x_{aa}a+x_{ab}a+x_{ba}b+x_{bb}b \\ &= y_aa+y_bb \end{aligned}$$ and second moment $$\begin{aligned} \E[t_i^2] &= x_{aa}a^2+x_{ab}a^2+x_{ba}b^2+x_{bb}b^2 \\ &= y_aa^2+y_bb^2, \end{aligned}$$ and similar calculations reveal $\E[t_j]=\E[t_i]$ and $\E[t_j^2]=\E[t_i^2]$. Thus $t_i$ has variance $$\begin{aligned} \Var(t_i) &= \E[t_i^2]-\E[t_i]^2 \\ &= y_aa^2+y_bb^2-(y_aa+y_bb)^2 \\ &= y_a(1-y_a)a^2+y_b(1-y_b)b^2-2y_ay_bab \end{aligned}$$ and similarly $\Var(t_j)=\Var(t_i)$. We can simplify this expression for the variance by noticing that $$x_{aa}+x_{ab}+x_{ba}+x_{bb}=1,$$ which implies $$\begin{aligned} y_b &= x_{ab}+x_{bb} \\ &= 1-x_{aa}-x_{ba} \\ &= 1-y_a \end{aligned}$$ and therefore $$\begin{aligned} \Var(t_i) &= y_a(1-y_a)a^2+(1-y_a)y_ab^2-2y_a(1-y_a)ab \\ &= y_a(1-y_a)(a-b)^2. \end{aligned}$$ We next express the covariance $\Cov(t_i,t_j)=\E[t_it_j]-\E[t_i]\E[t_j]$ in terms of $y_a$ and $x_{ab}$. Now $$\begin{aligned} \E[t_it_j] &= x_{aa}a^2+x_{ab}ab+x_{ba}ab+x_{bb}b^2 \\ &= (y_a-x_{ab})a^2+2x_{ab}ab+(y_b-x_{ab})b^2 \\ &= y_aa^2+y_bb^2-x_{ab}(a-b)^2 \end{aligned}$$ because $x_{ab}=x_{ba}$. It follows that $$\begin{aligned} \Cov(t_i,t_j) &= y_aa^2+y_bb^2-x_{ab}(a-b)^2-(y_aa+y_bb)^2 \\ &= y_a(1-y_a)a^2+y_b(1-y_b)b^2-2y_ay_bab-x_{ab}(a-b)^2 \\ &= y_a(1-y_a)(a-b)^2-x_{ab}(a-b)^2, \end{aligned}$$ where the last line uses the fact that $y_b=1-y_a$. 
Putting everything together, we have $$\begin{aligned} \rho &= \frac{\Cov(t_i,t_j)}{\sqrt{\Var(t_i)\Var(t_j)}} \\ &= \frac{y_a(1-y_a)-x_{ab}}{y_a(1-y_a)}, \end{aligned}$$ a function of $y_a$ and $x_{ab}$.

Now consider $r$. Its numerator equals $$\begin{aligned} \sum_{t\in T}x_{tt}-\sum_{t\in T}y_t^2 &= x_{aa}+x_{bb}-y_a^2-y_b^2 \\ &= (y_a-x_{ab})+(y_b-x_{ab})-y_a^2-y_b^2 \\ &= y_a(1-y_a)+y_b(1-y_b)-2x_{ab} \\ &\overset{\star}{=} 2y_a(1-y_a)-2x_{ab} \end{aligned}$$ and its denominator equals $$\begin{aligned} 1-\sum_{t\in T}y_t^2 &= 1-y_a^2-y_b^2 \\ &\overset{\star\star}{=} 1-y_a^2-(1-y_a)^2 \\ &= 2y_a(1-y_a), \end{aligned}$$ where $\star$ and $\star\star$ both use the fact that $y_b=1-y_a$. Thus $$r=\frac{y_a(1-y_a)-x_{ab}}{y_a(1-y_a)},$$ the same function of $y_a$ and $x_{ab}$, and so $\rho=r$ as claimed.

Writing $\rho=r$ in terms of $y_a$ and $x_{ab}$ makes it easy to check the boundary cases: if there are no within-type edges then $y_a=x_{ab}=1/2$ and so $\rho=r=-1$; if there are no between-type edges then $x_{ab}=0$ and so $\rho=r=1$.
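The identity can also be checked numerically. The sketch below is only an illustration: the mixing matrix and the type values $a=2$ and $b=5$ are arbitrary choices, and the function names are hypothetical. It computes $r$ from its definition, $\rho$ as a weighted correlation, and the closed-form expression derived above.

```python
from math import sqrt

def assortativity(x, types):
    """Assortativity coefficient r from a mixing matrix x[(s, t)]."""
    # y[t] is the proportion of edges incident with nodes of type t
    y = {t: sum(x[(s, t)] for s in types) for t in types}
    trace = sum(x[(t, t)] for t in types)
    sum_sq = sum(y[t] ** 2 for t in types)
    return (trace - sum_sq) / (1 - sum_sq)

def pearson(x, types):
    """Correlation of adjacent nodes' numeric types, weighted by x."""
    pairs = [(s, t) for s in types for t in types]
    e_i = sum(x[(s, t)] * s for s, t in pairs)
    e_j = sum(x[(s, t)] * t for s, t in pairs)
    e_ij = sum(x[(s, t)] * s * t for s, t in pairs)
    var_i = sum(x[(s, t)] * s ** 2 for s, t in pairs) - e_i ** 2
    var_j = sum(x[(s, t)] * t ** 2 for s, t in pairs) - e_j ** 2
    return (e_ij - e_i * e_j) / sqrt(var_i * var_j)

# Illustrative inputs (assumptions, not data): two distinct real types
# and a symmetric mixing matrix whose entries sum to one.
a, b = 2.0, 5.0
x = {(a, a): 0.3, (a, b): 0.1, (b, a): 0.1, (b, b): 0.5}

y_a = x[(a, a)] + x[(b, a)]
closed_form = (y_a * (1 - y_a) - x[(a, b)]) / (y_a * (1 - y_a))
r = assortativity(x, (a, b))
rho = pearson(x, (a, b))
print(r, rho, closed_form)  # all three agree up to floating-point error
```

Changing `a` and `b` to any other pair of distinct numbers leaves all three values unchanged, as the proof implies.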

Appendix: Constructing the mixing matrix

The proof relies on noticing that $x_{ab}=x_{ba}$, which comes from undirectedness of the network $N$ and from how the mixing matrix $X$ is constructed. I often forget this construction, so here’s a simple algorithm: Consider some type pair $(s,t)$. Look at the edges beginning at type $s$ nodes and count how many end at type $t$ nodes. Call this count $m_{st}$. Do the same for all type pairs to obtain a matrix $M=(m_{st})$ of edge counts. Divide the entries in $M$ by their sum to obtain $X$.
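Here is a minimal Python sketch of that algorithm. It assumes the network is stored as a list of undirected edges and a dictionary mapping nodes to types (both hypothetical representations), and it treats each undirected edge as "beginning" at either endpoint, so each edge is counted once in each direction.

```python
def mixing_matrix(edges, node_type, types):
    """Build X = (x_st) from undirected edges and a node -> type mapping."""
    # m[(s, t)] counts edges beginning at type-s nodes and ending at type-t nodes
    m = {(s, t): 0 for s in types for t in types}
    for i, j in edges:
        # An undirected edge {i, j} begins at i and ends at j, and vice versa
        m[(node_type[i], node_type[j])] += 1
        m[(node_type[j], node_type[i])] += 1
    total = sum(m.values())
    # Divide the counts by their sum to obtain proportions
    return {pair: count / total for pair, count in m.items()}

# Illustrative path network 0 - 1 - 2 - 3 with types a, a, b, b
edges = [(0, 1), (1, 2), (2, 3)]
node_type = {0: "a", 1: "a", 2: "b", 3: "b"}
X = mixing_matrix(edges, node_type, ("a", "b"))
print(X)
```

Counting each edge in both directions is what makes $X$ symmetric, so $x_{ab}=x_{ba}$ holds by construction.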

nberwp 1.1.0

Fri, 21 Jan 2022 00:00:00 +0000

A new version of nberwp, an R package containing data on NBER working papers, is available on CRAN. This version adds information about (i) papers published in July–December 2021 and (ii) author sexes.

Papers from late 2021

The second half of 2021 saw 649 new NBER working papers by 1,663 unique authors, 503 of whom had not published in the series previously. Those counts were down (from 858, 2,094, and 683, respectively) from the second half of 2020, but roughly in line with pre-pandemic trends:

nberwp 1.1.0 also corrects some false merges and splits among authors who published before July 2021. These corrections lowered the number of such authors from 15,437 in version 1.0.0 to 15,430 in version 1.1.0.

Author sexes

nberwp 1.1.0 adds information about author sexes, allowing one to, e.g., visualize the growing female representation among NBER working paper authors:

I obtain sex information by matching authors’ names with baby name and Facebook data, and through manual identification. I document my matching and manual procedures in “Sex-based sorting among economists: Evidence from the NBER,” a new paper comparing males’ and females’ co-authorship patterns.

Hypothesis tests and Bayesian reasoning

Thu, 06 Jan 2022 00:00:00 +0000

Most empirical research relies on hypothesis testing. We form null and alternative hypotheses (e.g., a regression coefficient equals zero or doesn’t), collect some data, and reject the null if it implies those data are rare enough. How rare is “enough” depends on the context, but a common rule is to reject the null if the p-value—that is, the probability of observing the same or rarer data given the null is true—is smaller than 0.05. However, this rule can lead to very different conclusions than Bayesian reasoning.

For example, suppose I’m the government trying to collect taxes. I know 1% of taxpayers cheat (e.g., by under-reporting their income), so I hire an auditor to detect cheating. The auditor makes occasional mistakes: they incorrectly detect cheating among 2% of non-cheaters. But the auditor never fails to detect cheating when it happens.

Suppose the auditor tells me Joe cheated on his taxes. Should I prosecute him for fraud? Letting “Joe is innocent” be the null hypothesis and “Joe is guilty” be the alternative, the p-value of the auditor’s message is simply their false positive rate: 0.02. This p-value is smaller than the 0.05 “critical value” below which I reject nulls, so I take the auditor’s message as strong evidence of guilt.

Now consider a random sample of a thousand taxpayers. The auditor accuses all ten cheaters in this sample of cheating. But the auditor also accuses 20 of the 990 non-cheaters of cheating. So only one in three accusees actually cheated—if I thought everyone like Joe was guilty, I would be wrong two thirds of the time! That’s hardly evidence of guilt “beyond reasonable doubt.”

What’s going on? Why does the hypothesis test suggest Joe is guilty, when simply counting true and false accusations suggests he’s innocent?

The suggestions differ because they are based on different probabilities. The hypothesis test uses the probability that the auditor detects Joe cheating given he is innocent: 0.02. But the counting argument uses the probability that Joe is innocent given the auditor detects him cheating: 0.66. (Notice the swap in what comes before and after “given.”)
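Bayes’ formula links the two probabilities. Using the numbers from the example (a 1% cheating rate, a 2% false positive rate, and no false negatives), it gives $$\begin{aligned} \Pr(\text{innocent}\mid\text{accused}) &= \frac{\Pr(\text{accused}\mid\text{innocent})\Pr(\text{innocent})}{\Pr(\text{accused}\mid\text{innocent})\Pr(\text{innocent})+\Pr(\text{accused}\mid\text{guilty})\Pr(\text{guilty})} \\ &= \frac{0.02\times0.99}{0.02\times0.99+1\times0.01} \\ &\approx 0.66. \end{aligned}$$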

But which probability should I use? Should I follow my hypothesis test and prosecute Joe, or should I follow my counting argument and let him walk free?

One problem with the hypothesis test is that it ignores the base rate: most taxpayers are innocent. Sure, false accusations are rare, but there are lots of non-cheaters to falsely accuse! These false accusations crowd out the true accusations, which are relatively rare because cheating is rare.

In contrast, counting accusees effectively takes the base rate as a prior belief in Joe’s innocence and updates this belief in response to the evidence provided by the auditor. My belief updates a lot—from 0.99 to 0.66—but not enough to indict Joe beyond reasonable doubt. The auditor’s signal is too noisy to establish guilt on its own. (One way to combat this noise is to hire a second auditor, identical to but independent of the first. If both auditors told me Joe cheated then my belief in his innocence would fall to 0.04, which would be much stronger grounds for prosecution.)

However, things can change if my prior belief is incorrect. For example, suppose I think 10% of taxpayers cheat, ten times as many as actually cheat. When the auditor tells me Joe cheated, Bayes’ formula tells me to update my belief in Joe’s innocence from 0.9 to 0.15, which is plausible grounds for prosecution. Now accusee-counting agrees with my hypothesis test, even though my evidence didn’t change. This sensitivity to prior beliefs—which may be incorrect, or may not even exist—is a common criticism of Bayesian inference.
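The posterior beliefs quoted above can be reproduced with a few lines of Python. This is a sketch under the example's assumptions: auditors never miss a cheater, and multiple auditors err independently. The function name and its parameters are hypothetical.

```python
def belief_in_innocence(prior_cheat, false_positive_rate,
                        n_auditors=1, detection_rate=1.0):
    """Posterior probability of innocence after every auditor accuses."""
    # Independent auditors: the probability that all of them accuse is a product
    p_accused_if_innocent = false_positive_rate ** n_auditors
    p_accused_if_guilty = detection_rate ** n_auditors
    innocent = (1 - prior_cheat) * p_accused_if_innocent
    guilty = prior_cheat * p_accused_if_guilty
    return innocent / (innocent + guilty)

one_auditor = belief_in_innocence(0.01, 0.02)
two_auditors = belief_in_innocence(0.01, 0.02, n_auditors=2)
wrong_prior = belief_in_innocence(0.10, 0.02)

print(round(one_auditor, 2))   # 0.66
print(round(two_auditors, 2))  # 0.04
print(round(wrong_prior, 2))   # 0.15
```

Varying `prior_cheat` while holding the evidence fixed shows how sensitive the conclusion is to the prior.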

But I like the Bayesian approach. It forces me to remember that data are noisy: the auditor makes mistakes, as do the tools I use to observe and catch data in the wild. This noisiness affects how I should interpret data as evidence of how the world works. Bayesian reasoning also forces me to specify my priors—they’re probably wrong, but specifying them encourages me to think about why they’re wrong (and, hopefully, work to make them less wrong).

I won’t go decrying hypothesis tests any time soon: they’re well-established as the dominant tool in empirical economics, not least because they’re easier to describe and interpret than Bayesian arguments. But I’ll try to “be more Bayesian” generally: to think more carefully about my beliefs, about evidence, and how my beliefs respond to evidence.

Thanks to Anirudh Sankar for reading a draft version of this post. It was inspired by the tenth chapter of Jordan Ellenberg’s How Not to Be Wrong.

Stable matchings with correlated preferences

Fri, 19 Nov 2021 00:00:00 +0000

Suppose I use the Gale-Shapley (GS) algorithm to find a stable matching between two sets $P$ and $R$ of size $n$. Proposer $p\in P$ gets utility $$u_{rp}=\alpha w_r+(1-\alpha)x_{rp}$$ from being matched with reviewer $r\in R$, where $w_r$ is common to all proposers, $x_{rp}$ is specific to proposer $p$, and $\alpha\in[0,1]$ controls the correlation in utilities across proposers.¹ Likewise, reviewer $r$ gets utility $$v_{pr}=\beta y_p+(1-\beta)z_{pr}$$ from being matched with proposer $p$, where $y_p$ is common to all reviewers, $z_{pr}$ is specific to reviewer $r$, and $\beta\in[0,1]$ controls the correlation in utilities across reviewers. The $w_r$, $x_{rp}$, $y_p$, and $z_{pr}$ are iid standard normal. I run the GS algorithm 200 times, each time (i) simulating new utility realizations and (ii) computing the means $$U\equiv\frac{1}{n}\sum_{p\in P}u_{rp}$$ and $$V\equiv\frac{1}{n}\sum_{r\in R}v_{pr}$$ of utilities under the resulting matching, so that each sum runs over the matched pairs $(p,r)$. I then compute the grand means of $U$ and $V$ across all 200 simulations. The chart below shows how these grand means vary with $\alpha$ and $\beta$ when $n=50$.

Proposers and reviewers tend to be better off when (i) utilities on their side of the market are less correlated and (ii) utilities on the other side of the market are more correlated. Intuitively, same-side correlations induce competition that makes the most desirable people on that side better off but the rest much worse off. This competition benefits the other side of the market because it gives people on that side more power to choose “winners” according to their preferences.
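The simulation described above can be sketched in base R. This is an illustrative implementation rather than the code behind the chart: the function names `gale_shapley` and `simulate_once` are hypothetical, and the values $\alpha=\beta=0.5$ in the final line are arbitrary.

```r
# Proposer-optimal Gale-Shapley matching.
# u[r, p]: proposer p's utility from reviewer r
# v[p, r]: reviewer r's utility from proposer p
gale_shapley <- function(u, v) {
  n <- nrow(u)
  prefs <- apply(u, 2, order, decreasing = TRUE)  # prefs[k, p]: p's k-th choice
  next_choice <- rep(1L, n)    # position of each proposer's next proposal
  held <- rep(NA_integer_, n)  # held[r]: proposer currently held by reviewer r
  free <- seq_len(n)           # proposers without a tentative match
  while (length(free) > 0) {
    p <- free[1]
    r <- prefs[next_choice[p], p]
    next_choice[p] <- next_choice[p] + 1L
    q <- held[r]
    if (is.na(q)) {
      held[r] <- p             # r was unmatched and accepts p
      free <- free[-1]
    } else if (v[p, r] > v[q, r]) {
      held[r] <- p             # r trades up from q to p
      free <- c(free[-1], q)
    }                          # otherwise r rejects p, who stays free
  }
  held
}

# One simulation run: draw utilities, match, and return the means U and V.
simulate_once <- function(n, alpha, beta) {
  w <- rnorm(n); x <- matrix(rnorm(n^2), n, n)
  y <- rnorm(n); z <- matrix(rnorm(n^2), n, n)
  u <- alpha * w + (1 - alpha) * x  # w recycles down columns: row r gets w[r]
  v <- beta * y + (1 - beta) * z    # likewise, row p gets y[p]
  held <- gale_shapley(u, v)
  r <- seq_len(n)
  c(U = mean(u[cbind(r, held)]), V = mean(v[cbind(held, r)]))
}

set.seed(1)
grand_means <- rowMeans(replicate(200, simulate_once(n = 50, alpha = 0.5, beta = 0.5)))
grand_means
```

Repeating the last step over a grid of `alpha` and `beta` values would trace out the kind of comparison shown in the chart.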

If $\mathrm{Var}(w_r)=\sigma_w^2$ and $\mathrm{Var}(x_{rp})=\sigma_x^2$ then $\mathrm{Corr}(u_{rp},u_{rq})=[1+(1-\alpha)^2\sigma_x^2/(\alpha^2\sigma_w^2)]^{-1}$ increases with $\alpha$. ↩︎

Learning from noisy signals

Sat, 23 Oct 2021 00:00:00 +0000

Suppose I want to learn the value of $\omega\in\{0,1\}$. I observe a sequence of iid signals $(s_n)_{n\ge1}$ with $$\Pr(s_n=0\,\vert\,\omega=0)=1-\alpha$$ and $$\Pr(s_n=1\,\vert\,\omega=1)=1-\beta,$$ where $\alpha$ and $\beta$ are false positive and false negative rates. I let $\pi_n$ denote my belief that $\omega=1$ after observing $n$ signals, and update this belief sequentially via Bayes’ formula: $$\pi_{n}(s)=\frac{\Pr(s_n=s\,\vert\,\omega=1)\pi_{n-1}}{\Pr(s_n=s)}.$$ In particular, if I observe $s_n=0$ then I update my belief to $$\pi_n(0)=\frac{\beta\pi_{n-1}}{\beta\pi_{n-1}+(1-\alpha)(1-\pi_{n-1})},$$ whereas if I observe $s_n=1$ then I update my belief to $$\pi_n(1)=\frac{(1-\beta)\pi_{n-1}}{(1-\beta)\pi_{n-1}+\alpha(1-\pi_{n-1})}.$$

The chart below shows how my belief $\pi_n$ changes with $n$. Each path in the chart corresponds to the sequence of beliefs $(\pi_0,\pi_1,\ldots,\pi_{100})$ obtained by updating my initial belief $\pi_0=0.5$ in response to a signal sequence $(s_1,s_2,\ldots,s_{100})$. I simulate 10 such sequences, fixing $\omega=1$ and $\alpha=0.4$ but varying $\beta\in\{0.2,0.4,0.6,0.8\}$.

If $\beta\not=0.6$ then my belief converges to $\pi_n=1$ as $n$ grows. However, if $\beta=0.6$ then $\pi_n=\pi_0$ for each $n$; that is, I never update my beliefs regardless of the signals I observe. This is because if $\alpha+\beta=1$ then $\Pr(s_n=s\,\vert\,\omega=1)=\Pr(s_n=s)$ for each $s\in\{0,1\}$, and so signals are uninformative because they are independent of $\omega$.
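The updating rule and the uninformative case can be checked directly. The sketch below implements the two Bayes formulas from above and simulates one belief path with $\omega=1$; the function names `update_belief` and `simulate_beliefs` are hypothetical.

```r
# One Bayesian update of the belief that omega = 1 after observing signal s.
update_belief <- function(prior, s, alpha, beta) {
  like1 <- if (s == 1) 1 - beta else beta    # Pr(s | omega = 1)
  like0 <- if (s == 1) alpha else 1 - alpha  # Pr(s | omega = 0)
  like1 * prior / (like1 * prior + like0 * (1 - prior))
}

# Belief path (pi_0, pi_1, ..., pi_n) when the true state is omega = 1.
simulate_beliefs <- function(n, alpha, beta, prior = 0.5) {
  signals <- rbinom(n, 1, 1 - beta)  # omega = 1, so Pr(s_n = 1) = 1 - beta
  Reduce(function(p, s) update_belief(p, s, alpha, beta), signals, prior,
         accumulate = TRUE)
}

set.seed(1)
path_informative <- simulate_beliefs(100, alpha = 0.4, beta = 0.2)
path_uninformative <- simulate_beliefs(100, alpha = 0.4, beta = 0.6)
range(path_uninformative)  # stays at the initial belief when alpha + beta = 1
```

When $\alpha+\beta=1$ the two likelihoods in `update_belief` coincide, so the posterior equals the prior whatever the signal.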

The chart below plots the mean of my beliefs $\pi_n$ across 1,000 realizations of the signals simulated above. Again, I fix $\omega=1$ and the false positive rate $\alpha=0.4$ but vary the false negative rate $\beta\in\{0.2,0.4,0.6,0.8\}$. Higher values of $\beta$ are not always worse: my belief converges to the truth faster when $\beta=0.8$ than when $\beta=0.4$. Intuitively, if I know the false negative rate is close to 100% then observing a signal $s_n=0$ gives me strong evidence that $\omega=1$.
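One way to check this ordering, offered as an illustration rather than as the method behind the chart, is to compute the expected change in my log-odds $\log(\pi_n/(1-\pi_n))$ per signal when $\omega=1$. Each signal $s$ adds $\log[\Pr(s\,\vert\,\omega=1)/\Pr(s\,\vert\,\omega=0)]$ to the log-odds, so the expected increment is the probability-weighted average of those log ratios. The function name `expected_drift` is hypothetical.

```r
# Expected per-signal change in log-odds of omega = 1, given omega = 1.
expected_drift <- function(alpha, beta) {
  p1 <- c(beta, 1 - beta)    # Pr(s = 0 | omega = 1), Pr(s = 1 | omega = 1)
  p0 <- c(1 - alpha, alpha)  # Pr(s = 0 | omega = 0), Pr(s = 1 | omega = 0)
  sum(p1 * log(p1 / p0))
}

betas <- c(0.2, 0.4, 0.6, 0.8)
drift <- sapply(betas, expected_drift, alpha = 0.4)
round(setNames(drift, betas), 3)
```

The drift is larger at $\beta=0.8$ than at $\beta=0.4$ and vanishes at $\beta=0.6$, matching the pattern described above.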

A competition model of corruption

Tue, 12 Oct 2021 00:00:00 +0000

This post presents a simple model of corruption in two-party elections. The model is similar to one of Cournot competition: parties choose quantities of corruption in response to implicit “price” schedules determined by voter preferences. I describe the model, derive and analyze its equilibrium, provide a numerical example, and discuss some alternatives.

Model

Two parties $A$ and $B$ compete for votes in an electoral system with proportional representation. Each party $k$ chooses its corruption level $c_k$ to maximize $c_ks_k(c_A,c_B)$, where party $k$'s vote share $s_k(c_A,c_B)$ depends on both parties’ chosen corruptions. This objective captures how parties benefit from engaging in corrupt activities, but only insofar as voters give them power to do so.

Voters don’t like corruption: voter $i$'s payoff from voting for $A$ is $(1-c_A+\epsilon_i)$ and their payoff from voting for $B$ is $(1-c_B)$. The $\epsilon_i$ are iid uniformly distributed on $[b-w,b+w]$, where $b$ is the mean bias in favor of party $A$ and $w>0$ controls the noise in voter preferences. Thus, party $A$'s vote share is $$\begin{aligned} s_A(c_A,c_B) &= \Pr(1-c_A+\epsilon_i\ge1-c_B) \\ &= \Pr(\epsilon_i\ge c_A-c_B) \\ &= \frac{b+w-(c_A-c_B)}{(b+w)-(b-w)} \\ &= \frac{1}{2}+\frac{b-(c_A-c_B)}{2w} \end{aligned}$$ while party $B$'s vote share is $$\begin{aligned} s_B &= 1-s_A(c_A,c_B) \\ &= \frac{1}{2}-\frac{b-(c_A-c_B)}{2w}. \end{aligned}$$ Parties $A$ and $B$ engage in a form of Cournot competition: they choose corruptions $c_k$ independently and simultaneously, with full knowledge of the (inverse) demand curves $s_k(c_A,c_B)$. These curves are downward-sloping: the “price” $s_k$, reflecting voters’ willingness to spend their votes on party $k$, falls with the chosen “quantity” $c_k$. Corruptions $c_A$ and $c_B$ are substitutes in the sense that, e.g., the price $s_A$ rises with $c_B$.
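As a quick sanity check on the vote share formula, the sketch below compares it with the share of simulated voters who prefer party $A$. The parameter values are arbitrary illustrations, and the function name `vote_share_A` is hypothetical. The formula applies when $c_A-c_B$ lies within $[b-w,b+w]$, which holds here.

```r
# Party A's vote share from the closed-form expression.
vote_share_A <- function(c_A, c_B, b, w) 0.5 + (b - (c_A - c_B)) / (2 * w)

set.seed(1)
b <- 1; w <- 2; c_A <- 1.5; c_B <- 1    # illustrative values only
eps <- runif(1e5, b - w, b + w)         # voter-specific biases toward A
simulated <- mean(1 - c_A + eps >= 1 - c_B)
c(formula = vote_share_A(c_A, c_B, b, w), simulated = simulated)
```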

Equilibrium

The competition over corruption levels resolves at a Nash equilibrium in which each party chooses optimally given the other party’s choice. For party $A$, this means choosing $c_A^*$ to satisfy the first-order condition $$\newcommand{\parfrac}[2]{\frac{\partial\,#1}{\partial\,#2}} \begin{aligned} 0 &= \parfrac{}{c_A^*}\left(c_A^*\,s_A(c_A^*,c_B)\right) \\ &= \frac{1}{2}+\frac{b-2c_A^*+c_B}{2w}, \end{aligned}$$ which can be rewritten as $$2c_A^*-c_B=w+b.$$ Similarly, the first-order condition for party $B$'s optimal choice $c_B^*$ can be written as $$-c_A+2c_B^*=w-b.$$ Therefore, the Nash equilibrium levels of corruption $(c_A^*,c_B^*)$ satisfy the linear system $$\begin{bmatrix}2&-1\\-1&2\end{bmatrix}\begin{bmatrix}c_A^*\\c_B^*\end{bmatrix}=\begin{bmatrix}w+b\\w-b\end{bmatrix},$$ which has unique solution $$\begin{bmatrix}c_A^*\\c_B^*\end{bmatrix}=\frac{1}{3}\begin{bmatrix}3w+b\\3w-b\end{bmatrix}.$$ Substituting $c_A^*-c_B^*=2b/3$ into $s_A$ gives party $A$'s vote share in this equilibrium: $$s_A(c_A^*,c_B^*)=\frac{1}{2}\left(1+\frac{b}{3w}\right),$$ which exceeds 50% if and only if $b$ is positive; that is, when voters are biased toward party $A$. In that case, party $A$ can “sell” more corruption than party $B$ because voters are more tolerant of $A$'s corruption than $B$'s. But the gap $c_A^*-c_B^*=2b/3$ is smaller than the bias $b$, so party $A$ still obtains a higher price $s_A(c_A^*,c_B^*)$ than party $B$. The first-order conditions imply $s_k(c_A^*,c_B^*)=\frac{c_k^*}{2w}$, so the parties’ “corruption revenues” in equilibrium are $$c_A^*s_A(c_A^*,c_B^*)=\frac{(3w+b)^2}{18w}$$ and $$c_B^*s_B(c_A^*,c_B^*)=\frac{(3w-b)^2}{18w}.$$ The party that voters favor earns more. Both vote shares lie strictly between zero and one when $\lvert b\rvert<3w$, which I assume below.

Comparative statics

Differentiating the Nash equilibrium corruption levels $c_A^*$ and $c_B^*$ with respect to the mean bias $b$ gives $$\parfrac{c_A^*}{b}=\frac{1}{3}=-\parfrac{c_B^*}{b},$$ implying that if $b$ increases then party $A$ becomes more corrupt by exactly the amount that party $B$ becomes less corrupt. Indeed, aggregate corruption $c_A^*+c_B^*=2w$ is constant in $b$ but increases with $w$. Both parties become more corrupt (in equilibrium) when $w$ rises: $$\parfrac{c_A^*}{w}=1=\parfrac{c_B^*}{w}.$$ Intuitively, if $w$ rises then voters become less sensitive to corruption because their preferences become noisier. Both parties exploit this fall in sensitivity by becoming more corrupt, which makes them better off because $$\parfrac{}{w}\left(c_k^*\,s_k(c_A^*,c_B^*)\right)=\frac{1}{2}-\frac{b^2}{18w^2}$$ is strictly positive when $\lvert b\rvert<3w$. On the other hand, if $b$ rises then voters become more willing to tolerate party $A$'s corruption and less willing to tolerate party $B$'s. Party $A$ responds to this shift in relative tolerance by selling more corruption, and it does so at a higher price $s_A(c_A^*,c_B^*)$ because the corruption gap $c_A^*-c_B^*$ grows by only two thirds of the rise in $b$.

Numerical example

The Nash equilibrium corruption levels lie at the intersection of party $A$'s best response curve $$c_A^*=\frac{c_B+w+b}{2}$$ and party $B$'s best response curve $$c_B^*=\frac{c_A+w-b}{2},$$ obtained by rearranging the first-order conditions for $c_A^*$ and $c_B^*$. The chart below plots these curves when $b=3$ and $w=5$. The curves intersect at $(c_A^*,c_B^*)=(6,4)$, where party $A$ wins a vote share of $s_A(c_A^*,c_B^*)=60\%$.

Now suppose the mean bias in favor of party $A$ rises to $b=9$. The chart below shows how this rise shifts parties’ best response curves in the $c_Ac_B$ plane. These shifts move the Nash equilibrium rightward to $(c_A^*,c_B^*)=(8,2)$. Party $A$'s vote share rises to $s_A(c_A^*,c_B^*)=80\%$. Party $A$'s corruption revenue rises from $3.6$ to $6.4$, while party $B$'s falls from $1.6$ to $0.4$.
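The equilibrium corruption levels in both scenarios can be recovered by solving the linear system of first-order conditions numerically. This is a minimal sketch; the function name `equilibrium` is hypothetical.

```r
# Solve the 2x2 system of first-order conditions for (c_A, c_B).
equilibrium <- function(b, w) {
  lhs <- matrix(c(2, -1,
                  -1, 2), nrow = 2, byrow = TRUE)
  rhs <- c(w + b, w - b)
  setNames(solve(lhs, rhs), c("c_A", "c_B"))
}

low_bias <- equilibrium(b = 3, w = 5)
high_bias <- equilibrium(b = 9, w = 5)
rbind(low_bias, high_bias)
```

The two rows reproduce the intersections $(6,4)$ and $(8,2)$, and each row sums to $2w=10$.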

Alternative models

With proportional representation, every vote for party $k$ gives that party more power to engage in corrupt activities. Consequently, the party trades off its corruption level $c_k$ with its vote share $s_k(c_A,c_B)$ continuously. In contrast, if only the party with a majority vote share gains power then corruption revenues become discontinuous in vote shares. This discontinuity changes the equilibrium choices of $c_A$ and $c_B$. For example, if electoral ties are resolved with a coin toss then the unique equilibrium gives each party a 50% vote share independently of $b$ and $w$, and the corruption levels satisfy $c_A^*=c_B^*+b$ (as opposed to $c_A^*=c_B^*+2b/3$ under proportional representation).

One way to generalize the model with proportional representation is to introduce voting blocs: groups of voters with group-specific mean biases $b_j$ and radii $w_j$. Then the equilibrium corruption levels become $$c_A^*=\frac{3+\sum_jb_j\theta_j/w_j}{3\sum_j\theta_j/w_j}$$ and $$c_B^*=\frac{3-\sum_jb_j\theta_j/w_j}{3\sum_j\theta_j/w_j},$$ where $\theta_j$ is group $j$'s share of the population. Intuitively, the equilibrium depends on the aggregate bias and precision of voters’ preferences, but these aggregates depend on the group-specific biases $b_j$ and precisions $1/w_j$ as well as the relative group sizes $\theta_j$. Introducing voting blocs makes the comparative statics more intricate but preserves the underlying intuitions.
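The voting-bloc formulas are easy to evaluate numerically. The sketch below computes them and confirms that a single bloc reproduces the baseline equilibrium; the function name `bloc_equilibrium` and the two-bloc parameter values are hypothetical illustrations.

```r
# Equilibrium corruption levels with voting blocs.
# b, w, theta: vectors of bloc biases, radii, and population shares.
bloc_equilibrium <- function(b, w, theta) {
  precision <- sum(theta / w)   # aggregate precision of preferences
  bias <- sum(b * theta / w)    # precision-weighted aggregate bias
  c(c_A = (3 + bias) / (3 * precision),
    c_B = (3 - bias) / (3 * precision))
}

one_bloc <- bloc_equilibrium(b = 3, w = 5, theta = 1)
two_blocs <- bloc_equilibrium(b = c(3, -1), w = c(5, 2), theta = c(0.6, 0.4))
rbind(one_bloc, two_blocs)
```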

Snowball sampling bias in program evaluation

Sat, 04 Sep 2021 00:00:00 +0000

Suppose I want to run a pilot study of a mental health support program before rolling it out at scale. The program has heterogeneous treatment effects, but tends to be more effective for people who have fewer social connections. Such people tend to have lower mental health (Kawachi and Berkman, 2001) and so have more to gain from participating in the program.

I recruit people to my study via snowball sampling: I advertise it to a few initial seeds, who share the ads with their friends, who share the ads with their friends, and so on. Everyone who sees an ad participates. But some people are more likely to see ads than others: in particular, people with more friends have more chances to be sent an ad. Consequently, I will tend to under-estimate the average treatment effect (ATE) of the program because people with more social connections, for whom the program is less effective, are more likely to appear in my pilot sample. Such under-estimation may lead me to abandon the program even if its mental health benefits actually outweigh its implementation costs.

Demonstration

As a concrete example, suppose each individual $i$ has degree $d_i$ in the social network from which I recruit my sample. The treatment effect of the program on individual $i$ is $$\beta_i=1-r\bar{d}_i+(1-r)z_i,$$ where $\bar{d}_i$ is a normalization of $d_i$ with zero mean and unit variance across the network, the $z_i$ are iid standard normal, and $r$ is a parameter controlling the (negative) correlation between the $\beta_i$ and $d_i$. The treatment effects $\beta_i$ give rise to individual-level outcomes $$y_i=\beta_it_i+\epsilon_i,$$ where the $t_i$ are binary treatment indicators and the $\epsilon_i$ are iid standard normal errors. The sample delivers an estimate $$\hat\beta=\frac{\sum_iy_it_i}{\sum_it_i}-\frac{\sum_iy_i(1-t_i)}{\sum_i(1-t_i)}$$ of the program’s ATE: the difference in mean outcomes between treated and untreated members of the pilot sample. Treatments are assigned to sample members randomly. But the sample is recruited non-randomly: individual $i$ is recruited with probability proportional to their degree $d_i$. This non-random recruitment leads to sampling bias when the $\beta_i$ and $d_i$ are correlated.

The chart below summarizes the distribution of ATE estimates across 500 snowball samples of 250 people from a random social network. This network contains 1,000 bilateral friendships among 1,000 people. Network degrees vary between zero and seven, producing variation in the probability of being sampled. I randomize the treatment effects $\beta_i$ and assignments $t_i$ in each simulation run.

The ATE estimate is unbiased when treatment effects are uncorrelated with network degrees. However, the estimate becomes more biased as the correlation becomes stronger. Intuitively, the more the program’s effectiveness is concentrated among low-degree individuals, the worse the program looks in pilot samples excluding those individuals (independently of how treatments are assigned).

Potential solutions

How can we mitigate snowball sampling bias? One approach is to collect information about sample members’ degrees in the social network, and use this information to obtain weighted ATE estimates.¹ The difference-in-means estimator $\hat\beta$ equals the OLS estimator of $\beta$ in the linear model $$y_i=\beta t_i+\varepsilon_i$$ relating outcomes to treatment assignments. Using weighted least squares (WLS) with weights $w_i=1/\sqrt{d_i}$ may deliver less biased estimates by accounting for the probability of sampling each individual $i$. Intuitively, individuals with lower degrees provide relatively more information about the true ATE because they are less likely to be sampled, and so giving these individuals higher weights in the estimation procedure leads to more informed estimates.² However, the distribution of degrees $d_i$ in the sample is different than the distribution of degrees in the full network, and so weighting by the (observed) $d_i$ may still deliver different (and thus incorrect) estimates than weighting by the (unobserved) sampling probabilities.
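The bias and the weighting fix can be illustrated with a small simulation. This sketch makes several simplifying assumptions that are not from the analysis above: degrees are drawn from a Poisson distribution with mean two instead of from a simulated network, recruitment is a single draw with probability proportional to degree instead of a true snowball process, and $r=0.5$ is arbitrary. Because `lm()` multiplies squared residuals by its `weights` argument, passing `1 / d` corresponds to the weights $w_i=1/\sqrt{d_i}$ described above.

```r
set.seed(42)
N <- 1000  # population size
n <- 250   # pilot sample size
r <- 0.5   # strength of the (negative) link between effects and degrees

d <- rpois(N, 2)                # assumed degree distribution (stand-in)
d_bar <- as.numeric(scale(d))   # degrees normalized to mean 0, variance 1

one_run <- function() {
  beta_i <- 1 - r * d_bar + (1 - r) * rnorm(N)  # individual treatment effects
  idx <- sample(N, n, prob = d)                 # recruit proportionally to degree
  treat <- rbinom(n, 1, 0.5)                    # random treatment assignment
  y <- beta_i[idx] * treat + rnorm(n)
  c(ols = unname(coef(lm(y ~ treat))["treat"]),
    wls = unname(coef(lm(y ~ treat, weights = 1 / d[idx]))["treat"]))
}

estimates <- replicate(200, one_run())
avg <- rowMeans(estimates)
avg  # compare with the population ATE, which is close to 1 here
```

In this setup the unweighted estimate falls well short of the population ATE, and weighting narrows the gap without closing it, since people with zero degree are never sampled.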

Another approach, suggested by Jackson et al. (2020), is to model sample recruitment explicitly using game theory. The authors describe a game wherein each individual’s recruitment payoff depends on whether their peers are recruited. The equilibrium of this game determines each individual’s recruitment probability conditional on the network structure (and other covariates). Jackson et al. embed this game in an estimation procedure based on propensity score matching, and show theoretically and empirically that this procedure leads to better ATE estimates.

Thanks to Ryan Brennan for discussing the ideas presented in this post.

This approach is conceptually similar to the “respondent-driven sampling” technique described by Salganik and Heckathorn (2004). ↩︎
Taking square roots recognizes that the objective of WLS is to minimize the weighted sum of squared residuals. ↩︎

Improving human predictions

Tue, 17 Aug 2021 00:00:00 +0000

Chapter 9 of Kahneman et al. (2021) discusses how predictions made by humans can be less accurate than predictions made using statistical models. Part of the chapter describes research by Goldberg (1970) and subsequent authors showing that models of human predictions can out-perform the humans on which those models are based.

For example, suppose I’m asked to make predictions in a range of contexts $i\in\{1,2,\ldots,n\}$. My goal is to use some contextual data $x_i\in\mathbb{R}^k$ to predict the value of a context-specific outcome $y_i$. I generate predictions $$\newcommand{\abs}[1]{\lvert#1\rvert} \bar{y}_i=y_i+u_i,$$ where the $u_i$ are context-specific errors. The accuracy of my predictions can be measured via their mean squared error (MSE) $$\frac{1}{n}\sum_{i=1}^n(\bar{y}_i-y_i)^2=\frac{1}{n}\sum_{i=1}^nu_i^2,$$ where a lower MSE implies higher accuracy. Another way to generate predictions could be to posit a linear model $$y_i=\theta x_i+\epsilon_i,$$ where $\theta$ is a row vector of coefficients and the $\epsilon_i$ are random errors. But I don’t know the true outcomes $y_i$—hence needing to predict them—and so I can’t just use ordinary least squares (OLS) to estimate $\theta$. Instead, Goldberg (1970) suggests replacing this linear model with $$\bar{y}_i=\beta x_i+\varepsilon_i,$$ where $\beta$ is a (possibly different) vector of coefficients and the $\varepsilon_i$ are (possibly different) random errors. This second model describes the linearized relationship between my (possibly incorrect) predictions $\bar{y}_i$ and the data $x_i$ on which those predictions are based. Since I know my predictions $\bar{y}_i$, I can use OLS to obtain an estimate $\hat\beta$ of $\beta$ and produce a set of “modeled predictions” $$\hat{y}_i=\hat\beta x_i.$$ The difference between the $\bar{y}_i$ and $\hat{y}_i$ is that the latter ignore the non-linearities in my method for generating predictions. Intuitively, the $\hat{y}_i$ represent what I would predict using a simple, linear formula; my predictions $\bar{y}_i$ may be generated using a formula that is much more complex, or may not be generated using a formula at all.

So, how do my raw predictions $\bar{y}_i$ and their modeled counterparts $\hat{y}_i$ compare? The chart below plots the $\bar{y}_i$ and $\hat{y}_i$ against the true values $y_i$ when

the $x_i$ and $u_i$ are iid standard normal, and
$y_i=(x_i+z_i)/2$ with $z_i$ iid standard normal.

The modeled predictions are far more accurate: they have an MSE of 0.22, whereas my raw predictions have an MSE of 0.76. In this case, the true relationship between the $y_i$ and $x_i$ is linear, and so a linear model of my predictions is well-placed to out-perform those predictions.

However, modeling predictions does not always improve their accuracy. For example, suppose the contextual data $x_i$ are scalars, and the $x_i$, $y_i$, and $u_i$ have zero means. Then the MSE of the modeled predictions turns out to be $$\frac{1}{n}\sum_{i=1}^n(\hat{y}_i-y_i)^2=\sigma_y^2+\rho_{ux}^2\sigma_u^2-\rho_{xy}^2\sigma_y^2,$$ where $\sigma_y^2$ and $\sigma_u^2$ are the variances of the $y_i$ and $u_i$, where $\rho_{ux}$ is the correlation of the $u_i$ and $x_i$, and where $\rho_{xy}$ is the correlation of the $x_i$ and $y_i$. Consequently, replacing my raw predictions $\bar{y}_i$ with their modeled counterparts $\hat{y}_i$ leads to an accuracy improvement if and only if $$\sigma_y^2(1-\rho_{xy}^2)<\sigma_u^2(1-\rho_{ux}^2).$$ This condition holds in the example plotted above: $\sigma_u^2$ equals unity while $\sigma_y^2$ equals one half (in population), and $\rho_{xy}=0.69$ is much larger in absolute value than $\rho_{ux}=-0.09$. In general, the condition is most likely to hold when

$\sigma_u^2$ is larger than $\sigma_y^2$ (i.e., my raw predictions are relatively noisy);
$\abs{\rho_{xy}}$ is large (i.e., the relationship between the $y_i$ and $x_i$ is approximately linear and deterministic); and
$\abs{\rho_{ux}}$ is small (i.e., the errors $u_i$ in my raw predictions are relatively uncorrelated with the $x_i$).

Intuitively, if the outcomes $y_i$ are a linear function of the $x_i$ (i.e., if $\abs{\rho_{xy}}=1$) then linearizing my predictions improves their accuracy by removing non-linear errors. On the other hand, if my prediction errors $u_i$ are a linear function of the $x_i$ (i.e., if $\abs{\rho_{ux}}=1$) then linearizing my predictions cannot improve their accuracy because there are no non-linear errors to remove.
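The comparison between raw and modeled predictions can be sketched as follows. The sample size and seed are assumptions, so the numbers will differ from those quoted above. The last lines evaluate the MSE formula using second moments about zero, which makes it match the fitted model exactly in-sample.

```r
set.seed(1)
n <- 1000                 # assumed sample size
x <- rnorm(n); u <- rnorm(n); z <- rnorm(n)
y <- (x + z) / 2          # true outcomes
y_bar <- y + u            # my raw predictions
fit <- lm(y_bar ~ 0 + x)  # linear model of my predictions (no intercept)
y_hat <- fitted(fit)      # modeled predictions

mse <- function(pred) mean((pred - y)^2)
mse_raw <- mse(y_bar)
mse_modeled <- mse(y_hat)

# Modeled MSE from the formula, with variances and correlations taken about zero:
# sigma_y^2 + rho_ux^2 sigma_u^2 - rho_xy^2 sigma_y^2
m <- function(a, b) mean(a * b)
mse_formula <- m(y, y) + (m(u, x)^2 - m(x, y)^2) / m(x, x)

c(raw = mse_raw, modeled = mse_modeled, formula = mse_formula)
```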

Dead ends

Wed, 28 Jul 2021 00:00:00 +0000

We can think of research as having two phases:

a “creative phase,” during which researchers search for new ideas, and
a “working phase,” during which researchers exert effort on an idea.

Sadler (2021) views these phases as complementary productive inputs, and analyzes how and why the optimal input mix changes when features of the research environment change. For example, if creativity becomes more expensive then researchers spend more time working on each idea and, in Sadler’s model, tend to work on lower quality ideas. Policymakers can use subsidies and taxes to change the relative costs of creativity and work, thereby influencing how researchers allocate their time.

The core features of Sadler’s model are as follows. A researcher wants to maximize the (sum of discounted) payoffs from their life’s work on research ideas. Different ideas have different qualities, and the researcher knows an idea’s quality when they find it. What they don’t know is whether each idea is feasible: some ideas are “dead ends” in the sense that they cannot generate payoffs, no matter their quality or how long the researcher works on them. The longer the researcher works on an idea without it paying off, the more they start to believe the idea is a dead end. The researcher can act on this belief at any time by abandoning an idea and searching for a new one.

In Sadler’s model, the researcher continues working on an idea if and only if the expected payoff exceeds the sum of two costs: (i) the effort required to work on the idea and (ii) the opportunity cost of not searching for another idea. This opportunity cost depends on the researcher’s discount rate for future payoffs because new ideas take time to find. The expected payoff from working on an idea falls over time, implying that there is a unique amount of time that the researcher spends on each idea before abandoning it. This amount of time is larger for ideas with higher quality.

If effort becomes more costly then the researcher spends less time working on each idea and focuses their effort on higher quality ideas. Intuitively, this rise in effort costs makes the creative phase relatively cheaper, so the researcher substitutes towards it. On the other hand, if the researcher’s discount rate rises (i.e., they become less patient) then they spend more time in the working phase and allow themselves to work on lower quality ideas. This is because the discounted payoff from continuing to work on an idea falls by less than the discounted opportunity cost of searching for a new idea.

Sadler’s model helps explain why organizations with a shorter-term focus (i.e., a higher discount rate) tend to be less innovative: they spend too much time working on low quality ideas, and not enough time searching for high quality ideas. In contrast, organizations that use longer-term incentives, such as stock options and golden parachutes, tend to be more innovative (Lerner and Wulf, 2007; Francis et al., 2011).

Sadler’s model also demonstrates how subsidies and taxes can influence how researchers allocate their time. Subsidizing effort during the working phase lowers the relative cost of that phase, and so researchers spend more time working but tend to work on lower quality ideas. Taxing payoffs raises the quality threshold for abandoning an idea, and so researchers spend less time in the working phase but tend to work on higher quality ideas. The optimal policy depends on the social value of research: the more convex is that value in idea quality, the more society wants researchers to focus on fewer but higher quality ideas, and so the more attractive are taxes relative to subsidies.

nberwp is now on CRAN

Wed, 21 Jul 2021 00:00:00 +0000

nberwp, an R package providing information on NBER working papers and their authors, is now available on CRAN. The current version (1.0.0) covers 29,434 papers published between June 1973 and June 2021. It can be installed via

install.packages('nberwp')

nberwp has evolved since its initial release on GitHub nearly two years ago. This post describes some of the main changes.

More papers

The first version of nberwp covered papers published between June 1973 and December 2018. The updated version adds papers published between January 2019 and June 2021, allowing one to visualize the spike in publications when COVID-19 emerged:

library(dplyr)
library(ggplot2)
library(nberwp)

papers %>%
  count(Quarter = year + (ceiling(month / 3) - 1) / 4, name = 'New papers') %>%
  ggplot(aes(Quarter, `New papers`)) +
  geom_line() +
  labs(title = 'COVID-19 induced a spike in NBER publications',
       subtitle = 'New NBER working papers, by quarter')

nberwp now also includes papers published in the historical and technical working paper series. The historical series contains 136 papers focused on (American) economic history, and the technical series contains 337 papers focused on analytical and empirical methods.

The working paper data exclude duplicates (e.g., papers published in multiple series) but include revisions, which capture continued development of (and collaboration on) research ideas that I believe should be acknowledged.

Program affiliations

The NBER organizes its research into programs, each of which “corresponds loosely to a traditional field of study within economics.” nberwp now provides a table of paper-program correspondences

paper_programs

## # A tibble: 53,996 x 2
##    paper program
##    <chr> <chr>  
##  1 w0074 EFG    
##  2 w0087 IFM    
##  3 w0087 ITI    
##  4 w0107 PE     
##  5 w0116 PE     
##  6 w0117 LS     
##  7 w0129 HE     
##  8 w0131 IFM    
##  9 w0131 ITI    
## 10 w0134 HE     
## # … with 53,986 more rows

as well as a table of program descriptions:

programs

## # A tibble: 21 x 3
##    program program_desc                        program_category   
##    <chr>   <chr>                               <chr>              
##  1 AG      Economics of Aging                  Micro              
##  2 AP      Asset Pricing                       Finance            
##  3 CF      Corporate Finance                   Finance            
##  4 CH      Children                            Micro              
##  5 DAE     Development of the American Economy Micro              
##  6 DEV     Development Economics               Micro              
##  7 ED      Economics of Education              Micro              
##  8 EEE     Environment and Energy Economics    Micro              
##  9 EFG     Economic Fluctuations and Growth    Macro/International
## 10 HC      Health Care                         Micro              
## # … with 11 more rows

The program_category column categorizes programs similarly to Chari and Goldsmith-Pinkham (2017). On average, each paper is affiliated with 1.83 programs and each program has 2,571 affiliated papers.

One use of the paper-program correspondences is to analyze the intellectual overlaps among programs. For example, the table below presents the six pairs of programs with the most-overlapping sets of affiliated papers, with overlap sizes measured by Jaccard indices. The top index of 0.29 means that about 29% of the papers affiliated with the Children or Economics of Education programs are affiliated with both.

Program 1	Program 2	Jaccard index
Children	Economics of Education	0.29
Health Care	Health Economics	0.29
International Finance and Macroeconomics	International Trade and Investment	0.26
Economic Fluctuations and Growth	Monetary Economics	0.23
Asset Pricing	Corporate Finance	0.17
Labor Studies	Public Economics	0.15
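
A Jaccard index is the size of the intersection of two sets divided by the size of their union. The sketch below computes it for every pair of programs in a toy table containing only the first ten rows of paper_programs printed above, so its values differ from those in the table. The helper name `jaccard` is hypothetical.

```r
# Jaccard index of two sets of paper IDs.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Toy stand-in for nberwp::paper_programs (first ten rows only)
toy <- data.frame(
  paper = c("w0074", "w0087", "w0087", "w0107", "w0116",
            "w0117", "w0129", "w0131", "w0131", "w0134"),
  program = c("EFG", "IFM", "ITI", "PE", "PE",
              "LS", "HE", "IFM", "ITI", "HE")
)

papers_by_program <- split(toy$paper, toy$program)
pairs <- t(combn(names(papers_by_program), 2))  # all unordered program pairs
overlaps <- data.frame(
  program_1 = pairs[, 1],
  program_2 = pairs[, 2],
  jaccard = mapply(function(p, q) jaccard(papers_by_program[[p]], papers_by_program[[q]]),
                   pairs[, 1], pairs[, 2], USE.NAMES = FALSE)
)
overlaps[order(-overlaps$jaccard), ]
```

Replacing `toy` with the full paper_programs table should give the overlaps for all programs.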

Authorships

nberwp now contains information about working papers’ (co-)authors:

authors

## # A tibble: 15,437 x 4
##    author  name             user_nber        user_repec
##    <chr>   <chr>            <chr>            <chr>     
##  1 w0001.1 Finis Welch      finis_welch      <NA>      
##  2 w0002.1 Barry R Chiswick barry_chiswick   pch425    
##  3 w0003.1 Swarnjit S Arora swarnjit_arora   <NA>      
##  4 w0004.1 Lee A Lillard    <NA>             pli669    
##  5 w0005.1 James P Smith    james_smith      psm28     
##  6 w0006.1 Victor Zarnowitz victor_zarnowitz <NA>      
##  7 w0007.1 Lewis C Solmon   <NA>             <NA>      
##  8 w0008.1 Merle Yahr Weiss <NA>             <NA>      
##  9 w0008.2 Robert E Lipsey  robert_lipsey    pli259    
## 10 w0010.1 Paul W Holland   <NA>             <NA>      
## # … with 15,427 more rows

The author column contains unique author identifiers, constructed by concatenating each author’s debut paper and their position on that paper’s (alphabetized) byline. This construction ensures that author values do not change when I add newly published papers to the data. The user_nber column contains authors’ usernames on the NBER website; the user_repec column contains authors’ RePEc IDs. Some authors do not have an NBER username or RePEc ID, indicated by NA values in the appropriate column.

nberwp also provides a table of paper-author correspondences:

paper_authors

## # A tibble: 67,090 x 2
##    paper author 
##    <chr> <chr>  
##  1 w0001 w0001.1
##  2 w0002 w0002.1
##  3 w0003 w0003.1
##  4 w0004 w0004.1
##  5 w0005 w0005.1
##  6 w0006 w0006.1
##  7 w0007 w0007.1
##  8 w0008 w0008.1
##  9 w0008 w0008.2
## 10 w0009 w0004.1
## # … with 67,080 more rows

This table can be used to construct a co-authorship network among the 15,437 authors identified in nberwp. This network currently contains 38,968 edges, implying that 0.03% of pairs co-authored at least one working paper during the period covered by the data. Authors in the network have a mean degree of 5.05.
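The density and mean degree follow from the author and edge counts alone. The sketch below checks that arithmetic, then shows one way to build co-authorship edges from a toy table containing only the first ten rows of paper_authors printed above. The variable names are hypothetical.

```r
# Network summary statistics from the counts reported above
n_authors <- 15437
n_edges <- 38968
density_pct <- 100 * n_edges / choose(n_authors, 2)  # % of author pairs linked
mean_degree <- 2 * n_edges / n_authors               # each edge adds two degrees
round(c(density_pct = density_pct, mean_degree = mean_degree), 2)

# Toy stand-in for nberwp::paper_authors (first ten rows only)
toy <- data.frame(
  paper = c("w0001", "w0002", "w0003", "w0004", "w0005",
            "w0006", "w0007", "w0008", "w0008", "w0009"),
  author = c("w0001.1", "w0002.1", "w0003.1", "w0004.1", "w0005.1",
             "w0006.1", "w0007.1", "w0008.1", "w0008.2", "w0004.1")
)

# One edge per pair of authors who share at least one paper
by_paper <- split(toy$author, toy$paper)
edges <- unique(do.call(rbind, lapply(by_paper, function(a) {
  if (length(a) > 1) t(combn(sort(a), 2))  # solo papers contribute no edges
})))
edges
```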

I used previous versions of nberwp in blog posts on triadic closure and female representation. These posts assumed that authors were uniquely identified by their full names. This assumption was problematic: different authors could share the same name, or a single author could publish under many names (e.g., before and after marriage). The updated version of nberwp builds on previous efforts to disambiguate authors’ names—namely cross-referencing against NBER usernames, RePEc IDs, common co-authorships, and name edit distances—in three ways:

using paper-program correspondences to identify authors who have similar names and published papers in similar programs, and so are likely to be the same person;
manually merging (or splitting) authors whom I determine to be the same (or distinct) based on their personal or academic websites;
including an author ID variable (author) rather than relying on names for unique identification.

These enhancements support cleaner analyses of (co-)authorship behavior. Nonetheless the data may still contain errors—if you find any, let me know by adding an issue on GitHub.

Coefficients of correlated regressors

Wed, 07 Jul 2021 00:00:00 +0000

Linear models cannot be estimated when regressors are perfectly correlated, and their coefficients have large variances when regressors are almost-perfectly correlated. But how does the correlation between coefficient estimates depend on the correlation between regressors?

To answer this question, suppose I have data $(y_i,x_i,z_i)_{i=1}^n$ generated by the process $$\newcommand{\abs}[1]{\lvert#1\rvert} \DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Var}{Var} \renewcommand{\epsilon}{\varepsilon} y_i=\beta_1x_i+\beta_2z_i+\epsilon_i,$$ where the $x_i$ and $z_i$ are normalized to have zero mean and unit variance, and where the $\epsilon_i$ are iid with zero mean and zero correlation with the $x_i$ and $z_i$. If the $x_i$ and $z_i$ are not perfectly correlated then the OLS estimator $\hat\beta$ of the coefficient vector $(\beta_1,\beta_2)$ has variance $$\DeclareMathOperator{\Var}{Var} \Var(\hat\beta)=\frac{\sigma^2}{n(1-\rho^2)}\begin{bmatrix}1&-\rho\\-\rho&1\end{bmatrix},$$ where $\sigma^2$ is the variance of the $\epsilon_i$, and where $\rho$ is the (empirical) correlation of the $x_i$ and $z_i$. It follows that $$\Cor(\hat\beta_1,\hat\beta_2)=-\Cor(x_i,z_i)$$ whenever the $x_i$ and $z_i$ are not perfectly correlated. As their correlation grows, the mean slope of the data in the directions spanned by the $x_i$ and $z_i$ approaches $(\beta_1+\beta_2)$, and so the OLS estimates $\hat\beta_1$ and $\hat\beta_2$ increasingly “compete” for contributions to their sum: if sampling error leads to one coefficient being over-estimated then the other coefficient must be under-estimated to preserve the sum. This competition drives the decreasing correlation of $\hat\beta_1$ and $\hat\beta_2$ as the $x_i$ and $z_i$ become more correlated.
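One way to see where the variance expression comes from, assuming the textbook homoskedastic formula $\operatorname{Var}(\hat\beta)=\sigma^2(X^\top X)^{-1}$ and writing $X$ for the $n\times2$ matrix with rows $(x_i,z_i)$ (notation introduced here for illustration): the normalization gives $\sum_ix_i^2=\sum_iz_i^2=n$ and $\sum_ix_iz_i=n\rho$, so $$X^\top X=n\begin{bmatrix}1&\rho\\\rho&1\end{bmatrix}\quad\text{and}\quad(X^\top X)^{-1}=\frac{1}{n(1-\rho^2)}\begin{bmatrix}1&-\rho\\-\rho&1\end{bmatrix}.$$ The two diagonal entries are equal, so dividing the off-diagonal entry by either one gives the correlation $-\rho$ between $\hat\beta_1$ and $\hat\beta_2$.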

The correlation of the $x_i$ and $z_i$ also determines the precision with which $(\beta_1\pm\beta_2)$ can be estimated. In particular, the expression for $\Var(\hat\beta)$ above implies $$\Var(\hat\beta_1\pm\hat\beta_2)=\frac{2\sigma^2}{n(1\pm\rho)}$$ for $\abs{\rho}<1$. As the $x_i$ and $z_i$ become more correlated (i.e., $\rho$ rises), over-estimates of $\beta_1$ must increasingly coincide with under-estimates of $\beta_2$, and so the estimate of $(\beta_1+\beta_2)$ becomes more precise because the errors cancel out. Conversely, the estimate of $(\beta_1-\beta_2)$ becomes less precise as $\rho$ rises because the errors in $\hat\beta_1$ and $\hat\beta_2$ amplify each other.
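The variance expressions above can be checked numerically. The sketch below is a minimal Python illustration, not the code behind this post: the function names, sample size, and seed are arbitrary choices, and it assumes the homoskedastic formula $\sigma^2(X^\top X)^{-1}$ for the variance of the OLS estimator.

```python
import random
from math import sqrt


def standardize(values):
    """Rescale values to have zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def make_regressors(n, rho, seed=0):
    """Return x, z with zero mean, unit variance, and empirical correlation rho."""
    rng = random.Random(seed)
    x = standardize([rng.gauss(0, 1) for _ in range(n)])
    w = standardize([rng.gauss(0, 1) for _ in range(n)])
    # Remove from w its projection on x, so that w is exactly uncorrelated with x.
    slope = sum(a * b for a, b in zip(x, w)) / n
    w = standardize([b - slope * a for a, b in zip(x, w)])
    z = [rho * a + sqrt(1 - rho ** 2) * b for a, b in zip(x, w)]
    return x, z


def ols_variance(x, z, sigma2=1.0):
    """sigma^2 (X'X)^{-1} for the two-regressor model, as a 2x2 nested list."""
    sxx = sum(a * a for a in x)
    szz = sum(b * b for b in z)
    sxz = sum(a * b for a, b in zip(x, z))
    det = sxx * szz - sxz ** 2
    return [[sigma2 * szz / det, -sigma2 * sxz / det],
            [-sigma2 * sxz / det, sigma2 * sxx / det]]


n, rho, sigma2 = 500, 0.6, 1.0
x, z = make_regressors(n, rho)
v = ols_variance(x, z, sigma2)

cor_betas = v[0][1] / sqrt(v[0][0] * v[1][1])   # should equal -rho
var_sum = v[0][0] + v[1][1] + 2 * v[0][1]       # Var(b1 + b2)
var_diff = v[0][0] + v[1][1] - 2 * v[0][1]      # Var(b1 - b2)

print(cor_betas, var_sum, var_diff)
```

Raising `rho` towards one shrinks `var_sum` towards $\sigma^2/n$ and inflates `var_diff` without bound, which is the trade-off described above.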

One application of this relationship between $\Var(\hat\beta_1\pm\hat\beta_2)$ and $\rho$ is to experimental design. Suppose I want to estimate the effect of receiving two treatments—say, doses of a single vaccine—on some outcome of interest. The $x_i$ and $z_i$ indicate whether individual $i$ receives each dose, the coefficients $\beta_1$ and $\beta_2$ are the average treatment effects (ATEs) of receiving each dose, and the sum $(\beta_1+\beta_2)$ is the ATE of receiving both doses. The most precise estimate of $(\beta_1+\beta_2)$ obtains when the treatments are perfectly positively correlated: that is, when people receive either zero or two doses, but no-one receives only one. Intuitively, I learn more about the effect of receiving two doses from people who receive both than from people who receive only one, so the most informative experiment cannot have anyone who receives a single dose.

On the other hand, suppose I want to compare the effect of two distinct treatments—say, doses of different vaccines—on my outcome of interest. Then I want to estimate $(\beta_1-\beta_2)$, which I can do most precisely when the treatments are perfectly negatively correlated: that is, when people receive one type of vaccine or the other, but no-one receives both. Intuitively, I learn more about the vaccines’ relative effects from people who receive one type than from people who receive both types because the two vaccines may have confounding effects.

Thanks to Lautaro Chittaro for inspiring this post and commenting on a draft.

Rationalizing negative splits

Tue, 18 May 2021 00:00:00 +0000

Many competitive runners aim for negative splits: running the second half of a race faster than the first. A more general goal is to speed up as the race progresses. This post analyzes the conditions under which that goal makes sense. I derive these conditions mathematically, demonstrate them with an example, and discuss some possible extensions to my analysis.

When is speeding up optimal?

Suppose I want to run a unit distance as fast as possible. I choose a speed function $s:[0,1]\to(0,\infty)$ that minimizes my total running time $$\newcommand{\der}{\mathrm{d}} \newcommand{\derfrac}[2]{\frac{\der #1}{\der #2}} \newcommand{\parfrac}[2]{\frac{\partial #1}{\partial #2}} T[s]:=\int_0^1\frac{1}{s(x)}\,\der x,$$ where $x$ indexes distance.¹ However, running uses energy, of which I have a limited supply $e(0)=1$ at the start of my run and which evolves according to $$\derfrac{e(x)}{x}=-r(x,s(x),e(x)),$$ where $r$ determines the rate of energy consumption based on the instantaneous values of $x$, $s(x)$, and $e(x)$. I assume that running faster uses more energy (i.e., $r$ is increasing in $s(x)$) and that I use all of my energy (i.e., $e(1)=0$).

My interest is in how the shape of $s$ depends on the shape of $r$. In particular, I want to know what conditions I have to put on $r$ to make $s$ an increasing function of $x$. I determine these conditions as follows. First, I define the Hamiltonian $$H(x,s(x),e(x),\lambda(x))\equiv-\frac{1}{s(x)}-\lambda(x)r(x,s(x),e(x)),$$ where $\lambda$ is a co-state function. Under some regularity conditions, I can choose the optimizing functions point-wise, so for convenience I let $x\in[0,1]$ be arbitrary and suppress functions’ arguments. Then $s$ and $\lambda$ satisfy the first-order conditions (FOCs) $$\begin{aligned} 0&=H_s=\frac{1}{s^2}-\lambda r_s \\ -\lambda_x&=H_e=-\lambda r_e, \end{aligned}$$ where subscripts denote (partial) differentiation. Differentiating the first FOC with respect to $x$ gives $$\frac{2s_x}{s^3}=-\lambda_xr_s-\lambda r_{sx},$$ which, after substituting back in the two FOCs and dividing by $2\lambda r_s$, becomes $$\frac{s_x}{s}=-\frac{1}{2}\left(r_e+\frac{r_{sx}}{r_s}\right).$$ Thus, if $s_x>0$ then at least one of two conditions on $r$ must hold:

$r_e<0$, which means that I use energy faster when I have less of it;
$r_{sx}/r_s<0$, which, coupled with the assumption that $r_s>0$, means that the energy cost of running fast falls as I cover more distance.

The intuition for the first condition is as follows: energy falls with distance, and if it starts falling faster then I have to start running faster to avoid running out of energy before the finish line. The second condition amplifies this motive to speed up by lowering the cost of running fast as the finish line approaches. I don’t know enough about physiology to know which condition is more plausible, but from experience I’m sympathetic to the second: I’m much less likely to bonk while running if I warm up slowly than if I sprint out of the gate.

A simple example

Suppose I consume energy at the rate $$r(x,s(x),e(x))=(1-ax)s(x)$$ for some parameter $a\in(0,1)$, which determines how the energy cost of running fast changes during my run. That cost is approximately constant when $a\approx0$ and falls more steeply with $x$ as $a$ approaches unity. Given this definition of $r$, and given the boundary conditions $e(0)=1$ and $e(1)=0$, the time-minimizing speed and energy profiles are $$s(x)=\frac{3a}{2\left(1-(1-a)^{3/2}\right)\sqrt{1-ax}}$$ and $$e(x)=1-\frac{1-(1-ax)^{3/2}}{1-(1-a)^{3/2}}.$$ Then $s$ is an increasing function of $x$ and becomes more convex as $a$ rises. The scale of $s$ is pinned down by the energy budget: $\int_0^1(1-ax)s(x)\,\der x=1$, so that $e(1)=0$. The resulting total time is $$T[s]=\left(\frac{2\left(1-(1-a)^{3/2}\right)}{3a}\right)^2,$$ which is close to one when $a\approx0$ and falls towards $4/9$ as $a$ approaches unity, because a larger $a$ makes running cheaper overall. More generally, the time $$t(x)\equiv\int_0^x\frac{1}{s(y)}\,\der y$$ taken to run distance $x\in[0,1]$ satisfies $t(x)/T[s]=1-e(x)$; that is, the proportion of time elapsed always equals the proportion of energy consumed.
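To connect this example to the general condition derived earlier (a sketch of the intermediate step): here $r_s=1-ax$, $r_{sx}=-a$, and $r_e=0$, so $$\frac{s_x}{s}=-\frac{1}{2}\left(0+\frac{-a}{1-ax}\right)=\frac{a}{2(1-ax)}>0.$$ Integrating both sides with respect to $x$ gives $\log s(x)=-\tfrac{1}{2}\log(1-ax)+\text{constant}$, so $s(x)$ is proportional to $(1-ax)^{-1/2}$. Only the second of the two conditions is at work, because $r$ does not depend on $e$.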

The chart below plots $s(x)$ and $t(x)$ when $r=(1-ax)s(x)$. When $a\approx0$, the energy cost of running fast is approximately constant with respect to distance and so the optimal speed profile is approximately flat. As $a$ increases, the cost of running fast increasingly falls with distance and so the optimal speed increasingly rises with distance. Consequently, the percentage of time and energy spent on the first half of the run increases with $a$, starting at 50% when $a\approx0$ and rising to 65% as $a$ approaches unity.
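The first-half shares quoted above can be reproduced numerically. The sketch below is illustrative Python rather than the code behind the chart: it takes the shape $s(x)\propto(1-ax)^{-1/2}$ as given, scales it so that the energy used over the whole run equals one, and integrates with a simple midpoint rule (the helper names and step count are arbitrary choices).

```python
from math import sqrt


def integrate(f, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(f(lo + (k + 0.5) * width) for k in range(steps))


def first_half_shares(a):
    """Shares of total time and of total energy spent on the first half of the run."""
    shape = lambda x: 1 / sqrt(1 - a * x)            # speed profile up to scale
    # Choose the scale so that total energy consumption equals the budget of 1.
    scale = 1 / integrate(lambda x: (1 - a * x) * shape(x), 0, 1)
    speed = lambda x: scale * shape(x)
    pace = lambda x: 1 / speed(x)                    # time per unit distance
    time_share = integrate(pace, 0, 0.5) / integrate(pace, 0, 1)
    energy_share = integrate(lambda x: (1 - a * x) * speed(x), 0, 0.5)
    return time_share, energy_share


for a in (0.001, 0.5, 0.999):
    print(a, first_half_shares(a))
```

For each value of `a` the two shares coincide, and they rise from about one half towards roughly 65% as `a` approaches one.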

Extensions

One way to extend my analysis could be to make the energy consumption rate stochastic. For example, if I run on unfamiliar terrain then I face uncertainty about upcoming obstacles (e.g., steep hills) and the energy cost of overcoming those obstacles. This uncertainty would encourage me to start my run slowly as a form of precautionary saving, resulting in negative splits.

Another extension could be to model the different energy systems used when running at different speeds. For example, short sprints use the anaerobic system, which burns carbohydrates for fuel, while long slow runs use the aerobic system, which also burns fat for fuel. Adding more energy systems would allow for richer, more realistic dynamics, but would require more domain knowledge than I possess to set up the inter-dependencies correctly.

Thanks to Logan Donald and Florian Fiaux for commenting on a draft version of this post.

I formalize my pacing problem as a “continuous-time” optimal control problem. I consider a discrete-time version of this problem here. ↩︎

Stable matchings with noisy preferences

Sun, 02 May 2021 00:00:00 +0000

My previous post described Gale and Shapley’s (1962) algorithm for solving the stable matching problem. The algorithm delivers a matching between two sets $A$ and $B$ of $n$ people with preferences over matches in the other set.

The Gale-Shapley (GS) algorithm works by letting people in $A$ make proposals to people in $B$, who “tentatively accept” or reject proposals until the matching market clears. Consequently, if one side of the market is more informed about match qualities than the other side then the algorithm could generate different levels of welfare depending on which side makes proposals.

For example, suppose $a\in A$ and $b\in B$ generate surplus $S_{ab}$ from being matched. This surplus has a monetary value (representing, e.g., the price $a$ and $b$ would pay to be matched) so can be aggregated across pairs meaningfully. Both $a$ and $b$ want the match that gives them the greatest surplus. However, they perceive match surpluses noisily: person $a$ thinks their surplus from matching with $b$ is $$S_{ab}^A=S_{ab}+\epsilon_{ab}^A,$$ while $b$ thinks their surplus from matching with $a$ is $$S_{ab}^B=S_{ab}+\epsilon_{ab}^B.$$ The $S_{ab}$ are iid standard normal, the $\epsilon_{ab}^A$ are iid normal with mean zero and variance $\sigma_A^2$, and the $\epsilon_{ab}^B$ are iid normal with mean zero and variance $\sigma_B^2$. Increasing $\sigma_A$ and $\sigma_B$ increases the errors in perceived surpluses. These errors disappear when the matching is made and the “true” surpluses $S_{ab}$ (representing people’s true preferences) are realized.

I compare the distribution of mean match surpluses delivered by four matching procedures:

MBM: the maximum-weight bipartite matching based on the true match surpluses $S_{ab}$;
GS-A: the GS algorithm with people in $A$ proposing based on their perceived match surpluses $S_{ab}^A$;
GS-B: the GS algorithm with people in $B$ proposing based on their perceived match surpluses $S_{ab}^B$;
Feasible MBM: the maximum-weight bipartite matching based on the precision-weighted mean perceived match surpluses $$\hat{S}_{ab}=\begin{cases} S_{ab} & \text{if}\ \sigma_A=0\ \text{or}\ \sigma_B=0 \\ \lambda S_{ab}^A+(1-\lambda)S_{ab}^B & \text{otherwise}, \end{cases}$$ where $$\lambda=\frac{1/\sigma_A^2}{1/\sigma_A^2+1/\sigma_B^2}$$ is the relative precision of $A$ members’ perceptions when $\min\{\sigma_A,\sigma_B\}>0$. Feasible MBM replicates MBM when $\min\{\sigma_A,\sigma_B\}=0$.
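The precision weighting behind Feasible MBM can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the simulation code behind this post: the function names are made up, the market is tiny ($n=4$), and the maximum-weight matching is found by brute force over permutations, which is only practical for small $n$.

```python
import random
from itertools import permutations


def relative_precision(sigma_a, sigma_b):
    """Weight on A's perceptions; defined only when both noise levels are positive."""
    prec_a, prec_b = 1 / sigma_a ** 2, 1 / sigma_b ** 2
    return prec_a / (prec_a + prec_b)


def combine(s_a, s_b, sigma_a, sigma_b):
    """Precision-weighted mean of two perceived surpluses."""
    if sigma_a == 0:
        return s_a  # A perceives the true surplus exactly
    if sigma_b == 0:
        return s_b  # B perceives the true surplus exactly
    lam = relative_precision(sigma_a, sigma_b)
    return lam * s_a + (1 - lam) * s_b


def max_weight_matching(weights):
    """Brute-force maximum-weight bipartite matching: match[a] is a's partner in B."""
    n = len(weights)
    return max(permutations(range(n)),
               key=lambda m: sum(weights[a][m[a]] for a in range(n)))


rng = random.Random(1)
n, sigma_a, sigma_b = 4, 1.0, 5.0
true_s = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
seen_a = [[s + rng.gauss(0, sigma_a) for s in row] for row in true_s]
seen_b = [[s + rng.gauss(0, sigma_b) for s in row] for row in true_s]
estimate = [[combine(seen_a[a][b], seen_b[a][b], sigma_a, sigma_b)
             for b in range(n)] for a in range(n)]


def mean_true_surplus(match):
    """Mean true surplus realized under a matching."""
    return sum(true_s[a][match[a]] for a in range(n)) / n


mbm = max_weight_matching(true_s)          # uses the true surpluses
feasible = max_weight_matching(estimate)   # uses the pooled estimates
print(mean_true_surplus(mbm), mean_true_surplus(feasible))
```

By construction, the matching chosen on true surpluses can never do worse than the one chosen on pooled estimates when both are scored on true surpluses.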

The MBM procedure maximizes the sum of true match surpluses, while the Feasible MBM procedure maximizes the sum of the best match surplus estimates that people in $A$ and $B$ could obtain by communicating. The GS-A and GS-B procedures do not allow such communication, but guarantee that the ultimate matching is stable. I run all four procedures 1,000 times for $\sigma_A\in\{0,1,5\}$ and $\sigma_B\in\{0,1,5\}$, and summarize my results in the figure below. All four procedures deliver mean match surpluses greater than zero, implying that people tend to do better by following the procedures than by forming matches randomly.

The mean match surpluses delivered by the GS-A, GS-B, and Feasible MBM procedures fall as $\sigma_A$ and $\sigma_B$ rise. Intuitively, these three procedures rely on preferences reported by the people in $A$ and $B$, and if those preferences become noisier then the procedures become worse at finding good matches.

Feasible MBM tends to outperform GS-A and GS-B when $\sigma_A$ or $\sigma_B$ is small. However, the performance gain is negligible when $\sigma_A$ and $\sigma_B$ are large. Intuitively, if perceived match surpluses are mostly noise then sharing that noise doesn’t help with finding better matches.

The GS algorithm tends to find better matches when the people making proposals are the ones with less noisy preferences. Both sides of the matching market provide information that determines the ultimate matching: the proposing side provides information actively through proposals, whereas the non-proposing side provides information passively through proposal acceptances and rejections. Letting the more-informed side make proposals allows more information to feed into the matching process, leading to better matches on average.

Thanks to Spencer Pantoja for inspiring this post and to Al Roth for his comments.

Stable matchings

Mon, 19 Apr 2021 00:00:00 +0000

Let $A$ and $B$ be sets of $n$ people. A “matching” is a collection of pairs $(a,b)$ with $a\in A$ and $b\in B$ such that everyone in $A\cup B$ belongs to exactly one pair. For example, if $A$ and $B$ are sets of men and women then a matching could define a collection of monogamous, heterosexual marriages.

Suppose the people in each set have (complete, strict) preferences over potential matches in the other set. A matching is “stable” if there is no pair $(a,b)$ of people who are not matched to each other but who both prefer each other to their assigned matches. Gale and Shapley (1962) show that a stable matching always exists and describe an algorithm for finding one: Let each person $a\in A$ without a match “propose” to their most preferred person $b\in B$ to whom they haven’t already proposed. If $b$ is unmatched then they tentatively accept the proposal; if $b$ is matched to $a'$ but prefers $a$ then they tentatively accept the proposal and reject $a'$; otherwise, $b$ rejects the proposal. Repeat this process until everyone is matched.
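The stability condition is easy to check mechanically. The sketch below is a minimal Python illustration with hypothetical names and a made-up two-by-two market: it lists every pair who are not matched to each other but who would both rather be together than with their assigned matches.

```python
def blocking_pairs(matching, prefs_a, prefs_b):
    """Return pairs (a, b) who prefer each other to their assigned matches.

    matching is a list of (a, b) pairs; prefs_a and prefs_b map each person
    to a list of people on the other side, most preferred first.
    """
    partner_of_a = dict(matching)
    partner_of_b = {b: a for a, b in matching}
    pairs = []
    for a, ranking in prefs_a.items():
        for b in ranking:
            if b == partner_of_a[a]:
                break  # everyone further down the list is worse than a's match
            rank_b = prefs_b[b]
            if rank_b.index(a) < rank_b.index(partner_of_b[b]):
                pairs.append((a, b))
    return pairs


# Hypothetical market in which everyone agrees on who is best.
prefs_a = {"a1": ["b1", "b2"], "a2": ["b1", "b2"]}
prefs_b = {"b1": ["a1", "a2"], "b2": ["a1", "a2"]}

unstable = [("a1", "b2"), ("a2", "b1")]
stable = [("a1", "b1"), ("a2", "b2")]
print(blocking_pairs(unstable, prefs_a, prefs_b))
print(blocking_pairs(stable, prefs_a, prefs_b))
```

A matching is stable exactly when this function returns an empty list.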

Optimality and strategy-proofness

The Gale-Shapley (GS) algorithm always delivers a stable matching that is best for everyone in $A$ among all stable matchings. To see why, suppose $a\in A$ is matched to $b\in B$ but prefers $b'\in B\setminus\{b\}$. Then $b'$ must have received a proposal from some $a'\in A\setminus\{a\}$ whom they prefer to $a$. Consequently, $a$ cannot form a “blocking pair” with $b'$ (and thereby break the stable matching) because $b'$ would rather be matched to $a'$. Thus $b$ is the best match $a$ can get if the matching is stable.

On the other hand, the GS algorithm always delivers a stable matching that is worst for everyone in $B$ among all stable matchings. To see why, suppose $b\in B$ is matched to $a\in A$ in some matching $\mathcal{M}$ obtained using the GS algorithm. Suppose further that $b$ prefers $a$ to some $a'\in A\setminus\{a\}$ and assume towards a contradiction that there is a stable matching $\mathcal{M}'$ in which $b$ is matched to $a'$. Then $a$ is matched to some $b'\in B\setminus\{b\}$ in $\mathcal{M}'$. Now $\mathcal{M}$ was obtained using the GS algorithm, so it gives $a$ their top preference among all stable matchings. Consequently, $a$ must prefer $b$ to $b'$. But then $a$ and $b$ form a blocking pair in $\mathcal{M}'$, contradicting its stability. Thus $a'$ cannot exist; that is, $a$ is the worst match $b$ can get among all stable matchings.

The GS algorithm is strategy-proof for everyone in $A$: no-one in $A$ can do better by misreporting their preferences (Roth, 1982), nor can any subset of $A$ coordinate to do (strictly) better (Dubins and Freedman, 1981). However, people in $B$ may be able to do better. For example, suppose the preferences among people in $A=\{a_1,a_2,a_3\}$ and $B=\{b_1,b_2,b_3\}$ are given by $$\begin{align*} b_2&\succ_{a_1}b_1\succ_{a_1}b_3 \\ b_1&\succ_{a_2}b_2\succ_{a_2}b_3 \\ b_1&\succ_{a_3}b_2\succ_{a_3}b_3 \\ a_1&\succ_{b_1}a_3\succ_{b_1}a_2 \\ a_3&\succ_{b_2}a_1\succ_{b_2}a_2 \\ a_1&\succ_{b_3}a_2\succ_{b_3}a_3, \end{align*}$$ where $j\succ_ik$ means that $i$ prefers $j$ to $k$. Applying the GS algorithm to these preferences delivers the stable matching $\{(a_1,b_2),(a_2,b_3),(a_3,b_1)\}$. But if $b_1$ misreported their preferences as $a_1\succ_{b_1}a_2\succ_{b_1}a_3$ then the algorithm would deliver $\{(a_1,b_1),(a_2,b_3),(a_3,b_2)\}$, which $b_1$ prefers.
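The example above can be replayed with a short implementation of the proposal process. This is a minimal Python sketch, not a reference implementation: the function name and the order in which unmatched people propose are arbitrary choices, and preferences are encoded as lists ordered from most to least preferred.

```python
from collections import deque


def gale_shapley(prefs_a, prefs_b):
    """Return the matching {a: b} reached when people in A propose."""
    rank_b = {b: {a: i for i, a in enumerate(order)}
              for b, order in prefs_b.items()}
    next_choice = {a: 0 for a in prefs_a}  # index of a's next proposal
    holder = {}                            # b -> a whose proposal b tentatively holds
    unmatched = deque(prefs_a)
    while unmatched:
        a = unmatched.popleft()
        b = prefs_a[a][next_choice[a]]
        next_choice[a] += 1
        current = holder.get(b)
        if current is None:
            holder[b] = a                  # b was unmatched: tentatively accept
        elif rank_b[b][a] < rank_b[b][current]:
            holder[b] = a                  # b prefers a: swap and reject current
            unmatched.append(current)
        else:
            unmatched.append(a)            # b rejects a
    return {a: b for b, a in holder.items()}


prefs_a = {"a1": ["b2", "b1", "b3"],
           "a2": ["b1", "b2", "b3"],
           "a3": ["b1", "b2", "b3"]}
prefs_b = {"b1": ["a1", "a3", "a2"],
           "b2": ["a3", "a1", "a2"],
           "b3": ["a1", "a2", "a3"]}

truthful = gale_shapley(prefs_a, prefs_b)

# b1 misreports their ranking of a2 and a3.
misreported_b = dict(prefs_b, b1=["a1", "a2", "a3"])
manipulated = gale_shapley(prefs_a, misreported_b)

print(truthful)
print(manipulated)
```

Under truthful reports $b_1$ ends up with $a_3$; after the misreport $b_1$ ends up with $a_1$, their true first choice.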

Convergence

Since each person in $A$ proposes to each person in $B$ at most once, the GS algorithm never requires more than $n^2$ proposals. However, the algorithm typically requires fewer proposals. For example, suppose the utility $a\in A$ derives from being matched to $b\in B$ is $$U_{ab}=\rho W_b+(1-\rho)X_{ab},$$ where $W_b$ and $X_{ab}$ are iid uniformly distributed on the unit interval $[0,1]$, and where $\rho$ indexes the correlation of match utilities. Similarly, suppose the utility $b$ derives from being matched to $a$ is $$V_{ba}=\rho Y_a+(1-\rho)Z_{ba},$$ where $Y_a$ and $Z_{ba}$ are also iid uniform on $[0,1]$. The utilities $U_{ab}$ and $V_{ba}$ determine people’s preferences over matches, and increasing $\rho$ makes those preferences more homogeneous. The chart below shows how the number of proposals required by the GS algorithm covaries with $\rho$ when $n=20$.

On average, more proposals are required when preferences are more homogeneous. Intuitively, increasing $\rho$ makes it more likely that an early tentative acceptance will become a rejection, forcing the rejected person to make another proposal. If $\rho=1$ then the GS algorithm always requires $$\sum_{x=1}^nx=\frac{n(n+1)}{2}$$ proposals. To see why, notice that if $\rho=1$ then everyone in $A$ has the same preferences over everyone in $B$ and vice versa. Consequently, the person in $A$ most preferred by the people in $B$ always gets their first choice, the person in $A$ second-most preferred by the people in $B$ always gets their second choice, and so on. But since everyone in $A$ has the same preferences, each person in $A$ must make as many proposals as their rank in the (common) preference ordering among the people in $B$.
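As a small worked case (with labels introduced here for illustration), take $n=3$ and suppose everyone in $B$ ranks $a_1\succ a_2\succ a_3$ while everyone in $A$ ranks $b_1\succ b_2\succ b_3$. Then $a_1$ proposes to $b_1$ and is never displaced; $a_2$ is turned down by $b_1$ and then accepted by $b_2$; and $a_3$ is turned down by $b_1$ and $b_2$ before being accepted by $b_3$. The total number of proposals is $$1+2+3=6=\frac{3\times4}{2},$$ regardless of the order in which unmatched people take their turns.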

Limitations

One limitation of the GS algorithm is that it assumes everyone has strict, complete preferences over potential matches. This assumption may not hold in practice: $a\in A$ could be indifferent between $b\in B$ and $b'\in B$, or $a$ may not even know who is in $B$ let alone the utilities derived from being matched to them. Irving (1994) generalizes the GS algorithm to handle situations with indifferences, while Manlove et al. (2002) describe the computational complexity generated by allowing for incomplete preferences.

Another limitation of the GS algorithm is that it always delivers a stable matching that is best for people in $A$ and worst for people in $B$. This “extremal” property of the algorithm’s output motivates alternative algorithms (e.g., those by Roth and Vande Vate (1990) and Romero-Medina (2005), and more recently Dworczak (2021) and Kuvalekar and Romero-Medina (2021)) that deliver ex ante fairer matchings by randomizing whose preferences (i.e., people in $A$ or people in $B$) are used to form matches.

A third limitation is that the GS algorithm assumes match utilities do not depend on the sequence of proposals. In particular, the algorithm assumes that $a\in A$ derives the same utility from being matched to $b\in B$ regardless of how much $b$ wants to be matched to $a$. This assumption seems unrealistic: if I proposed to someone but later learned I was the last person they wanted to marry then that lesson would surely affect my comfort with the proposal. One way to resolve this issue could be to run the algorithm many times, allowing people to revise their preferences at each run based on the matching obtained in the previous run. However, this approach could be expensive (computationally, cognitively, and emotionally) and might not converge if people’s preference revisions aren’t well-behaved.

Female representation and collaboration at the NBER

Mon, 29 Mar 2021 00:00:00 +0000

This post analyzes the representation of, and collaboration among, female authors of NBER working papers over the last four decades. My analysis uses paper-author correspondences provided by the R package nberwp.

Estimating sexes

I estimate authors’ sexes using the R package gender, which provides access to historical baby name data from the US Social Security Administration. I focus on baby names between 1940 and 1995 because these roughly correspond to (what I expect are) the birth years of authors who published NBER working papers during the 1980s through 2010s.

Comparing authors’ first names to the frequency of female and male baby names allows me to estimate the probability that each author is female. For example, 3% of babies named Alex between 1940 and 1995 were female, so the estimated probability that an author named Alex is female is 0.03. Rounding each probability to the nearest integer estimates the binary indicator variable for whether each author is female.

The table below reports the number of NBER working papers and authors during the 1980s, 1990s, 2000s, and 2010s. It also reports the percentage of those authors whom I estimate to be female, as well as the percentage of authors whose sexes I can estimate. The number of authors roughly doubled each decade, and the percentage of those authors whom I estimate to be female almost doubled between the 1980s and 2010s.

Decade	Papers	Authors	% authors female	% authors with estimable sex
1980s	2,820	972	14.1	93.9
1990s	4,213	2,211	19.7	88.3
2000s	8,188	5,118	24.0	85.5
2010s	10,970	9,519	27.0	84.1

The percentage of authors with estimable sex is less than 100% because some authors (i) never listed their first names on their papers’ bylines (e.g., always published as “J. Smith”) or (ii) have first names that do not appear in the baby name data. Throughout this post, I assume that conditions (i) and (ii) occur at the same rate for both sexes. Almost all (99.9%) of the authors satisfying either condition satisfy (ii) because they have foreign names. The decrease in sex estimability over time reflects the increase in (co-)authorship of NBER working papers by researchers born outside the United States.

Representation across research programs

The NBER organizes its research into programs, each of which “corresponds loosely to a traditional field of study within economics.” I count the papers associated with each program in the appendix below. The largest programs are Labor Studies, Economic Fluctuations and Growth, and Public Economics, reflecting the NBER’s focus on policy-relevant economic research.

The table below reports the percentage of authors whom I estimate to be female in each of the NBER’s ten largest research programs. I pool the remaining eleven programs into an “Other” program and report separate percentages for each decade. The percentage of female authors grew over time, both overall and within each of the tabulated programs, and was larger in programs that are relatively focused on individual-level outcomes (e.g., Labor Studies and Health Economics). I omit the percentages for Asset Pricing and Corporate Finance in the 1980s because at most one paper was associated with each of those programs during that decade.

Program	1980s	1990s	2000s	2010s
Labor Studies (LS)	19.1	26.2	27.1	29.9
Economic Fluctuations and Growth (EFG)	5.9	9.2	17.4	18.9
Public Economics (PE)	8.5	16.7	21.9	26.3
International Finance and Macroeconomics (IFM)	14.5	13.8	16.5	20.7
International Trade and Investment (ITI)	14.7	15.9	23.6	23.2
Monetary Economics (ME)	5.3	11.8	13.9	17.5
Asset Pricing (AP)	-	10.7	16.5	18.1
Productivity, Innovation, and Entrepreneurship (PR)	15.4	23.4	22.2	24.1
Corporate Finance (CF)	-	12.9	22.4	20.6
Health Economics (HE)	20.4	23.3	33.9	33.5
Other	11.2	22.7	26.4	28.4
All	14.1	19.7	24.0	27.0

Another way to analyze female representation is to compare the density of female-authored working papers across programs. I present this comparison in the chart below, focusing on papers published during the 2010s. The horizontal axis measures the percentage of working papers published by female authors in each program. I compute these percentages by counting papers “fractionally” so that, for example, papers with two authors and three associated programs contribute a sixth of a paper to the count for each author-program pair. This method avoids double-counting papers across programs and sexes. Aggregating fractional counts by program and sex allows me to estimate the percentage of working papers published in each program by female authors. I order programs by percentage of female authorship and color them according to a categorization based on that used by Chari and Goldsmith-Pinkham (2017).
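The fractional counting rule can be illustrated with a few lines of Python. The papers below are made up, and the data layout (a list of dictionaries with `sexes` and `programs` keys) is a hypothetical one chosen for the sketch rather than the structure of nberwp.

```python
from collections import defaultdict
from fractions import Fraction


def fractional_counts(papers):
    """Map (program, sex) to fractional paper counts."""
    counts = defaultdict(Fraction)
    for paper in papers:
        # Each author-program pair receives an equal share of the paper.
        weight = Fraction(1, len(paper["sexes"]) * len(paper["programs"]))
        for sex in paper["sexes"]:
            for program in paper["programs"]:
                counts[(program, sex)] += weight
    return counts


# Two hypothetical papers: one with two authors and three programs,
# and one with a single author and a single program.
papers = [
    {"sexes": ["F", "M"], "programs": ["LS", "PE", "HE"]},
    {"sexes": ["F"], "programs": ["LS"]},
]

counts = fractional_counts(papers)
ls_total = counts[("LS", "F")] + counts[("LS", "M")]
ls_female_share = counts[("LS", "F")] / ls_total
print(dict(counts))
print(ls_female_share)
```

Because every paper's weights sum to one, the fractional counts add up to the number of papers, which is what prevents double-counting across programs and sexes.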

Overall, females wrote about 21% of the working papers published during the 2010s. These papers were relatively concentrated among programs focused on applied microeconomics rather than on macroeconomics or finance. These patterns echo those presented by Chari and Goldsmith-Pinkham (2017), and could reflect differences in academic culture between different branches of economics (see, e.g., Dupas et al., 2021).

Co-authorship patterns

I infer the collaboration patterns among NBER authors from the working paper co-authorship network for each decade. In each network, nodes correspond to authors who published at least one working paper during that decade, and edges join authors who co-authored at least one working paper during that decade. The table below summarizes each network. The networks grew larger and less dense over time, while the rise in mean degree—that is, the mean number of co-authors—reflects the rise in co-authorship among economists documented in other studies (e.g., Rath and Wohlrabe, 2017).

Decade	Nodes	Edges	Edge density (%)	Mean degree
1980s	972	1,197	0.25	2.46
1990s	2,211	3,062	0.13	2.77
2000s	5,118	8,890	0.07	3.47
2010s	9,519	21,455	0.05	4.51
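The last two columns of this table follow directly from the node and edge counts. As a minimal Python sketch (function names are illustrative): edge density is the number of edges divided by the number of possible author pairs, and mean degree is twice the number of edges divided by the number of nodes.

```python
def edge_density(nodes, edges):
    """Share of possible node pairs that are joined by an edge."""
    return edges / (nodes * (nodes - 1) / 2)


def mean_degree(nodes, edges):
    """Mean number of neighbors; each edge contributes to two nodes' degrees."""
    return 2 * edges / nodes


# Node and edge counts copied from the table above.
networks = {
    "1980s": (972, 1197),
    "1990s": (2211, 3062),
    "2000s": (5118, 8890),
    "2010s": (9519, 21455),
}

summary = {
    decade: (round(100 * edge_density(n, m), 2), round(mean_degree(n, m), 2))
    for decade, (n, m) in networks.items()
}
for decade, (density_pct, degree) in summary.items():
    print(decade, density_pct, degree)
```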

The figure below compares the co-authorship network degree distributions for each sex. Females tended to have fewer co-authors than males, but the mean difference was small and fell over time (from 0.78 during the 1980s to 0.66 during the 2010s).

The next three tables describe structural properties of each decade’s co-authorship network based on authors’ estimated sexes. These properties may be sensitive to estimation errors. Therefore, rather than report point estimates for each property, I report 95% confidence intervals obtained using the following bootstrap procedure:

Randomly assign each author to be female according to the probabilities obtained from the baby name data.
Compute each structural property under the randomized assignment.
Repeat the preceding two steps 1,000 times to obtain bootstrap distributions of each property.
Use the 2.5% and 97.5% quantiles of the bootstrap distributions as the lower and upper confidence bounds.
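The four steps above can be sketched as follows. This is illustrative Python rather than the code behind the tables: the function names are hypothetical, the quantiles use a simple nearest-rank rule, and the statistic shown (the share of edges joining two authors of the same sex) is a stand-in for the structural properties examined below.

```python
import random


def same_sex_edge_share(edges, is_female):
    """Share of edges whose endpoints have the same assigned sex."""
    same = sum(is_female[u] == is_female[v] for u, v in edges)
    return same / len(edges)


def bootstrap_interval(edges, prob_female, statistic, draws=1000, seed=0):
    """95% interval for a statistic under randomized sex assignments."""
    rng = random.Random(seed)
    values = []
    for _ in range(draws):
        # Step 1: assign each author to be female with their estimated probability.
        is_female = {author: rng.random() < p for author, p in prob_female.items()}
        # Step 2: compute the property under this assignment.
        values.append(statistic(edges, is_female))
    # Steps 3 and 4: take quantiles of the values collected over all draws.
    values.sort()
    lower = values[int(0.025 * (draws - 1))]
    upper = values[int(0.975 * (draws - 1))]
    return lower, upper


# Hypothetical authors, probabilities, and co-authorships.
prob_female = {"u": 0.97, "v": 0.03, "w": 0.60, "x": 0.10}
edges = [("u", "v"), ("u", "w"), ("v", "x"), ("w", "x")]
lower, upper = bootstrap_interval(edges, prob_female, same_sex_edge_share)
print(lower, upper)
```

When every probability is zero or one, the assignment never changes across draws and the interval collapses to a single point.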

The first property I examine is the clustering coefficient: the probability that two authors were co-authors given that they shared a common co-author. The table below compares the clustering coefficient of the full co-authorship network in each decade with the clustering coefficient of the sub-networks induced by the sets of authors whom I estimate to be female and male.

Clustering coefficient	1980s	1990s	2000s	2010s
Overall	0.17	0.18	0.21	0.24
Among females (95% CI)	(0.39, 0.50)	(0.41, 0.50)	(0.30, 0.35)	(0.32, 0.35)
Among males (95% CI)	(0.16, 0.17)	(0.17, 0.17)	(0.20, 0.21)	(0.23, 0.23)

The female sub-networks were much more clustered than the full and male networks. Such clustering suggests a stronger tendency among females to close triads by collaborating with other females with whom they share a common (female) co-author. The decline in clustering among females over time could reflect the rise in between-sex co-authorship: the percentage of co-authored papers with at least one author of each sex was about 16% in the 1980s, and rose to 25%, 35%, and 42% in the subsequent three decades.
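For concreteness, here is one way to compute a clustering coefficient of this kind. The sketch assumes the global (transitivity) version of the coefficient, namely the share of connected triples that are closed, which matches the verbal definition above but may differ in detail from the computation behind the table. The function names and toy network are hypothetical.

```python
from itertools import combinations


def clustering_coefficient(edges):
    """Share of pairs with a common neighbor that are themselves joined by an edge."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    triples = closed = 0
    for nbrs in neighbors.values():
        # Each pair of neighbors of a node shares that node as a common co-author.
        for u, v in combinations(sorted(nbrs), 2):
            triples += 1
            closed += v in neighbors[u]
    return closed / triples if triples else float("nan")


def induced_edges(edges, keep):
    """Edges of the sub-network induced by the nodes in keep."""
    return [(u, v) for u, v in edges if u in keep and v in keep]


# Toy network: a triangle (p, q, r) with one extra co-author s attached to r.
edges = [("p", "q"), ("q", "r"), ("p", "r"), ("r", "s")]
overall = clustering_coefficient(edges)
triangle_only = clustering_coefficient(induced_edges(edges, {"p", "q", "r"}))
print(overall, triangle_only)
```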

The next property I examine is the assortativity coefficient, which measures the extent to which authors tended to co-author with members of the same sex. The coefficient equals 1 when there is perfect sorting (i.e., no between-sex edges), −1 when there is perfect dis-sorting (i.e., no within-sex edges), and 0 when there is no sorting (i.e., the network is “as random”). The table below shows that each network’s assortativity coefficient was positive, implying that within-sex co-authorship was more common than we would expect if co-authorships were random.

Decade	Assort. coeff. (95% CI)
1980s	(0.05, 0.09)
1990s	(0.08, 0.11)
2000s	(0.07, 0.09)
2010s	(0.08, 0.10)
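A sketch of how such a coefficient can be computed, assuming Newman's assortativity coefficient for a categorical attribute (the source does not say which implementation was used, so treat this as one plausible reading). The function name and the toy networks are hypothetical.

```python
def assortativity(edges, group):
    """Newman's assortativity coefficient for a categorical node attribute.

    Undefined (division by zero) when every node belongs to the same group.
    """
    labels = sorted(set(group.values()))
    ends = {(g, h): 0 for g in labels for h in labels}
    for u, v in edges:
        # Count each undirected edge once in each direction.
        ends[(group[u], group[v])] += 1
        ends[(group[v], group[u])] += 1
    total = 2 * len(edges)
    same = sum(ends[(g, g)] for g in labels) / total
    # Share of same-group edge ends expected if edge ends were paired at random.
    expected = sum((sum(ends[(g, h)] for h in labels) / total) ** 2
                   for g in labels)
    return (same - expected) / (1 - expected)


group = {"f1": "F", "f2": "F", "m1": "M", "m2": "M"}
sorted_network = [("f1", "f2"), ("m1", "m2")]      # no between-sex edges
dissorted_network = [("f1", "m1"), ("f2", "m2")]   # no within-sex edges
print(assortativity(sorted_network, group))
print(assortativity(dissorted_network, group))
```

The two toy networks reproduce the extreme cases described above: perfect sorting and perfect dis-sorting.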

Computing assortativity coefficients across all programs may mask program-specific patterns. I explore these patterns in my final table below, which reports 95% confidence intervals for the assortativity coefficient of the co-authorship network within each of the NBER’s ten largest research programs. I label programs by their abbreviations so that the table is not too wide.

Program	1980s	1990s	2000s	2010s
LS	(0.16, 0.27)	(0.13, 0.19)	(0.05, 0.09)	(0.09, 0.11)
EFG	(-0.07, 0.08)	(-0.07, 0.01)	(-0.02, 0.02)	(0.02, 0.06)
PE	(0.04, 0.14)	(-0.01, 0.05)	(0.03, 0.07)	(0.05, 0.07)
IFM	(-0.05, 0.04)	(-0.01, 0.08)	(-0.01, 0.05)	(0.03, 0.08)
ITI	(-0.06, 0.04)	(0.01, 0.09)	(0.00, 0.07)	(0.05, 0.10)
ME	(-0.07, 0.04)	(-0.03, 0.06)	(-0.10, -0.03)	(0.03, 0.09)
AP	-	(-0.06, 0.07)	(-0.01, 0.05)	(0.00, 0.05)
PR	(-0.15, -0.01)	(0.12, 0.22)	(0.02, 0.09)	(0.07, 0.11)
CF	-	(-0.04, 0.07)	(-0.03, 0.04)	(0.03, 0.09)
HE	(-0.14, -0.01)	(0.01, 0.09)	(0.01, 0.05)	(0.07, 0.10)
All	(0.05, 0.09)	(0.08, 0.11)	(0.07, 0.09)	(0.08, 0.10)

The network among authors in the Labor Studies (LS) program became less sorted over time, whereas the network among authors in the Health Economics (HE) program became more sorted over time. But the representation of women in both of those programs grew over time, suggesting that the mechanisms promoting female representation were different than the mechanisms promoting female collaboration. It would be interesting to explore these mechanisms further, but I’ll leave that for a future post.

Acknowledgements

Thanks to Mohamad Adhami, Florencia Hnilo and Akhila Kovvuri for reading draft versions of this post.

Appendix

The table below (fractionally) counts working papers by program and decade. I present programs in decreasing order of associated papers across all four decades.

Program	1980s	1990s	2000s	2010s
Labor Studies (LS)	454	635	868	1,081
Economic Fluctuations and Growth (EFG)	458	471	921	1,083
Public Economics (PE)	445	557	827	993
International Finance and Macroeconomics (IFM)	374	466	731	662
International Trade and Investment (ITI)	370	517	631	525
Monetary Economics (ME)	418	327	389	514
Asset Pricing (AP)	0	221	610	627
Productivity, Innovation, and Entrepreneurship (PR)	96	231	371	563
Corporate Finance (CF)	1	131	436	560
Health Economics (HE)	82	115	355	497
Development of the American Economy (DAE)	45	78	311	379
Industrial Organization (IO)	0	82	300	432
Economics of Aging (AG)	33	126	237	340
Health Care (HC)	0	100	248	316
Environment and Energy Economics (EEE)	1	6	138	483
Economics of Education (ED)	0	1	209	411
Children (CH)	2	35	246	297
Political Economics (POL)	0	0	141	415
Law and Economics (LE)	20	57	188	231
Development Economics (DEV)	0	0	0	462
Technical Working Papers (TWP)	0	0	25	95
None	24	58	7	0
Total	2,820	4,213	8,188	10,970

Monopoly equilibrium in insurance markets

Fri, 19 Feb 2021 00:00:00 +0000

This post shows how monopoly insurance pricing can lead to inefficient risk sharing. I describe a mathematical model of the monopoly equilibrium, present a numerical example, and discuss some limitations of my analysis.

Model

Suppose I have initial wealth $w_0$ and suffer a loss of size $L$ with probability $p$. I can buy $c\in[0,L]$ units of insurance coverage at per-unit price $\lambda p$, where $\lambda\ge1$ is a loading factor set by my insurer. I choose the amount of coverage $c^*$ that maximizes my expected utility $$EU(c)\equiv(1-p)u(w_0-\lambda p c)+pu(w_0-\lambda pc-L+c),$$ where $$u(w)\equiv-\frac{1}{a}\exp(-aw)$$ is my utility function and $a>0$ is my coefficient of absolute risk aversion. Solving the first-order condition for $c^*$ gives $$c^*=L-\frac{1}{a}\log\left(\frac{\lambda(1-p)}{1-\lambda p}\right),$$ which equals $L$ when $\lambda=1$ (i.e., the premium is actuarially fair) and equals zero when $\lambda$ equals $$\lambda_{\text{max}}=\frac{1}{p+(1-p)\exp(-aL)}.$$ This limiting value of $\lambda$ approaches one as $aL$ approaches zero (I won’t buy insurance if I am risk neutral or face no risk) and is always less than $1/p$. For $\lambda\in(1,\lambda_{\text{max}})$, the slope $$\newcommand{\parfrac}[2]{\frac{\partial #1}{\partial #2}} \parfrac{c^*}{\lambda}=-\frac{1}{a\lambda(1-\lambda p)}$$ of my demand curve is strictly negative, implying that I view insurance as an ordinary good.

Now suppose my insurer knows my demand for coverage $c^*\equiv C(\lambda)$ given the loading factor $\lambda$, as well as the other parameters in my choice environment. Then they can choose $\lambda$ to maximize their expected profit $$\pi(\lambda)\equiv(\lambda-1)pC(\lambda),$$ which equals the premium I pay minus the expected cost of indemnifying me. If $L>0$ then the profit-maximizing loading factor $\lambda^*$ is strictly between one and $\lambda_{\text{max}}$, and setting $\lambda=\lambda^*$ gives my insurer positive expected profit. But then I demand partial coverage $C(\lambda^*)<L$, which is allocatively inefficient because I am risk averse but my insurer is risk neutral: having the insurer bear more of my risk would make me better off but my insurer no worse off. Consequently, we suffer a deadweight loss relative to the equilibrium in which my insurer sets $\lambda=1$, I demand full coverage, and my insurer bears all of my risk.
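The demand and profit expressions above are easy to evaluate numerically. The following is a minimal Python sketch, assuming a plain grid search over $(1,\lambda_{\text{max}})$ is accurate enough; the function names and grid resolution are illustrative choices, and the parameter values are the ones used in the numerical example below.

```python
import math


def coverage(lam, p, a, L):
    """Demand for coverage C(lambda), truncated at zero."""
    c = L - math.log(lam * (1 - p) / (1 - lam * p)) / a
    return max(c, 0.0)


def lambda_max(p, a, L):
    """Loading factor at which demand for coverage falls to zero."""
    return 1 / (p + (1 - p) * math.exp(-a * L))


def profit(lam, p, a, L):
    """Insurer's expected profit (lambda - 1) * p * C(lambda)."""
    return (lam - 1) * p * coverage(lam, p, a, L)


def best_loading(p, a, L, steps=100_000):
    """Approximate the profit-maximizing loading factor on an evenly spaced grid."""
    lo, hi = 1.0, lambda_max(p, a, L)
    grid = (lo + (hi - lo) * k / steps for k in range(1, steps))
    return max(grid, key=lambda lam: profit(lam, p, a, L))


p, a, L = 0.2, 0.2, 20
lam_star = best_loading(p, a, L)
print(lam_star, coverage(lam_star, p, a, L), profit(lam_star, p, a, L))
```

The printed loading factor, coverage, and profit should land close to the values reported in the numerical example. A root-finder applied to the first-order condition would be more precise than a grid, but the grid keeps the sketch free of dependencies.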

Numerical example

The figure below describes the monopoly equilibrium when $w_0=100$, $L=20$, $p=0.2$, and $a=0.2$. My insurer best-responds to my demand schedule (the downward-sloping curve) by setting the loading factor equal to $\lambda^*=3.26$, which earns them expected profit $\pi=4.49$. At the price $\lambda^* p=0.65$, I buy $c^*=9.94$ units of coverage and enjoy $$p\int_{\lambda^*}^{\lambda_{\text{max}}}C(\lambda)\,\mathrm{d}\lambda=1.68$$

units of consumer surplus. In contrast, at the actuarially fair price $p$ I would have bought full coverage, and although my insurer would have made zero expected profit we would have avoided the deadweight loss of 2.14 generated by our inefficient risk-sharing arrangement at the monopoly equilibrium.

One way to make sense of these numbers is to compute the certainty-equivalent wealth $$CE(\lambda)=u^{-1}(EU(C(\lambda)))$$ that, if held with certainty, would give me as much utility as I expect to enjoy if I buy $C(\lambda)$ units of coverage at per-unit price $\lambda p$. Buying insurance at the monopoly equilibrium price raises my certainty equivalent wealth by $CE(\lambda^*)-CE(\lambda_{\text{max}})=1.68$, the consumer surplus I enjoy at that equilibrium. Making the premium actuarially fair would further raise my certainty-equivalent wealth by $CE(1)-CE(\lambda^*)=6.63$ but lower my insurer’s expected profit by $\pi(\lambda^*)=4.49$; the sum of our surpluses would rise by $6.63-4.49=2.14$, the deadweight loss at the monopoly equilibrium.
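The certainty-equivalent decomposition can be checked with a few lines of Python. This is a sketch under the example's parameters; the helper names are hypothetical, and the monopoly loading factor is approximated by a grid search rather than solved exactly.

```python
import math

W0, L, P, A = 100.0, 20.0, 0.2, 0.2  # parameters from the numerical example


def utility(w):
    """Exponential (CARA) utility."""
    return -math.exp(-A * w) / A


def inverse_utility(v):
    """Wealth level that delivers utility v."""
    return -math.log(-A * v) / A


def coverage(lam):
    """Demand for coverage C(lambda), truncated at zero."""
    return max(L - math.log(lam * (1 - P) / (1 - lam * P)) / A, 0.0)


def certainty_equivalent(lam):
    """Certain wealth giving the same utility as buying C(lambda) at price lambda * p."""
    c = coverage(lam)
    premium = lam * P * c
    eu = (1 - P) * utility(W0 - premium) + P * utility(W0 - premium - L + c)
    return inverse_utility(eu)


lam_max = 1 / (P + (1 - P) * math.exp(-A * L))
# Grid search for the monopoly loading factor (an illustrative shortcut).
grid = [1 + (lam_max - 1) * k / 100_000 for k in range(1, 100_000)]
lam_star = max(grid, key=lambda lam: (lam - 1) * P * coverage(lam))

consumer_surplus = certainty_equivalent(lam_star) - certainty_equivalent(lam_max)
gain_from_fair_price = certainty_equivalent(1.0) - certainty_equivalent(lam_star)
insurer_profit = (lam_star - 1) * P * coverage(lam_star)
deadweight_loss = gain_from_fair_price - insurer_profit
print(consumer_surplus, gain_from_fair_price, insurer_profit, deadweight_loss)
```

Because exponential utility has no wealth effects, the change in certainty-equivalent wealth coincides with the area under the demand curve, which is why the two consumer surplus calculations agree.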

The chart below presents some comparative statics of the monopoly equilibrium. I maintain the parameters $w_0=100$ and $L=20$ from above, but vary my risk aversion coefficient $a$ and the probability $p$ with which I incur the loss.

My insurer sets a higher loading factor and earns more profit when my risk aversion rises. This is because the mixed partial derivative $$\parfrac{^2c^*}{\lambda\partial a}=\frac{1}{a^2\lambda(1-\lambda p)}$$ is strictly positive, which means that my demand is less sensitive to price changes when $a$ is high. My insurer exploits this lower sensitivity by charging me higher prices. When $a$ is small, this exploitation moves us away from the actuarially fair equilibrium and so raises the deadweight loss; when $a$ is large, I want to buy a lot of insurance despite its high price, and so the deadweight loss is small because having the insurer bear my risk is allocatively efficient.

On the other hand, my insurer sets a lower loading factor when the probability of loss rises. This is because the mixed partial derivative $$\parfrac{^2c^*}{\lambda\partial p}=-\frac{1}{a(1-\lambda p)^2}$$ is strictly negative, which means that my demand is more sensitive to price changes when $p$ is high. My insurer responds to this sensitivity by forfeiting some of its monopoly power, moving us closer to the actuarially fair equilibrium and lowering the deadweight loss.

Limitations

One issue with my analysis is the assumption that I have exponential utility, which implies that my tolerance for, and demand for insurance against, additive risks does not depend on how rich I am. Under this assumption, I am equally willing to pay for insurance to avoid a $10 loss when I have $10 as I am when I have $10 million, which seems implausible. I could instead assume that I have isoelastic utility $$u(w)\equiv\frac{w^{1-r}-1}{1-r}$$ for some $r>0$, which would imply that my willingness to pay for insurance falls as I become richer. However, replacing exponential with isoelastic utility in the plots above delivers qualitatively identical patterns.

Another issue is the supposition that the insurer knows my demand schedule. In reality, my insurer would have imperfect information about my utility function and the parameters of my choice environment, and so would not know my demand function $C(\lambda)$. But they could estimate $C(\lambda)$ by, for example, asking how much insurance I would buy at a range of prices. They would have to be clever to prevent me from over-reporting my price-sensitivity in an attempt to get cheaper coverage, but I’m sure real-world insurers have solved this problem (at least approximately) given their financial incentives.

Dyadic dependence

Wed, 10 Feb 2021 00:00:00 +0000

Let $[n]\equiv\{1,2,\ldots,n\}$ be a set of individuals. Suppose I have data $\{(y_{ij},x_{ij}):i,j\in[n]\ \text{with}\ i<j\}$ on pairs in $[n]$ generated by the process $$\renewcommand{\epsilon}{\varepsilon} y_{ij}=x_{ij}\beta+\epsilon_{ij},$$ where $x_{ij}$ is a row vector of pair $\{i,j\}$'s characteristics, $\beta$ is a vector of coefficients to be estimated, and $\epsilon_{ij}$ is a random error term with zero mean and zero correlation with the $x_{ij}$. For example, $[n]$ could be the nodes in a network, $x_{ij}$ the dimensions along which nodes $i$ and $j$ interact, and $y_{ij}$ the outcome of such interaction.

We can rewrite the data-generating process (DGP) in matrix form as $$y=X\beta+\epsilon,$$ where $y$ is the vector of outcomes, $X$ is the design matrix, and $\epsilon$ is the vector of errors. Here $X$ has $$N\equiv\frac{n(n-1)}{2}$$ rows, each corresponding to a(n unordered) pair of individuals in $[n]$. Since the $x_{ij}$ and $\epsilon_{ij}$ are uncorrelated, the ordinary least squares estimator $$\hat\beta=(X^T\!X)^{-1}X^T\!y$$ of $\beta$ is unbiased. However, $\hat\beta$ may not be efficient because the errors $\epsilon_{ij}$ may be correlated. For example, if $$\epsilon_{ij}=u_i+u_j+v_{ij}$$ with $u_i$, $u_j$, and $v_{ij}$ independent then $$\DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Var}{Var} \Cov(\epsilon_{ij},\epsilon_{jk})=\Var(u_j).$$ Intuitively, the pairs $\{i,j\}$ and $\{j,k\}$ are linked through individual $j$, and so any errors specific to that individual affect the errors for both pairs. Consequently, the homoskedastic estimator $$\widehat{\Var}_{\text{Hom.}}(\hat\beta)=\hat\sigma^2(X^T\!X)^{-1}$$ with $$\hat\sigma^2=\frac{1}{N}\sum_{ij}\hat\epsilon_{ij}^2$$ and $$\hat\epsilon_{ij}=y_{ij}-x_{ij}\hat\beta$$ will typically under-estimate the variance in $\hat\beta$ by failing to account for linked pairs having dependent errors.

So, how can we account for such dependence? Consider the “sandwich” form $$\Var(\hat\beta)=BMB$$ of the (co)variance matrix for $\hat\beta$, where $B=(X^T\!X)^{-1}$ is the “bread” matrix and $M=X^T\!VX$ is the “meat” matrix with $V=\Var(\epsilon)$ the error (co)variance matrix. We need to estimate $M$ because we don’t observe the $\epsilon_{ij}$. Indexing pairs by $p$, the homoskedastic estimator defined above uses $$\begin{align} \hat{M}_{\text{Hom.}} &= \hat\sigma^2X^T\!X \\ &= \hat\sigma^2\sum_{p=1}^Nx_p^T\!x_p, \end{align}$$ which assumes all errors have equal variance. In contrast, White (1980) suggests using $$\begin{align} \hat{M}_{\text{White}} &= X^T\!\mathrm{diag}\left(\hat\epsilon_p^2\right)X \\ &= \sum_{p=1}^N\hat\epsilon_p^2x_p^T\!x_p, \end{align}$$ which allows for unequal error variances (heteroskedasticity). But neither $\hat{M}_{\text{Hom.}}$ nor $\hat{M}_{\text{White}}$ allows for dyadic dependence among the errors. To that end, Aronow et al. (2017) suggest augmenting White’s estimator via $$\begin{align} \hat{M}_{\text{Aronow}} &= \hat{M}_{\text{White}}+\sum_{p=1}^N\sum_{q\in\mathcal{D}(p)}\hat\epsilon_p\hat\epsilon_qx_p^T\!x_q, \end{align}$$ where $\mathcal{D}(p)$ is the set of pairs $q\not=p$ linked to $p$ by a shared individual. We can express $\hat{M}_{\text{Aronow}}$ in matrix form as $$\hat{M}_{\text{Aronow}}=X^T\!\left(D\odot\hat\epsilon\hat\epsilon^T\!\right)X,$$ where $D=(d_{pq})$ is the dyadic dependence matrix with $$d_{pq}=\begin{cases} 1 & \text{if}\ p=q\ \text{or pairs}\ p\ \text{and}\ q\ \text{are linked}\\ 0 & \text{otherwise}, \end{cases}$$ and where $\odot$ denotes element-wise multiplication. The ones on the diagonal of $D$ reproduce the $\hat{M}_{\text{White}}$ term. Aronow et al. show that, under mild conditions, $B\hat{M}_{\text{Aronow}}B$ is a consistent estimator for $\Var(\hat\beta)$ when the data exhibit dyadic dependence.¹

To see Aronow et al.’s estimator in action, suppose the DGP is given by the system $$\begin{align} y_{ij} &= \beta x_{ij}+\epsilon_{ij} \\ x_{ij} &= z_i+z_j \\ \epsilon_{ij} &= u_i+u_j+v_{ij}, \end{align}$$ where $z_i$, $z_j$, $u_i$, $u_j$ and $v_{ij}$ are iid standard normal, and $\beta=1$ is the (scalar) coefficient to be estimated. Both the $x_{ij}$ and the $\epsilon_{ij}$ exhibit dyadic dependence, so we expect the homoskedastic and White estimators to under-estimate the true variance in $\hat\beta$. Indeed, the box plots below show that Aronow et al.’s estimator is less biased than the homoskedastic and White estimators, and gets more accurate as the number of individuals $n$ grows.
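One draw from this DGP, together with the three variance estimates, can be sketched in plain Python. This is an illustrative implementation for the scalar-regressor case only (so the bread "matrix" is a single number); the function names, the seed, and the choice $n=25$ are assumptions, not the settings behind the box plots.

```python
import random
from itertools import combinations


def simulate(n, beta=1.0, seed=0):
    """Draw dyadic data with x_ij = z_i + z_j and eps_ij = u_i + u_j + v_ij."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    u = [rng.gauss(0, 1) for _ in range(n)]
    pairs = list(combinations(range(n), 2))
    x = [z[i] + z[j] for i, j in pairs]
    y = [beta * xp + u[i] + u[j] + rng.gauss(0, 1) for xp, (i, j) in zip(x, pairs)]
    return pairs, x, y


def variance_estimates(pairs, x, y):
    """Homoskedastic, White, and dyadic-robust variance estimates for the OLS slope."""
    sxx = sum(xp * xp for xp in x)
    beta_hat = sum(xp * yp for xp, yp in zip(x, y)) / sxx
    resid = [yp - beta_hat * xp for xp, yp in zip(x, y)]
    g = [xp * ep for xp, ep in zip(x, resid)]  # x_p * residual_p for each pair
    meat_hom = sum(e * e for e in resid) / len(resid) * sxx
    meat_white = sum(gp * gp for gp in g)
    # Cross terms for distinct pairs that share an individual, counted in both orders.
    cross = 2 * sum(
        g[a] * g[b]
        for a, b in combinations(range(len(pairs)), 2)
        if set(pairs[a]) & set(pairs[b])
    )
    bread = 1 / sxx
    return {
        "beta_hat": beta_hat,
        "homoskedastic": bread * meat_hom * bread,
        "white": bread * meat_white * bread,
        "aronow": bread * (meat_white + cross) * bread,
    }


pairs, x, y = simulate(n=25)
est = variance_estimates(pairs, x, y)
print(est)
```

The double loop over pairs is quadratic in the number of pairs, which is fine for small $n$ but slow for large networks; grouping the terms $x_p\hat\epsilon_p$ by individual gives the same meat estimate in linear time.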

Aronow et al.’s estimator can also be applied to generalized linear models. For example, suppose $$y_{ij}=\begin{cases} 1 & \text{if nodes}\ i\ \text{and}\ j\ \text{are adjacent} \\ 0 & \text{otherwise} \end{cases}$$ is an indicator for the event in which nodes $i$ and $j$ are adjacent in a network. We can model the link formation process as $$\Pr(y_{ij}=1)=\Lambda^{-1}(x_{ij}\beta+\epsilon_{ij}),$$ where $\Lambda(x)\equiv\log(x/(1-x))$ is the logit link function. The logistic regression estimate $\hat\beta$ of $\beta$ reveals how the observable characteristics $x_{ij}$ of nodes $i$ and $j$ determine their probability of being adjacent. We can estimate the variance of $\hat\beta$ consistently by letting $\hat{P}_{ij}=\Lambda^{-1}(x_{ij}\hat\beta)$ be the predicted probability for pair $\{i,j\}$, replacing the bread matrix $B=(X^T\!X)^{-1}$ with $$\hat{B}=\left(X^T\mathrm{diag}\left(\hat{P}_{ij}\left(1-\hat{P}_{ij}\right)\right)X\right)^{-1},$$ and computing $\hat{B}\hat{M}_{\text{Aronow}}\hat{B}$. My co-authors and I use this approach in “Research Funding and Collaboration”: we estimate how grant proposal outcomes determine the probability with which pairs of researchers co-author, and we compare $\hat\sigma^2\hat{B}$ and $\hat{B}\hat{M}_{\text{Aronow}}\hat{B}$ to show that our inferences are robust to dyadic dependence.

Fafchamps and Gubert (2007) describe a similar variance estimator to Aronow et al. but do not establish its consistency. ↩︎

Assortative mixing

Tue, 02 Feb 2021 00:00:00 +0000

Let $N$ be a network with $n$ nodes, each of which has a “type” belonging to some set $T$. We say that $N$ is “assortatively mixed” if nodes tend to have the same types as their neighbors. For example, if $N$ is a social network and $T$ is a set of interests, then assortative mixing could arise because friends tend to share interests.

How can we measure the extent of assortative mixing in $N$? Newman (2003) suggests the “assortativity coefficient” $$r=\frac{\sum_{t\in T}x_{tt}-\sum_{t\in T}y_t^2}{1-\sum_{t\in T}y_t^2},$$ where $x_{st}$ is the proportion of edges joining nodes of type $s$ to nodes of type $t$, and where $$y_t=\sum_{s\in T}x_{st}$$ is the proportion of edges incident with nodes of type $t$. The coefficient $r$ varies between -1 and 1, and takes larger values when $N$ is more assortatively mixed. We say that $N$ is “positively sorted” if $r>0$ and “negatively sorted” if $r<0$.

We can interpret $r$ by thinking about the “mixing matrix” $X=(x_{st})$. The numerator of $r$ equals the sum of diagonal entries of $X$ minus what that sum would be if the distributions of entries across rows and columns were independent. The denominator of $r$ is a normalizing constant ensuring $\lvert r\rvert\le1$. Thus $r$ indexes the frequency of within-type edges in $N$ relative to the frequency we would expect in a random network with the same proportion of edges incident with each type.
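As a concrete illustration, the coefficient can be computed directly from an edge list and a type label for each node. The sketch below is a minimal Python version for undirected networks; the function name and the toy networks are made up for illustration.

```python
from collections import defaultdict


def assortativity(edges, node_type):
    """Newman's assortativity coefficient r for an undirected network."""
    x = defaultdict(float)  # mixing matrix entries x[s, t]
    for i, j in edges:
        s, t = node_type[i], node_type[j]
        # Count each undirected edge once in each direction so x is symmetric.
        x[s, t] += 0.5 / len(edges)
        x[t, s] += 0.5 / len(edges)
    types = {t for pair in x for t in pair}
    y = {t: sum(x[s, t] for s in types) for t in types}  # column sums of x
    trace = sum(x[t, t] for t in types)
    expected = sum(v * v for v in y.values())
    return (trace - expected) / (1 - expected)


node_type = {0: "a", 1: "a", 2: "b", 3: "b"}
print(assortativity([(0, 1), (2, 3)], node_type))  # only within-type edges
print(assortativity([(0, 2), (1, 3)], node_type))  # only between-type edges
```

The two toy networks hit the extremes of the coefficient: the first has only within-type edges and the second has only between-type edges. The function is undefined when every edge joins nodes of a single type, because the denominator is then zero.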

As an example, suppose $N$ is a realization of the planted partition model with $n_1$ nodes of type 1, $n_2=n-n_1$ nodes of type 2, and some proportion $$p_{st}=\begin{cases} p & \text{if}\ s=t \\ q & \text{otherwise} \end{cases}$$ of edges joining nodes of type $s$ to nodes of type $t$. Then $N$ has assortativity coefficient $$r=\frac{p^2(n_1-1)(n_2-1)-q^2n_1n_2}{p^2(n_1-1)(n_2-1)+pq(n_1(n_1-1)+n_2(n_2-1))+q^2n_1n_2},$$ which equals -1 if $p=0$ and $q>0$ (i.e., there are no within-type edges), and equals 1 if $p>0$ and $q=0$ (i.e., there are no between-type edges). If $p=q$ then $$r=-\frac{1}{n-1},$$ which converges to zero from below as $n$ becomes large. Intuitively, if $p=q$ then within-type and between-type edges occur at the same rate, but the network is slightly negatively sorted because there are slightly fewer potential within-type edges than potential between-type edges.

If $n_1=n_2$ then the numerator and denominator of the general expression share the factor $p(n-2)+qn$, leaving $$r=\frac{p(n-2)-qn}{p(n-2)+qn},$$ which converges to $(p-q)/(p+q)$ as $n$ becomes large. The figure below demonstrates this case with $n_1=n_2=25$. The network on the left has edge frequencies $(p,q)=(0.15,0.02)$ and assortativity coefficient $r=0.75$; the network on the right has edge frequencies $(p,q)=(0.02,0.15)$ and assortativity coefficient $r=-0.79$. Both networks are drawn so that adjacent nodes are closer together. Nodes in the positively sorted network tend to have neighbors with the same type, while nodes in the negatively sorted network tend to have neighbors with a different type.
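The general planted partition expression for $r$ can be evaluated at the figure's parameters with a short Python function. This is only a sketch; the function name is an illustrative choice.

```python
def expected_assortativity(p, q, n1, n2):
    """Assortativity coefficient implied by the general planted partition formula."""
    within = p ** 2 * (n1 - 1) * (n2 - 1)
    mixed = p * q * (n1 * (n1 - 1) + n2 * (n2 - 1))
    between = q ** 2 * n1 * n2
    return (within - between) / (within + mixed + between)


print(expected_assortativity(0.15, 0.02, 25, 25))
print(expected_assortativity(0.02, 0.15, 25, 25))
print(expected_assortativity(0.10, 0.10, 10, 30), -1 / 39)
```

The first two values come out at roughly 0.76 and -0.77. The networks in the figure are random realizations, so their coefficients need not match these values exactly. The last line checks the $p=q$ case against $-1/(n-1)$ with $n=40$.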

The assortativity coefficient $r$ can be used when $T$ is a set of categorical types. In contrast, if $T$ is a set of scalar quantities then we can measure the extent of assortative mixing via the Pearson correlation coefficient $$\DeclareMathOperator{\E}{E} \DeclareMathOperator{\Var}{Var} \DeclareMathOperator{\Cov}{Cov} \rho=\frac{\Cov(t_i,t_j)}{\sqrt{\Var(t_i)\Var(t_j)}},$$ where $t_i\in T$ and $t_j\in T$ are the “types” of nodes $i$ and $j$, and where (co)variances are computed with respect to the frequency at which nodes of type $t_i$ and $t_j$ are adjacent in the network.¹ To see how this works, let $A=(a_{ij})$ be the $n\times n$ adjacency matrix for $N$ and let $W=(w_{ij})$ be the $n\times n$ “weighting matrix” with entries $$w_{ij}=\frac{a_{ij}}{\lVert A\rVert},$$ where $\lVert A\rVert$ denotes the sum of elements in $A$. Then the vector $t=(t_1,t_2,\ldots,t_n)$ of node types has mean $$\E[t]=s^Tt,$$ where $s=(s_1,s_2,\ldots,s_n)$ is the vector of row sums $$s_i=\sum_{j=1}^nw_{ij}.$$ Intuitively, $s$ describes the probability mass function for the (marginal) distribution of node types. Treating $t_i$ and $t_j$ as draws from this distribution, we have $$\begin{align*} \Cov(t_i,t_j) &= \E[t_it_j]-\E[t_i]\E[t_j] \\ &= \sum_{i=1}^n\sum_{j=1}^nw_{ij}t_it_j-(s^Tt)(s^Tt) \\ &= t^TWt-(s^Tt)^2 \end{align*}$$ and similarly $$\begin{align*} \Var(t_i) &= \E[t_i^2]-\E[t_i]^2 \\ &= \sum_{i=1}^ns_it_i^2-(s^Tt)^2 \\ &= t^TSt-(s^Tt)^2, \end{align*}$$ where $S$ is the $n\times n$ matrix with principal diagonal equal to $s$ and off-diagonal entries equal to zero. Then $$\rho=\frac{t^TWt-(s^Tt)^2}{t^TSt-(s^Tt)^2}.$$ For example, if the nodes in $N$ are arranged such that $$a_{ij}=\begin{cases}1 & \text{if}\ t_i=t_j \\ 0 & \text{otherwise} \end{cases}$$ then $$\begin{align*} t^TWt &= \sum_{i=1}^n\sum_{j=1}^nw_{ij}t_it_j \\ &= \sum_{i=1}^nt_i^2\sum_{j=1}^nw_{ij} \\ &= \sum_{i=1}^nt_i^2s_i \\ &= t^TSt \end{align*}$$ and so $\rho=1$; that is, if all adjacent nodes have the same scalar type then the coefficient $\rho$ obtains its maximum value of unity.
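The expression for $\rho$ translates into a few lines of Python. The sketch below assumes an undirected network without self-loops, stored as an edge list, so that each edge contributes two entries to the adjacency matrix; the function name and test networks are illustrative.

```python
def scalar_assortativity(edges, t):
    """Correlation of scalar types across adjacent nodes in an undirected network."""
    n = len(t)
    total = 2 * len(edges)  # sum of the entries of the adjacency matrix
    s = [0.0] * n           # row sums of the weighting matrix W
    cross = 0.0             # the quadratic form t' W t
    for i, j in edges:
        s[i] += 1 / total
        s[j] += 1 / total
        cross += 2 * t[i] * t[j] / total
    mean = sum(si * ti for si, ti in zip(s, t))
    second_moment = sum(si * ti ** 2 for si, ti in zip(s, t))
    return (cross - mean ** 2) / (second_moment - mean ** 2)


# Adjacent nodes always share a type, so the coefficient is (numerically) one.
print(scalar_assortativity([(0, 1), (2, 3)], [1, 1, 5, 5]))

# A star network with types equal to degrees: the hub only meets leaves.
print(scalar_assortativity([(0, 1), (0, 2), (0, 3)], [3, 1, 1, 1]))
```

The star network is the opposite extreme: every edge joins the highest-degree node to a lowest-degree node, so the degree correlation is as negative as it can be.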

One common use of the correlation coefficient $\rho$ is to measure assortativity with respect to nodes’ degrees (see, e.g., Newman, 2002). For example, the left-hand network in the figure above has $\rho=0.03$: although nodes are sorted strongly by color, they are approximately unsorted by degree because the planted partition model from which the network is generated has no mechanism for connecting high-degree nodes. Performing a degree-preserving randomization of the network changes its assortativity with respect to nodes’ degrees by changing the joint distribution of those degrees across node pairs:

Numerical experimentation suggests $r=\rho$ whenever $\lvert T\rvert=2$, which I prove here. ↩︎

Ordinary and total least squares

Mon, 11 Jan 2021 00:00:00 +0000

Suppose $X$ and $Y$ are random variables with $$\DeclareMathOperator{\E}{E} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\Var}{Var} \newcommand{\abs}[1]{\lvert#1\rvert} Y=\beta X+u,$$ where $u$ has zero mean and zero correlation with $X$. The coefficient $\beta$ can be estimated by collecting data $(Y_i,X_i)_{i=1}^n$ and regressing the $Y_i$ on the $X_i$. Now suppose our data collection procedure is flawed: instead of observing $X_i$, we observe $Z_i=X_i+v_i$, where the $v_i$ are iid with zero mean and zero correlation with the $X_i$. Then the ordinary least squares (OLS) estimate $\hat\beta_{\text{OLS}}$ of $\beta$ obtained by regressing the $Y_i$ on the $Z_i$ suffers from attenuation bias: $$\begin{align*} \DeclareMathOperator*{\plim}{plim} \plim_{n\to\infty}\hat\beta_{\text{OLS}} &=\frac{\Cov(Y,Z)}{\Var(Z)} \\ &=\frac{\Cov(\beta X+u,X+v)}{\Var(X+v)} \\ &= \frac{\beta\Var(X)}{\Var(X)+\Var(v)} \\ &= \frac{\beta}{1+\Var(v)/\Var(X)} \end{align*}$$ and so $\abs{\hat\beta_{\text{OLS}}}<\abs{\beta}$ asymptotically whenever $\Var(v)>0$. Intuitively, the measurement errors $v_i$ spread out the independent variable, flattening the fitted regression line.
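A quick simulation makes the attenuation formula concrete. The Python sketch below assumes normally distributed $X$, $u$, and $v$ with unit variances (so the probability limit is $\beta/2$); the sample size, seed, and helper name are arbitrary illustrative choices.

```python
import random


def ols_slope(x, y):
    """Slope from a least-squares regression of y on x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(1)
n, beta, var_v = 50_000, 1.0, 1.0
X = [rng.gauss(0, 1) for _ in range(n)]
Y = [beta * xi + rng.gauss(0, 1) for xi in X]
Z = [xi + rng.gauss(0, var_v ** 0.5) for xi in X]  # mismeasured regressor

slope_true_regressor = ols_slope(X, Y)
slope_noisy_regressor = ols_slope(Z, Y)
plim = beta / (1 + var_v / 1.0)  # Var(X) = 1 in this simulation
print(slope_true_regressor, slope_noisy_regressor, plim)
```

With these settings the slope on the true regressor should be near one, while the slope on the noisy regressor should be near the probability limit of one half.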

One way to reduce attenuation bias is to replace OLS with total least squares (TLS), which accounts for noise in the dependent and independent variables. As a demonstration, the chart below compares the OLS and TLS lines of best fit through some randomly generated data $(Y_i,Z_i)_{i=1}^n$ with $\beta=1$. The OLS estimate $\hat\beta_{\text{OLS}}=0.43$ minimizes the sum of squared vertical deviations of the data from the fitted line. In contrast, the TLS estimate $\hat\beta_{\text{TLS}}=0.95$ minimizes the sum of squared perpendicular deviations of the data from the fitted line. For these data, the TLS estimate is unbiased because $u$ and $v$ have the same variance.
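For a single regressor, the TLS slope has a closed form in terms of the centered sums of squares and cross-products. The Python sketch below uses that standard orthogonal-regression formula and assumes the fitted line includes an intercept; it is not necessarily how the chart's estimates were computed, and the simulated data are illustrative.

```python
import math
import random


def tls_slope(x, y):
    """Slope of the line minimizing squared perpendicular deviations (needs sxy != 0)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)


# Points on an exact line are recovered exactly.
exact = tls_slope([0, 1, 2, 3], [1, 3, 5, 7])

# Equal noise variances in both variables, true slope of one.
rng = random.Random(2)
X = [rng.gauss(0, 1) for _ in range(50_000)]
Y = [xi + rng.gauss(0, 1) for xi in X]
Z = [xi + rng.gauss(0, 1) for xi in X]
noisy = tls_slope(Z, Y)
print(exact, noisy)
```

In the simulated data the noise variances are equal, so the TLS slope should sit close to the true value of one even though the regressor is mismeasured.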

However, if $u$ and $v$ have different variances then the TLS estimate of $\beta$ is biased. I demonstrate this phenomenon in the chart below, which compares the OLS and TLS estimates of $\beta=1$ for varying $\Var(u)$ and $\Var(v)$ when $X$ is standard normal. I plot the bias $\E[\hat\beta-\beta]$ and mean squared error $\E[(\hat\beta-\beta)^2]$ of each estimate $\hat\beta\in\{\hat\beta_{\text{OLS}},\hat\beta_{\text{TLS}}\}$, obtained by simulating the data-generating process 100 times for each $(\Var(u),\Var(v))$ pair.

If $\Var(u)>\Var(v)$ then the TLS estimate $\hat\beta_{\text{TLS}}$ is biased upward because the data are relatively stretched vertically; if $\Var(u)<\Var(v)$ then $\hat\beta_{\text{TLS}}$ is biased downward because the data are relatively stretched horizontally. The OLS estimate is biased downward whenever $\Var(v)>0$ due to attenuation. The TLS estimate is less biased and has smaller mean squared error than the OLS estimate when $\Var(u)<\Var(v)$, suggesting that TLS generates “better” estimates than OLS when the measurement errors $v_i$ are relatively large.

One problem with TLS estimates is that they depend on the units in which variables are measured. For example, suppose $Y_i$ is person $i$'s weight and $Z_i$ is their height. If I measure $Y_i$ in pounds, generate a TLS estimate $\hat\beta_{\text{TLS}}$, use this estimate to predict the weight in pounds of someone six feet tall, and then convert my prediction to kilograms, I get a different result than if I had measured $Y_i$ in kilograms initially. This unit-dependence arises because rescaling the dependent variable affects each perpendicular deviation differently.

In contrast, OLS-based predictions do not depend on the units in which I measure $Y_i$. Rescaling the dependent variable multiplies each vertical deviation by the same constant, leaving the squared deviation-minimizing coefficient unchanged.
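The unit-dependence is easy to see numerically. The Python sketch below fits both lines to a small made-up sample of heights and weights (the numbers are hypothetical, chosen only for illustration) and predicts the weight of someone six feet tall, once by fitting in pounds and converting, and once by fitting in kilograms directly.

```python
import math

LB_TO_KG = 0.45359237


def fit(x, y, method):
    """Return (slope, intercept) of the OLS or TLS line through the data."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if method == "ols":
        slope = sxy / sxx
    else:  # orthogonal-regression (TLS) slope
        slope = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    return slope, my - slope * mx


def predict(x, y, method, at=6.0):
    slope, intercept = fit(x, y, method)
    return intercept + slope * at


heights_ft = [5.0, 5.5, 6.0, 6.5, 5.8]            # hypothetical data
weights_lb = [120.0, 160.0, 170.0, 200.0, 150.0]  # hypothetical data
weights_kg = [w * LB_TO_KG for w in weights_lb]

gaps = {}
for method in ("ols", "tls"):
    via_pounds = predict(heights_ft, weights_lb, method) * LB_TO_KG
    direct_kg = predict(heights_ft, weights_kg, method)
    gaps[method] = abs(via_pounds - direct_kg)
    print(method, via_pounds, direct_kg)
```

The OLS predictions agree up to floating-point error, while the TLS predictions differ. The gap is small here because weights vary far more than heights in either unit, but it grows as the two variables' scales become more comparable.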

Auctioning vaccines

Thu, 17 Dec 2020 00:00:00 +0000

Pancs (2020) proposes an auction for vaccines in which people can bid on others’ behalf. This format allows people to internalize the externalities they enjoy from their peers being vaccinated.

For example, suppose there are two vaccines to be allocated among agents A–H, who are connected socially via the network shown below.

Everyone submits bids totaling $60, spread evenly among themselves and their peers. For example, agent A bids $30 each towards vaccinating themself and agent B, while agent B bids $12 each towards vaccinating themself and agents A, C, D, and E. Intuitively, agent A values vaccinating B highly because it protects A fully from viruses transmitted among agents C–H. In contrast, B has more peers and so values vaccinating any one of those peers less because it doesn’t protect B fully from the rest of the network.

The “aggregate bid” for each agent equals the sum of bids submitted towards that agent’s vaccination. The agents with the highest aggregate bids receive the vaccines. In this example, agents B and F receive the vaccine, with aggregate bids equal to $94 and $87.

Each agent receives surplus equal to their subjective valuation of the vaccine allocation minus their payment towards that allocation’s provision. This payment equals the increase in aggregate surplus that other agents would receive if the agent’s bids were ignored. Thus, the vaccine auction is a type of Vickrey-Clarke-Groves (VCG) auction in which each agent pays the harm they inflict on other agents. Consequently, the vaccine auction inherits the properties of VCG auctions; in particular, bids equal subjective valuations. This property makes it easy to compute pre-payment surpluses: simply sum each agent’s bids towards vaccinated agents.

The table below presents the aggregate bid for, payment made by, and surplus delivered to each agent under the optimal vaccine allocation. Agents B and F don’t have to pay for the vaccines they receive because others are willing to pay on their behalf. Agent A pays $15 because their bid towards vaccinating B shifts the optimal allocation away from E, which lowers F’s surplus by $15. Likewise, agents G and H pay because their preference to vaccinate F, rather than E, makes B–D worse off.

Agent	Aggregate bid ($)	Payment ($)	Surplus ($)
A	42	15	15
B	94	0	12
C	44	0	20
D	44	0	20
E	79	0	24
F	87	0	15
G	45	22	8
H	45	22	8
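The table can be reproduced with a short Python sketch of the auction. The network figure is not shown here, so the peer lists below are a reconstruction: a hypothetical edge list chosen because it is consistent with every number in the table, which may differ from the original figure. The function names are also illustrative.

```python
# Hypothetical network, reconstructed to match the table above.
PEERS = {
    "A": ["B"],
    "B": ["A", "C", "D", "E"],
    "C": ["B", "E"],
    "D": ["B", "E"],
    "E": ["B", "C", "D", "F"],
    "F": ["E", "G", "H"],
    "G": ["F"],
    "H": ["F"],
}
BUDGET, VACCINES = 60, 2

# Each agent spreads their budget evenly over themself and their peers.
bids = {
    agent: {target: BUDGET / (len(peers) + 1) for target in [agent] + peers}
    for agent, peers in PEERS.items()
}


def aggregate_bids(excluded=None):
    """Sum the bids towards each agent, optionally ignoring one bidder."""
    totals = {agent: 0.0 for agent in PEERS}
    for bidder, own_bids in bids.items():
        if bidder != excluded:
            for target, amount in own_bids.items():
                totals[target] += amount
    return totals


def allocate(totals):
    """Give the vaccines to the agents with the highest aggregate bids."""
    return sorted(totals, key=lambda agent: (-totals[agent], agent))[:VACCINES]


totals = aggregate_bids()
winners = allocate(totals)
payments, surpluses = {}, {}
for agent in PEERS:
    others = aggregate_bids(excluded=agent)
    best_without = sum(others[w] for w in allocate(others))  # others' best outcome
    actual_without = sum(others[w] for w in winners)         # others' actual outcome
    payments[agent] = best_without - actual_without
    surpluses[agent] = sum(bids[agent].get(w, 0.0) for w in winners) - payments[agent]

for agent in PEERS:
    print(agent, totals[agent], payments[agent], surpluses[agent])
```

Each payment is the standard VCG externality: the welfare that the other agents would obtain if the agent's bids were ignored, minus the welfare they obtain under the chosen allocation.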

This example departs from reality in two important ways. First, I assume each agent’s bids sum to a constant ($60). This assumption is obviously unrealistic: wealth inequality means some people can afford to submit higher bids than others, which may lead to inequitable vaccine allocations. Moreover, people may vary in their willingness to pay for vaccines independently of the variation in their wealths.

Second, I assume every agent wants to be vaccinated. This common desire may not hold in reality: some people may prefer not to be vaccinated because they fear potential side-effects. Such people may refuse to participate in the auction, reducing social welfare by preventing some externalities from being internalized.

Gift exchange mechanisms

Sun, 13 Dec 2020 00:00:00 +0000

Last December I compared strategies for playing white elephant, a game in which people take turns either unwrapping a gift or stealing a previously unwrapped gift. It turned out that players’ best strategy was to be “greedy” by stealing the most subjectively valuable unwrapped gift. Intuitively, this strategy helps players obtain the gift they want most, provided no other players also want that gift and steal it later in the game.

White elephant exchanges are a fun, but not necessarily optimal, way to match people with gifts. Another way is to use the top trading cycle (TTC) algorithm:

1. Give everyone a random unwrapped gift.
2. Ask everyone to point at the most subjectively valuable remaining gift (which may be their own).
3. If there is a closed cycle of people pointing at each other’s gifts, give everyone in that cycle the gift at which they’re pointing, and remove those people and gifts from consideration.
4. If there are no gifts remaining then stop. Otherwise, return to step 2.
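The steps above can be sketched in Python as follows. The sketch assumes person $i$ starts with gift $i$ and that `values[i][g]` is person $i$'s subjective value of gift $g$; it clears one cycle per pass rather than all cycles at once, which leads to the same final allocation. The function name and the tie-breaking rule (prefer the lower-numbered gift) are illustrative choices.

```python
def top_trading_cycles(values):
    """Return a dict mapping each person to the gift they receive under TTC."""
    n = len(values)
    remaining = set(range(n))  # people (and their endowed gifts) still in play
    allocation = {}
    while remaining:
        # Everyone points at their favorite remaining gift (ties go to lower index).
        points_to = {
            i: max(remaining, key=lambda g: (values[i][g], -g)) for i in remaining
        }
        # Follow the pointers from any person until someone repeats: that is a cycle.
        path = []
        person = min(remaining)
        while person not in path:
            path.append(person)
            person = points_to[person]  # gift g is held by person g
        cycle = path[path.index(person):]
        for member in cycle:
            allocation[member] = points_to[member]
        remaining -= set(cycle)
    return allocation


values = [
    [1, 3, 2],  # person 0 likes gift 1 best
    [3, 1, 2],  # person 1 likes gift 0 best
    [1, 2, 3],  # person 2 likes their own gift best
]
print(top_trading_cycles(values))
```

In this toy example, persons 0 and 1 swap gifts and person 2 keeps their own. Because every remaining person points at some remaining gift, following the pointers always reaches a cycle, so each pass removes at least one person and the loop terminates.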

The allocation delivered by this algorithm has several desirable properties. First, it is Pareto efficient: every cycle identifies a mutually beneficial exchange, and the algorithm stops when no such exchanges remain. Second, it is strategy-proof: people cannot get better gifts by lying about their preferences (see Roth, 1982). Third, it is core-stable: no group of people can cooperate to improve their allocations, for otherwise they would have formed a cycle before the algorithm stopped.

However, the TTC algorithm may not deliver the allocation that maximizes the sum of gifts’ subjective values. This allocation corresponds to a maximum-weight matching in the bipartite graph connecting people to gifts, with each edge’s weight equal to the incident player’s subjective value of the incident gift.¹

The chart below compares the mean subjective value of the gifts allocated using a game of white elephant, using the TTC algorithm, and by finding a maximum-weight matching. I compute these allocations as follows. First, I define person $i$'s subjective value of gift $j$ as $$V_{ij}=\rho X_j+(1-\rho)Y_{ij},$$ where $X_j$ and $Y_{ij}$ are iid uniformly distributed on the unit interval. The parameter $\rho$ determines the correlation of gifts’ subjective values across people: if $\rho=0$ then everyone’s valuations are independent, whereas if $\rho=1$ then everyone has the same valuation of each gift. For a range of $\rho$ values, I simulate 100 valuation sets $\{V_{ij}:i,j\in\{1,2,\ldots,30\}\}$, and apply each gift exchange mechanism to each set. In the white elephant games, I assume all players adopt the greedy strategy described above unless the best unwrapped gift has subjective value less than $\mathrm{E}[V_{ij}]=0.5$, in which case players unwrap a new gift.
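The valuation model and the maximum-weight benchmark can be sketched in Python for a small group. The sketch below uses six people and a brute-force search over all assignments, which is purely illustrative: with 30 people that search is infeasible and an assignment-problem solver (such as the Hungarian algorithm) would be needed instead. The function names and seed are assumptions.

```python
import random
from itertools import permutations


def simulate_values(n, rho, rng):
    """Draw V[i][j] = rho * X[j] + (1 - rho) * Y[i][j] with uniform X and Y."""
    common = [rng.random() for _ in range(n)]  # X_j, shared across people
    return [
        [rho * common[j] + (1 - rho) * rng.random() for j in range(n)]
        for _ in range(n)
    ]


def mean_value(values, allocation):
    """Mean subjective value when person i receives gift allocation[i]."""
    return sum(values[i][g] for i, g in enumerate(allocation)) / len(values)


def max_weight_matching(values):
    """Brute-force search for the allocation with the highest total value."""
    n = len(values)
    return max(permutations(range(n)), key=lambda alloc: mean_value(values, alloc))


rng = random.Random(0)
results = {}
for rho in (0.0, 0.5, 1.0):
    values = simulate_values(6, rho, rng)
    endowment = tuple(range(6))  # person i starts with gift i
    best = max_weight_matching(values)
    results[rho] = (mean_value(values, endowment), mean_value(values, best))
    print(rho, results[rho])
```

When $\rho=1$ every allocation has the same total value, so the benchmark cannot improve on the random endowment; when $\rho=0$ there is the most room for improvement.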

All three gift exchange mechanisms get worse as gifts’ subjective values become more correlated. Intuitively, as the correlation increases, there are fewer Pareto-improving trades and so people get stuck with their random endowments.² The allocations delivered via white elephant games and the TTC algorithm have similar allocative efficiencies, even though white elephant players can’t assign subjective values to gifts until they are unwrapped.

Yet white elephant games are much more popular at Christmas parties than the TTC algorithm. One explanation could be that the algorithm tends to reveal a lot of information about people’s preferences and, in particular, may make people more upset about contributing a gift no-one wants. I justify this claim in the following chart, which plots the number of times someone rejects each gift for another in my simulated exchanges. For example, I add one to gift A’s rejection count if

a white elephant player could steal gift A but instead steals gift B, or
I’m running the TTC algorithm and someone could point at gift A but instead points at gift B.

Intuitively, these rejection events reveal that gift A has subjectively lower value than other gifts, and the more often this happens the more likely is the person who contributed gift A to feel bad about their contribution.

Most Christmas parties set a target amount to be spent on each gift, so—to the extent that cost correlates positively with value—the empirically relevant region of the chart is where the correlation of subjective values is high. In this region, running the TTC algorithm tends to generate many more rejection events than running a game of white elephant. Intuitively, if the correlation of subjective values is high then people will tend to all point at the same gifts, there will be fewer cycles, more iterations will be required before the TTC algorithm stops, and hence the algorithm will force people to reveal more about their preferences as the market slowly clears. On the other hand, the unobservability of wrapped gifts’ subjective values means that white elephant players have fewer opportunities to reveal their preferences, regardless of whether those preferences are shared by other players.

Thanks to Mohamad Adhami, Nick Cao, and Spencer Pantoja for commenting on a draft version of this post.

The maximum-weight matching is hard to find in practice because it requires complete information about people’s preferences. In contrast, white elephant games and the TTC algorithm elicit people’s preferences by asking them to choose explicitly which gifts they want. ↩︎
In white elephant games, the randomness comes from the order in which people take their turns choosing whether to unwrap or steal. ↩︎

Aggregating preferences: Bird of the Year edition

Sun, 29 Nov 2020 00:00:00 +0000

Earlier this month the kākāpō was elected Bird of the Year for 2020. The news prompted me to review the results of last year’s election, in which the kākāpō lost narrowly to the yellow-eyed penguin. In particular, I wanted to determine whether the 2019 results were sensitive to the method used to aggregate voters’ preferences. This post summarises my findings: different methods deliver (slightly) different outcomes, and at least one method would have crowned the kākāpō.

Bird of the Year elections run as follows. Each voter selects up to five birds, ranks their selections in order of preference, and submits their ranking on the election website. These submissions determine the winning bird via the instant-runoff (IR) method:

Count the ballots on which each bird is ranked first.
If one bird is ranked first on a majority of ballots then elect it. Otherwise, eliminate the bird ranked first on the fewest ballots and return to step 1.
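
The two steps above can be sketched in base R. This is a minimal illustration on hypothetical toy ballots, not the official counting code: the function name `instant_runoff` and the ballots are illustrative assumptions, and "majority" is assumed to mean more than half of the ballots that still list a remaining bird.

```r
# Hypothetical toy ballots: each ballot ranks birds from most to least preferred
ballots <- c(
  rep(list(c("penguin", "kakapo")), 4),
  rep(list(c("kakapo", "kea")), 3),
  rep(list(c("kea", "kakapo")), 2)
)

instant_runoff <- function(ballots) {
  remaining <- unique(unlist(ballots))
  repeat {
    # Step 1: count the ballots on which each remaining bird is ranked first
    firsts <- vapply(ballots, function(b) {
      b <- b[b %in% remaining]
      if (length(b) > 0) b[1] else NA_character_  # NA marks an exhausted ballot
    }, character(1))
    counts <- table(factor(firsts, levels = remaining))
    # Step 2: elect a bird holding a majority of non-exhausted ballots (assumption)...
    if (max(counts) > sum(counts) / 2) {
      return(names(counts)[which.max(counts)])
    }
    # ...otherwise eliminate the bird ranked first on the fewest ballots
    # (ties are broken by order of appearance, another assumption)
    remaining <- setdiff(remaining, names(counts)[which.min(counts)])
  }
}

instant_runoff(ballots)  # "kakapo"
```

On these toy ballots the penguin leads on first choices (four of nine), but eliminating the kea transfers its two ballots to the kākāpō, which then holds five of nine ballots.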

Using the IR method, rather than a plurality vote (in which the bird listed first on the most ballots wins), mitigates vote-splitting because voters can list multiple birds on their ballots. However, the IR method violates the Condorcet criterion: a bird may lose the election even if it would beat every other bird in a head-to-head plurality vote. One way to satisfy this criterion is to use Copeland’s method, which ranks birds by the number of pairwise plurality votes they win minus the number of such votes they lose.
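
To make Copeland’s method concrete, the sketch below scores three birds on hypothetical toy ballots. It assumes that a listed bird beats any bird missing from the same ballot, and that ballots listing neither bird abstain from that pairwise vote; the post does not say how the real data treat such ballots, so both rules are assumptions.

```r
# Hypothetical toy ballots: each ballot ranks birds from most to least preferred
ballots <- c(
  rep(list(c("penguin", "kakapo")), 4),
  rep(list(c("kakapo", "kea")), 3),
  rep(list(c("kea", "kakapo")), 2)
)
birds <- unique(unlist(ballots))

# Does a ballot rank bird a above bird b? Unlisted birds rank last (assumption)
prefers <- function(ballot, a, b) {
  pos <- match(c(a, b), ballot)
  !is.na(pos[1]) && (is.na(pos[2]) || pos[1] < pos[2])
}

# Copeland score: pairwise votes won minus pairwise votes lost
copeland_scores <- vapply(birds, function(a) {
  others <- setdiff(birds, a)
  margins <- vapply(others, function(b) {
    sum(vapply(ballots, prefers, logical(1), a = a, b = b)) -
      sum(vapply(ballots, prefers, logical(1), a = b, b = a))
  }, numeric(1))
  sum(margins > 0) - sum(margins < 0)
}, numeric(1))

copeland_scores  # penguin: -2, kakapo: 2, kea: 0
```

Here the penguin has the most first rankings but loses both of its head-to-head votes, while the kākāpō wins both of its votes and so is the Condorcet winner.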

The IR method and Copeland’s method both rely on noiseless within-ballot rankings. I suspect this property does not hold for Bird of the Year elections. After selecting up to five birds, voters are asked to rearrange their selections from most to least preferred before submitting their ballots. It seems likely that this rearrangement does not occur, either because voters can’t be bothered or because they are approximately indifferent among their selections. In either case, voters’ preferences might be better aggregated using an approval-based system: each bird earns one point for each ballot appearance, and the bird with the most points wins.

One obvious problem with the approval-based system is that voters may approve of more than five birds, but cannot signal such approval because the “up to five” constraint binds. On the other hand, some voters may feel obliged to list five birds on their ballots even if they approve of only four birds or fewer.¹ The most defensible way to deal with these possibilities seems (to me) to be to use a plurality vote, which assumes the minimal completeness of voters’ individual preferences by treating only their first choices as informative.²

The table below presents the top-placing birds in the 2019 election using the IR method, and those birds’ places under the other preference aggregation methods described above. The kākāpō was actually the Condorcet winner; it would have beaten every other bird in a head-to-head plurality vote. Nevertheless the IR method crowned the yellow-eyed penguin, as would have the approval-based system and a simple plurality vote.

Bird	IR place	Copeland place	Approval place	Plurality place
Yellow-eyed penguin	1	4	1	1
Kākāpō	2	1	2	2
Black Robin	3	2	3	5
Banded Dotterel	4	8	5	3
Fantail	5	12	9	4
New Zealand Falcon	6	10	10	9
Kererū	7	11	11	8
Blue Duck	8	9	8	7
Kea	9	6	6	10
Kākā	10	3	4	11

The figure below compares all candidate birds’ places using the IR method to their places obtained using the alternative methods. The IR method delivers results most similar to a plurality vote and least similar to Copeland’s method, as shown by the relative deviations of points from the 45-degree line. These patterns suggest that voters’ second through fifth choices for Bird of the Year didn’t affect the 2019 election outcome materially.

Of the 43,460 ballots cast in last year’s election, 91.3% listed five birds, 1.4% listed four birds, 1.2% listed three birds, 0.8% listed two birds, and 5.2% listed one bird. ↩︎
Nominating a “first choice” requires only that a voter can identify at least one bird that they prefer to at least one other bird. ↩︎

Polarized beliefs in social networks

Thu, 29 Oct 2020 00:00:00 +0000

Suppose 50 people each have four friends. Everyone believes that some proposition—say, “corporate tax rates should be higher”—is either true or false, with equal probability and independently of everyone else. Consequently, the social network among the 50 people is unsorted with respect to people’s beliefs. However, the network’s structure changes over time, in discrete time steps, according to two rules:

everyone updates their belief to match the majority within their friend group (comprising themselves and their neighbours in the network), defaulting to their previous belief to break ties;
edges appear between people who hold the same belief and disappear between people who hold different beliefs, both with probability 0.01.
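
The two rules above can be sketched in base R. This is an illustration under stated assumptions rather than the code behind the figures: the starting network is assumed to be a ring in which everyone befriends their two nearest neighbours on each side, peer selection is assumed to use the beliefs that result from the same step's update, and `update_network` and `same_share` are hypothetical helper names. The share of edges joining like-minded people is used as a simple stand-in for the assortativity coefficient.

```r
set.seed(0)
n <- 50

# Assumed starting network: a ring that gives everyone exactly four friends
idx <- seq_len(n)
gap <- abs(outer(idx, idx, "-"))
ring_dist <- pmin(gap, n - gap)
adj <- (ring_dist == 1 | ring_dist == 2) * 1
belief <- rbinom(n, 1, 0.5)  # 1 = "true", 0 = "false"

update_network <- function(adj, belief, prob = 0.01) {
  n <- length(belief)
  # Rule 1: adopt the majority belief within own friend group (self plus
  # neighbours), keeping the previous belief when the group is tied
  n_true <- as.vector(adj %*% belief) + belief
  n_group <- rowSums(adj) + 1
  new_belief <- ifelse(2 * n_true > n_group, 1,
                       ifelse(2 * n_true < n_group, 0, belief))
  # Rule 2: with probability `prob`, add edges between like-minded pairs and
  # delete edges between pairs who disagree
  same <- outer(new_belief, new_belief, "==")
  flip <- matrix(runif(n^2) < prob, n, n)
  flip[lower.tri(flip, diag = TRUE)] <- FALSE
  flip <- flip | t(flip)  # one symmetric draw per pair, none on the diagonal
  adj[flip & same & (adj == 0)] <- 1
  adj[flip & (!same) & (adj == 1)] <- 0
  list(adj = adj, belief = new_belief)
}

# Share of edges that connect people with the same belief
same_share <- function(adj, belief) {
  sum(adj * outer(belief, belief, "==")) / sum(adj)
}

state <- list(adj = adj, belief = belief)
shares <- same_share(adj, belief)
for (t in 1:30) {
  state <- update_network(state$adj, state$belief)
  shares <- c(shares, same_share(state$adj, state$belief))
}
round(shares[c(1, 11, 21, 31)], 2)  # after zero, 10, 20, and 30 time steps
```

Setting `prob = 0` switches off peer selection, which is one way to study the social learning rule separately.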

The first rule describes a “social learning” process: people update their beliefs to match the majority among their friends.¹ The second rule describes a “peer selection” process: people choose friends who share the same beliefs. These two processes can lead to polarized beliefs, even if there is no polarization before the processes begin. I demonstrate this phenomenon in the figure below, which plots the beliefs and connections in a simulated network after zero, 10, 20, and 30 time steps. The figure shows how people grow increasingly connected to others with the same belief and decreasingly connected to others with the opposing belief.

The social learning and peer selection processes can lead to polarization both together and separately. I justify this claim in the figure below. The left-hand panel plots the network’s assortativity coefficient, which measures the overall correlation among friends’ beliefs. This coefficient equals one when all neighbours share the same beliefs (complete polarization) and equals zero when edges are no more likely to connect people with the same belief than they would be if formed at random. The right-hand panel plots the proportion of people in the network who update their belief at each time step. Both panels present means and 95% confidence intervals across 30 simulated networks, each with randomized initial beliefs.

The social learning process leads to positive sorting because, by construction, people increasingly share the same beliefs as their friends. The peer selection process leads to positive sorting because, by construction, edges increasingly connect people with common beliefs only. The two processes work together to isolate the subnetworks of people who believe the proposition is true and false. Interestingly, most belief updates occur very early: after about five time steps, most of the structural changes in the social network result from edge creations and deletions rather than from belief updates.

See my blog post on DeGroot learning for more discussion of social learning processes. ↩︎

Estimating sensitive parameters

Wed, 21 Oct 2020 00:00:00 +0000

Suppose some proportion $\theta$ of the population engages in a socially undesirable activity—say, evading taxes. We want to estimate $\theta$, but can’t ask people directly because they may fear penalties for incriminating themselves.

One solution to this problem is as follows. Choose another characteristic that people don’t mind reporting and for which we know the population prevalence—say, whether they are right-handed. Let $\alpha$ be the (assumedly known) proportion of the population with this characteristic. Sample $n$ people, and give them the following instructions:

Flip a fair coin, but don’t tell me what you get. If you get heads, answer the question “do you evade taxes?” If you get tails, answer the question “are you right-handed?”

Because we can’t observe the coin toss, a “Yes” response doesn’t reveal tax evasion: the respondent could be confirming that they are right-handed. This shield, hopefully, elicits truthful reporting. If it does then, by the law of total probability, the probability that someone responds “Yes” is $$p=\frac{\theta+\alpha}{2}.$$ Let $X$ be the number of people who respond “Yes.” Then $X$ is binomially distributed with $n$ trials and success rate $p$, and so has mean $\mathrm{E}[X]=np$ and variance $\mathrm{Var}(X)=np(1-p)$. Consequently, the estimator $$\hat\theta_n=2\frac{X}{n}-\alpha$$ of $\theta$ has mean $\mathrm{E}[\hat\theta_n]=\theta$ and variance $$\begin{align*} \mathrm{Var}(\hat\theta_n) &= \frac{4}{n^2}\mathrm{Var}(X) \\ &= \frac{4p(1-p)}{n} \\ &\le \frac{1}{n} \end{align*}$$ since $4p(1-p)\le1$ for any $p\in[0,1]$. Thus, $\hat\theta_n$ is an unbiased estimator of $\theta$ and becomes more precise as the sample size $n$ grows. We can quantify this precision using Chebyshev’s inequality: for any $\varepsilon>0$, we have $$\Pr(\lvert\hat\theta_n-\theta\rvert\ge\varepsilon)\le\frac{\mathrm{Var}(\hat\theta_n)}{\varepsilon^2}$$ and therefore $$\Pr(\lvert\hat\theta_n-\theta\rvert<\varepsilon)\ge1-\frac{1}{n\varepsilon^2}.$$ Thus, for example, choosing $n\ge4000$ guarantees that $\hat\theta_n$ differs from $\theta$ by less than $\varepsilon=0.05$ with probability at least 0.9.
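
The estimator can be checked by simulation. The sketch below assumes truthful responses and uses hypothetical values $\theta=0.2$ and $\alpha=0.9$; `simulate_estimate` is an illustrative helper, not part of any package.

```r
set.seed(0)
theta <- 0.2  # true share of tax evaders (hypothetical)
alpha <- 0.9  # known share of right-handers (hypothetical)
n <- 4000

simulate_estimate <- function(n, theta, alpha) {
  heads <- rbinom(n, 1, 0.5) == 1          # private coin flips
  evades <- rbinom(n, 1, theta) == 1
  right_handed <- rbinom(n, 1, alpha) == 1
  # Heads: answer the tax question; tails: answer the handedness question
  yes <- ifelse(heads, evades, right_handed)
  2 * mean(yes) - alpha
}

estimates <- replicate(1000, simulate_estimate(n, theta, alpha))
mean(estimates)                      # should be close to theta
mean(abs(estimates - theta) < 0.05)  # should be at least 0.9
```

Chebyshev’s inequality holds for any distribution with the given variance, so the simulated coverage will typically sit well above the guaranteed 0.9.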

Research funding and collaboration

Mon, 12 Oct 2020 00:00:00 +0000

Research is increasingly conducted by teams. Consequently, there is growing interest in the mechanisms underlying research team formation. In a new NBER working paper, my co-authors and I explore one potential mechanism: participation in research funding contests. Such contests may promote collaboration for several reasons:

They require proposal team members to invest resources in planning collaborative projects;
They may help researchers screen for productive collaborators;
If better ideas are more likely to win funding then success signals that researchers’ shared ideas are worth pursuing.

These arguments suggest that the members of more successful proposal teams are more likely to become co-authors. We test this hypothesis empirically, using data from New Zealand. Our data include Scopus publication records on New Zealand researchers and their international co-authors. We link these records to data on applications to the Marsden Fund, the premier source of funding for basic research in New Zealand.

In our data, researchers with more successful Marsden Fund applications tended to have more co-authors. However, this tendency may be driven by confounding factors, such as researchers’ ability to generate publishable research. We control for such factors by analysing co-authorship dynamics econometrically. Specifically, we use dyadic regression to estimate how the probability that pairs of researchers co-author in a given year varies with their observable characteristics. Pairs were more likely to co-author in a given year if

they had co-authored with each other recently,
they co-authored with others often,
they published in similar fields,
their prior publications attracted more citations, or
their prior citation histories differed.

The fifth bullet implies negative assortative mixing among the researchers in our data, which we suspect arises due to inter-generational collaboration (e.g., professors working with graduate students and post-docs).

On average, pairs were 13.8 percentage points more likely to co-author in a given year if they co-submitted Marsden Fund proposals during the previous ten years than if they did not. This co-authorship rate was not significantly larger among pairs who received funding. However, increasing the lag between our outcome and explanatory variables delivers the opposite result: funding receipt, rather than proposal submission, promotes co-authorship. As discussed in our paper, these patterns suggest that the “treatment effect” of research funding contest participation on co-authorship is limited to successful participants only.

Our analysis has both technical and policy implications. On the technical side, we discuss some empirical problems that arise when analysing co-authorship networks, offer solutions to these problems, and discuss how these solutions affect our inferences. On the policy side, we show how science funding schemes can influence how researchers choose collaborators, which may have long-term effects on how science and innovation systems evolve.

Relatedness, complexity and local growth redux

Thu, 10 Sep 2020 00:00:00 +0000

“Relatedness, Complexity and Local Growth,” co-authored with Dave Maré while I worked at Motu, has undergone peer review. A revised version was published online today and will appear in a future issue of Regional Studies.

Dave and I present a measure of the relatedness between economic activities that is more robust to noisy employment data than measures used in previous studies (e.g., Balland et al., 2019; Hidalgo et al., 2007; Rigby et al., 2019). We demonstrate this robustness using historical census data from New Zealand. We also demonstrate that relatedness patterns do not significantly influence the employment dynamics described by those data.

Our analysis suggests that the principle of relatedness applies in large geographic areas only. In our New Zealand data, the benefits of proximity are more apparent in larger cities, where workers engaged in related activities interact more frequently. Our paper highlights some of the challenges with operationalising place-based regional growth and innovation policies, such as the “smart specialisation” policies adopted in the European Union.

Read the published article (available under Open Access) for more details.

COVID-19, lockdown and two-sided uncertainty

Fri, 21 Aug 2020 00:00:00 +0000

When the COVID-19 pandemic began, the New Zealand government faced uncertainty around the virus’ health and economic consequences. Amid this uncertainty, the government had two choices: enter lockdown immediately or delay its decision. Delaying preserved the option to enter lockdown if its necessity became clearer. However, a delayed lockdown would be less effective if many people caught the virus while the government waited for clarifying information.

We know now that the government chose to enter lockdown early. Was this the best choice given information available at the time? To help answer this question, Arthur Grimes and I analyse the government’s decision in an article published last week in the New Zealand Economic Papers. Our analysis formalises, and builds on, ideas discussed in my blog post on policymaking under uncertainty and Arthur’s commentary on the lockdown at Newsroom.

Arthur and I present a two-period model of the government’s choice problem. In the first period, the government decides whether to enter lockdown given random future health and economic outcomes. These outcomes are realised in the second period, at which time the government decides whether to maintain or reverse its initial decision. That initial decision influences the joint probability distribution of health and economic outcomes, and the payoffs associated with each choice in the second period. The government’s decision rule in the first period is to choose the policy that generates the greatest net expected payoff, given the dynamic consequences of the policy chosen.

We allow payoffs to vary with a parameter capturing the government’s aversion to health risks vis-à-vis economic risks. The chart below shows how this parameter affects the payoff from each choice available in the first period. As health risk aversion rises, the government increasingly prefers policies that insure against bad health outcomes. Consequently, the value of entering lockdown rises while the value of delaying falls. The non-linearity in the payoff curves reflects the non-linearity of health and economic costs under each policy choice: delaying lockdown suppresses economic costs but exposes the government to potentially exponential health costs if the virus spreads rampantly.

See our article, “COVID-19, lockdown and two-sided uncertainty,” for further discussion.

Lessons from Dave Maré

Sun, 16 Aug 2020 00:00:00 +0000

Last week I finished up at Motu, an economic research institute where I worked for two and a half years. During that time I learned a lot from Dave Maré, who taught me several techniques for conducting rigorous, intellectually honest empirical research. This post describes three such techniques: stating your predictions, having weak priors and strong nulls, and killing off the variation.

State your predictions

The scientific method involves stating hypotheses before testing them. Dave encourages this practice at a smaller scale: before plotting figures or printing regression tables, write down what you expect to see.

Stating your predictions forces you to think about how and why variables might be related. For example, if I regressed workers’ wages on their years of education, I would expect to estimate a positive coefficient because education provides knowledge and skills that make people more employable. Likewise, if I could control for natural ability then I would expect the coefficient on education to decrease because I would remove some endogeneity bias. Forming these expectations (and their justifications) in advance makes my priors explicit, making them easier to revise when confronted with new evidence. It also insures against ex post rationalisations of the empirical patterns.

Stating your predictions also means you have two independent data sources—your predictions and your figures/tables—that you can compare to identify and correct mistakes. For example, if I estimated a negative relationship between education and wages, I would want to make sure the disagreement between my intuition and my estimate was not due to errant definitions of the variables in my data.

Have weak priors and strong nulls

Priors are beliefs held before gathering new evidence. In empirical research, we usually derive priors from intuitive or logical reasoning (e.g., “education provides knowledge and skills that make people more employable”). However, the world is more complicated than can be described by intuition and logic; people behave in unexpected and unpredictable ways. Consequently, our priors can be incorrect or incomplete. To have “weak priors” is to acknowledge such ignorance and to let your beliefs be guided by empirical evidence rather than by fallible reasoning.

However, empirical evidence comes in varying strengths. To have “strong nulls” is to graduate from “ignorant” to “informed” only when supplied with strong evidence. For example, if significant relationships persist after controlling for potentially confounding factors then those relationships are likely to reflect the true data-generating process.

Kill off the variation

Empirical models describe relationships between variables. These relationships may not be first-order: the mechanisms that we think operate, and that our models aim to capture, may not be central to the stories playing out in our data. To determine the centrality of our hypothesised mechanisms, Dave suggests trying to “kill off the variation:” add explanatory variables until the coefficients on our covariates of interest become insignificant.

For example, in “Relatedness, Complexity and Local Growth,” Dave and I analyse the relationship between local activity growth rates and several covariates that capture the prevalence of local employee interactions. In theory, such interactions foster the growth of “complex” activities that build on existing local strengths. However, in our data, most of the variation in local activity growth is explained by the growth experienced by the city and activity as a whole, and our chosen covariates provide no additional explanatory power. Thus, while employee interactions may influence employment dynamics at the margin, such interactions are not central to the story of how New Zealand cities evolved during our period of study.

Product-maximising partitions

Wed, 08 Jul 2020 00:00:00 +0000

Let $\newcommand{\N}{\mathbb{N}}\N=\{1,2,\ldots\}$ be the set of positive integers. A partition of $n\in\N$ is a way of writing $n$ as a sum of positive integers, called parts. For example, $1+2+3$ is a partition of $6$, with parts $1$, $2$, and $3$. Partitions are unique up to rearrangement: $1+2+3$ and $3+2+1$ are the same partition, but $1+2+3$ and $3+3$ are different partitions.

This post discusses the following problem:

Let $n\ge2$ be a positive integer. Find a partition of $n$ whose parts have maximum product.

For example, the parts in $1+2+3$ have product $1\times2\times3=6$, while the parts in $3+3$ have product $3\times3=9$. Our goal is to find a product-maximising partition for arbitrary $n$.

Let $x_1+x_2+\cdots+x_k$ be a partition of $n$. If $x_1=1$ then $k\ge2$ (since $n\ge2$) and $$\begin{align} \prod_{i=1}^kx_i &= 1\times x_2\times\prod_{i=3}^kx_i \\ &< (1+x_2)\times\prod_{i=3}^kx_i \end{align}$$ because the $x_i$ are strictly positive.¹ Thus, replacing the partition $x_1+x_2+\cdots+x_k$ with $(1+x_2)+x_3+\cdots+x_k$ delivers a greater product. Since the $x_i$ can be rearranged arbitrarily, it follows that product-maximising partitions contain no parts equal to one. Similarly, if $x_1>4$ then $$\begin{align} \prod_{i=1}^kx_i &= x_1\times\prod_{i=2}^kx_i \\ &< 3(x_1-3)\times\prod_{i=2}^kx_i, \end{align}$$ so we can obtain a greater product by replacing $x_1+x_2+\cdots+x_k$ with $3+(x_1-3)+x_2+\cdots+x_k$. It follows that product-maximising partitions contain no parts greater than four. But $2\times2=4$ and $2+2=4$, so we can replace each four with two twos without reducing the parts’ product. Thus, we can obtain a product-maximising partition using only twos and threes. Finally, if a partition contains three twos then we should replace them with two threes, since $2+2+2=3+3$ but $2^3=8<9=3^2$.

To summarise, we can obtain a product-maximising partition using only twos and threes, with as many threes as possible. Letting $n=3q+r$ for some $q\in\N\cup\{0\}$ and $r\in\{0,1,2\}$, the maximum product we can obtain is $$P(n)=\begin{cases}3^q&\text{if}\ r=0\\ 2^2\times3^{q-1}&\text{if}\ r=1\\ 2\times3^q&\text{if}\ r=2.\end{cases}$$ We can approximate this solution by relaxing the integrality constraint on the $x_i$. For any given $k$, we can find the vector $x^*$ that solves $$\newcommand{\R}{\mathbb{R}}\max_{x\in\R_+^k}\prod_{i=1}^kx_i\ \text{subject to}\ \sum_{i=1}^kx_i=n \tag{1},$$ where $\R_+$ is the set of positive real numbers. This vector has $x_i^*=n/k$ for each $i\in\{1,2,\ldots,k\}$, so that $\prod_{i=1}^kx_i^*=(n/k)^k$.² If there were no integrality constraint on $k$ then we could maximise $(n/k)^k$ by choosing $k=n/e$, where $e\approx2.718$ is Euler’s number. But $k$ must be an integer, so we should round it to the nearest integer in whatever direction delivers the greatest value of $(n/k)^k$. Doing so delivers an estimate $$\hat{P}(n)=\max\left\{\left(\frac{n}{\lfloor n/e\rfloor}\right)^{\lfloor n/e\rfloor},\left(\frac{n}{\lceil n/e\rceil}\right)^{\lceil n/e\rceil}\right\}$$ of $P(n)$, where $x\mapsto\lfloor x\rfloor$ and $x\mapsto\lceil x\rceil$ are the floor (“round down”) and ceiling (“round up”) functions.
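
Both formulas translate directly into R. In the sketch below, `max_product` implements $P(n)$ and `approx_product` implements $\hat{P}(n)$; `brute_force` is a hypothetical cross-check that finds the maximum product by dynamic programming over the first part of each partition. All three names are illustrative.

```r
# P(n): use as many threes as possible, topping up with twos
max_product <- function(n) {
  stopifnot(n >= 2)
  q <- n %/% 3
  r <- n %% 3
  if (r == 0) 3^q else if (r == 1) 4 * 3^(q - 1) else 2 * 3^q
}

# P-hat(n): round k = n / e down and up, and keep the larger value of (n / k)^k
approx_product <- function(n) {
  k <- c(floor(n / exp(1)), ceiling(n / exp(1)))
  k <- k[k >= 1]  # drop k = 0, which arises when n = 2
  max((n / k)^k)
}

# Cross-check: best[m] is the maximum product over all partitions of m
brute_force <- function(n) {
  best <- numeric(n)
  best[1] <- 1
  for (m in 2:n) {
    parts <- 1:(m - 1)
    best[m] <- max(m, parts * best[m - parts])
  }
  best[n]
}

max_product(10)     # 36
approx_product(10)  # 39.0625
all(sapply(2:30, max_product) == sapply(2:30, brute_force))  # TRUE
```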

The table below compares $P(n)$ and $\hat{P}(n)$ for various $n$. Since $\{2,3\}\subset\mathbb{R}_+$, the partition of $n$ using twos and as many threes as possible is a feasible, but not necessarily optimal, solution to $(1)$. Thus $P(n)\le\hat{P}(n)$ for each $n\ge2$. The multiplicative error $\hat{P}(n)/P(n)$ grows exponentially with $n$ because the exponent $k\in\{\lfloor n/e\rfloor,\lceil n/e\rceil\}$ of $(n/k)^k$ grows (increasingly linearly) with $n$, amplifying the error in the approximation $n/k\sim e$ to each part in the partition underlying $P(n)$.

$n$	$P(n)$	$\hat{P}(n)$	$\hat{P}(n)/P(n)$
2	2	2	1.00
3	3	3	1.00
4	4	4	1.00
5	6	6.25	1.04
10	36	39.06	1.09
50	8.61×10⁷	9.70×10⁷	1.13
100	7.41×10¹⁵	9.47×10¹⁵	1.28
500	3.19×10⁷⁹	7.66×10⁷⁹	2.40
1,000	1.01×10¹⁵⁹	5.86×10¹⁵⁹	5.78

If $j>k$ then $\prod_{i=j}^kx_i=1$ by convention. ↩︎
One can derive $x_i^*=n/k$ using the method of Lagrange multipliers. ↩︎

Understanding selection bias

Fri, 03 Jul 2020 00:00:00 +0000

Suppose we have data $\{(x_i,y_i):i\in\{1,2,\ldots,n\}\}$ generated by the process $$y_i=\beta x_i+u_i,$$ where the $u_i$ are random errors with zero means, equal variances, and zero correlations with the $x_i$. This data generating process (DGP) satisfies the Gauss-Markov assumptions, so we can obtain an unbiased estimate $\hat\beta$ of the coefficient $\beta$ using ordinary least squares (OLS).

Now suppose we restrict our data to observations with $x_i\ge0$ or $y_i\ge0$. How will these restrictions change $\hat\beta$?

To investigate, let’s create some toy data:

library(dplyr)

n <- 100
set.seed(0)
df <- tibble(x = rnorm(n), u = rnorm(n), y = x + u)

Here $x_i$ and $u_i$ are standard normal random variables, and $y_i=x_i+u_i$ for each observation $i\in\{1,2,\ldots,100\}$. Thus $\beta=1$. The OLS estimate of $\beta$ is $$\DeclareMathOperator{\Cov}{Cov}\DeclareMathOperator{\Var}{Var}\hat\beta=\frac{\Cov(x,y)}{\Var(x)},$$ where $x=(x_1,x_2,\ldots,x_{100})$ and $y=(y_1,y_2,\ldots,y_{100})$ are data vectors, $\Cov$ is the covariance operator, and $\Var$ is the variance operator. For these data, we have

cov(df$x, df$y) / var(df$x)

## [1] 1.138795

as our estimate with no selection.

Next, let’s introduce our selection criteria:

df <- df %>%
  tidyr::crossing(criterion = c('x >= 0', 'y >= 0')) %>%
  rowwise() %>%  # eval is annoying to vectorise
  mutate(selected = eval(parse(text = criterion))) %>%
  ungroup()

df

## # A tibble: 200 x 5
##        x       u      y criterion selected
##    <dbl>   <dbl>  <dbl> <chr>     <lgl>   
##  1 -2.22 -0.0125 -2.24  x >= 0    FALSE   
##  2 -2.22 -0.0125 -2.24  y >= 0    FALSE   
##  3 -1.56 -1.12   -2.68  x >= 0    FALSE   
##  4 -1.56 -1.12   -2.68  y >= 0    FALSE   
##  5 -1.54  0.577  -0.963 x >= 0    FALSE   
##  6 -1.54  0.577  -0.963 y >= 0    FALSE   
##  7 -1.44 -1.39   -2.83  x >= 0    FALSE   
##  8 -1.44 -1.39   -2.83  y >= 0    FALSE   
##  9 -1.43 -0.543  -1.97  x >= 0    FALSE   
## 10 -1.43 -0.543  -1.97  y >= 0    FALSE   
## # … with 190 more rows

Now df contains two copies of each observation—one for each selection criterion—and an indicator for whether the observation is selected by each criterion. We can use df to estimate OLS coefficients and their standard errors among observations with $x_i\ge0$ and $y_i\ge0$:

df %>%
  filter(selected) %>%
  group_by(criterion) %>%
  summarise(n = n(),
            estimate = cov(x, y) / var(x),
            std.error = sd(y - estimate * x) / sqrt(n))

## # A tibble: 2 x 4
##   criterion     n estimate std.error
##   <chr>     <int>    <dbl>     <dbl>
## 1 x >= 0       48    1.02      0.136
## 2 y >= 0       47    0.356     0.110

The OLS estimate among observations with $x_i\ge0$ approximates the true value $\beta=1$ well. However, the estimate among observations with $y_i\ge0$ is much smaller than one. We can confirm this visually:

What’s going on? Why do we get biased OLS estimates of $\beta$ among observations with $y_i\ge0$ but not among observations with $x_i\ge0$?

The key is to think about the errors $u_i$ in each case. Since the $x_i$ and $u_i$ are independent, selecting observations with $x_i\ge0$ leaves the distributions of the $u_i$ unchanged—they still have zero means, equal variances, and zero correlations with the $x_i$. Thus, the Gauss-Markov assumptions still hold and we still obtain unbiased OLS estimates of $\beta$.

In contrast, the $x_i$ and $u_i$ are negatively correlated among observations with $y_i\ge0$. To see why, notice that if $y_i=x_i+u_i$ then $y_i\ge0$ if and only if $x_i\ge-u_i$. So if $x_i$ is low then $u_i$ must be high (and vice versa) for the observation to be selected. Thus, among selected observations, we have $$u_i=\rho x_i+\varepsilon_i,$$ where $\rho<0$ indexes (and, in this case, equals) the correlation between the $x_i$ and $u_i$, and where the residuals $\varepsilon_i$ are uncorrelated with the $x_i$. Our DGP then becomes $$y_i=(\beta+\rho)x_i+\varepsilon_i.$$ The $\varepsilon_i$ have equal variances (equal to $1+\rho^2$ in this case) and, again, are uncorrelated with the $x_i$. Therefore, the OLS estimate $$\hat\rho=\frac{\Cov(u,x)}{\Var(x)}$$ of $\rho$ is unbiased¹, and for our toy data equals $\hat\rho\approx-0.644$ among observations with $y_i\ge0$. Subtracting $\hat\rho$ from $\hat\beta$ then gives $$\begin{align} \hat\beta-\hat\rho &\approx 0.356 - (-0.644) \\ &= 1, \end{align}$$ recovering the true value $\beta=1$.

The table below reports 95% confidence intervals for $\hat\beta$, $\hat\rho$, and $(\hat\beta-\hat\rho)$, estimated by simulating the DGP $y_i=x_i+u_i$ described above 100 times. The table confirms that the OLS estimate $\hat\beta$ of $\beta=1$ is unbiased among observations with $x_i\ge0$ but biased negatively among observations with $y_i\ge0$.

Observations	$\hat\beta$	$\hat\rho$	$\hat\beta-\hat\rho$
All	1.005 ± 0.002	0.005 ± 0.002	1.000 ± 0.000
With $x_i\ge0$	1.001 ± 0.004	0.001 ± 0.004	1.000 ± 0.000
With $y_i\ge0$	0.547 ± 0.003	-0.453 ± 0.003	1.000 ± 0.000
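
A simulation along the lines of the table above can be sketched in base R. The sample size of 100 per run and the helper names `ols_slope` and `simulate_once` are assumptions, so the output need not match the table's figures; the sketch only illustrates the pattern.

```r
set.seed(0)

ols_slope <- function(x, y) cov(x, y) / var(x)

simulate_once <- function(n = 100) {
  x <- rnorm(n)
  u <- rnorm(n)
  y <- x + u  # true beta = 1
  keep <- list(all = rep(TRUE, n), x_nonneg = x >= 0, y_nonneg = y >= 0)
  # Estimate beta (slope of y on x) and rho (slope of u on x) in each subsample
  sapply(keep, function(s) {
    c(beta_hat = ols_slope(x[s], y[s]), rho_hat = ols_slope(x[s], u[s]))
  })
}

sims <- replicate(100, simulate_once())  # 2 x 3 x 100 array
apply(sims, c(1, 2), mean)               # mean estimates by subsample
```

Because $y_i-u_i=x_i$, the difference $\hat\beta-\hat\rho$ equals one in every run up to floating-point error, whichever observations are selected.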

The estimate $\hat\beta$ always differs from $\beta$ by $\hat\rho$, which is significantly non-zero among observations with $y_i\ge0$. However, this pattern is not useful empirically because we generally don’t observe the $u_i$ and so can’t compute $\hat\rho$ to back out $\beta=\hat\beta-\hat\rho$. Instead, we may use the Heckman correction to adjust for the bias introduced through non-random selection.

In empirical settings, selecting observations with $x_i\ge0$ may lead to biased estimates when (i) there is heterogeneity in the relationship between $y_i$ and $x_i$ across observations $i$, and (ii) OLS is used to estimate an average treatment effect.² In particular, if the $x_i$ are correlated with the observation-specific treatment effects then restricting to observations with $x_i\ge0$ changes the distribution, and hence the mean, of those effects non-randomly.

We can rewrite $\varepsilon_i=\alpha+(\varepsilon_i-\alpha)$, where $\alpha$ is the mean of the $\varepsilon_i$, and where the $(\varepsilon_i-\alpha)$ have zero means, equal variances, and zero correlations with the $x_i$. ↩︎
Thanks to Shakked for pointing this out. ↩︎

Modelling bacterial extinction

Sun, 14 Jun 2020 00:00:00 +0000

This week’s Riddler Classic poses a question about bacteria (paraphrased for brevity):

Each bacterium in a colony splits into two copies with probability $p$ and dies with probability $(1-p)$. If the colony starts with one bacterium, what is the probability that the colony survives forever?

We can model the colony’s size as a Galton-Watson process. Let $X_t$ be the colony’s size in generation $t\in\{1,2,\ldots\}$ and let $Y_{it}$ be the number of offspring generated by bacterium $i\in\{1,2,\ldots,X_t\}$. The $Y_{it}$ are independently and identically distributed, with $$\Pr(Y_{it}=y)=\begin{cases} p & \text{if}\ y=2 \\ 1-p & \text{if}\ y=0 \\ 0 & \text{otherwise} \end{cases}$$ for each $i$ and $t$. The colony’s size grows according to $$X_{t+1}=\sum_{i=1}^{X_t}Y_{it}$$ with $X_1=1$. Our goal is to compute $$\lim_{t\to\infty}\Pr(X_t>0)=1-\lim_{t\to\infty}q_t,$$ where $q_t\equiv\Pr(X_t=0)$ is the probability that the colony is extinct by generation $t$.

We can compute $q_t$ by conditioning on $Y_{11}$, the number of offspring generated by the first bacterium. If $Y_{11}=0$ then the colony is extinct from the second generation onwards. However, if $Y_{11}=2$ then there are two sub-colonies in the second generation that must be extinct in $(t-1)$ generations if the whole colony is extinct by generation $t$. These sub-colonies grow independently, so the probability that both are extinct in $(t-1)$ generations is $q_{t-1}^2$. Thus, by the law of total probability, we have $$\begin{align} q_t &= \Pr(X_t=0\,\vert\,Y_{11}=0)\Pr(Y_{11}=0)+\Pr(X_t=0\,\vert\,Y_{11}=2)\Pr(Y_{11}=2) \\ &= 1\times(1-p)+q_{t-1}^2\times p \\ &= 1-p+pq_{t-1}^2 \end{align}$$ for $t\ge2$. Defining $q\equiv\lim_{t\to\infty}q_t$ and taking limits as $t\to\infty$ gives¹ $$q=1-p+pq^2,$$ which has solutions $$\begin{align} \newcommand{\abs}[1]{\lvert#1\rvert} q &= \frac{1\pm\sqrt{1-4p(1-p)}}{2p} \\ &= \frac{1\pm\sqrt{(2p-1)^2}}{2p} \\ &= \frac{1\pm\abs{2p-1}}{2p}. \end{align}$$ The larger solution exceeds unity when $p<0.5$, which we cannot have because $q$ is a probability. When $p\ge0.5$ the larger solution equals one, but the $q_t$ increase from $q_1=0$ and never exceed the smaller solution, so they converge to it. Thus $$\lim_{t\to\infty}\Pr(X_t>0)=1-\frac{1-\abs{2p-1}}{2p}.$$ For example, if $p=0.8$ then the colony survives forever with probability $$1-\frac{1-\abs{2\times0.8-1}}{2\times0.8}=0.75.$$ If $p\le0.5$ then extinction is guaranteed because each bacterium generates at most one offspring on average.
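
The recursion is easy to check numerically. The sketch below iterates $q_t=1-p+pq_{t-1}^2$ from $q_1=0$ and compares the result with the closed-form solution; the function names and the choice of 1,000 generations are illustrative.

```r
# Iterate the recursion for the extinction probability q_t
extinction_prob <- function(p, generations = 1000) {
  q <- 0  # q_1 = Pr(X_1 = 0) = 0 because the colony starts with one bacterium
  for (t in seq_len(generations)) {
    q <- 1 - p + p * q^2
  }
  q
}

# Smaller root of q = 1 - p + p * q^2
closed_form <- function(p) (1 - abs(2 * p - 1)) / (2 * p)

1 - extinction_prob(0.8)  # approximately 0.75
1 - closed_form(0.8)      # approximately 0.75
1 - extinction_prob(0.3)  # approximately 0
```

The iteration converges quickly when $p$ is far from 0.5 and slowly when $p$ is close to 0.5, so more generations may be needed there.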

The function $x\mapsto x^2$ is continuous and so preserves limits. ↩︎

Applying to economics PhD programs

Sat, 13 Jun 2020 00:00:00 +0000

Last year I applied to several economics PhD programs at elite universities and business schools. I applied to twelve programs (nine in economics and three in business), was accepted by three, and chose to study at Stanford. This post describes my experience with the application process and offers some advice to future applicants.

Before applying

The programs I applied to accepted applications between late September and early December. However, these applications depended on tasks completed earlier: earning a degree, gaining research experience, completing the Graduate Record Exam (GRE), and choosing where to apply.

Earning a degree

Every program required that I hold the equivalent of a four-year bachelor’s degree or higher. Most stated explicitly that a master’s was not necessary. Some stated explicitly that applicants need not have a major in economics, but some prior coursework (e.g., intermediate microeconomics) helps to signal interest and familiarity. Most stated explicitly that applicants should be comfortable with undergraduate-level calculus, linear algebra, and probability and statistics.

Gaining research experience

While not required explicitly, my impression is that most successful applicants to top programs have some research experience. Such experience helps demonstrate that you know what research is and can conduct it successfully. Moreover, everyone applying to top programs has stellar grades, so having research experience helps you stand out.

Thankfully, there are many ways to gain research experience. I have four recommendations.

First, write an honours or master’s thesis. Doing so provides early evidence that you’re interested in research and can work independently.

Second, work with professors while studying. The University of Canterbury (UC), where I completed my bachelor’s degree, offers scholarships to work with professors during summer breaks. I won one to work with Richard Watt on a theoretical project related to insurance pricing. Completing the project gave me experience to discuss in my statement of purpose and gave Richard something to discuss in his recommendation letter.

Third, work at a research-oriented organisation after finishing your bachelor’s or master’s. In New Zealand, the best place is Motu or the Reserve Bank, depending on whether you’re more interested in microeconomics or macroeconomics. Working at Motu has improved my technical and research skills, and given me experience working with respected economists on substantive research projects. It has also helped clarify what a “research career” looks like and whether it’s something I want to pursue.

Finally, consider completing a pre-doctoral fellowship at an elite university. These fellowships typically last one or two years, and involve assisting professors with their research. Pre-doctoral fellowships deliver similar benefits to working at places like Motu. However, some fellowships (e.g., those offered by Opportunity Insights at Harvard and SIEPR at Stanford) allow you to take graduate courses while working, further strengthening your profile. Moreover, working with well-known economists at elite universities (and impressing them) helps you gain strong recommendation letters.

Completing the GRE

All programs required official scores from the (general) GRE, a standardised test comprising three sections: quantitative reasoning, verbal reasoning, and analytical writing. The test can be attempted multiple times. Programs consider only your highest score on each section.

I sat the GRE once, in 2018. The test took about four hours. The quantitative and verbal reasoning sections each comprised two sets of 20 multi-choice questions. The quantitative section was mostly high school-level mathematics. (New Zealanders: think NCEA Level 1 or 2.) The verbal section tested reading comprehension and vocabulary. The analytical writing section comprised two short, typed essay responses to prompts given during the test. I think anyone who recently earned a bachelor’s degree in economics could do well on the test with 2–4 weeks of study.

Jones et al. (2020) survey graduate admissions coordinators, who report placing more emphasis on quantitative reasoning scores than verbal reasoning scores when evaluating applicants. Both scores are less important at higher ranked programs because applicants to such programs tend to have higher scores, leaving less variation for identifying applicants’ relative abilities. For example, Harvard’s economics department states that admitted candidates’ quantitative scores range “in the 97th percentile.” I scored in the 94th percentile and would have resat the test if I had scored any lower.

Choosing where to apply

I applied to most programs in the “top 10,” and a few more specialised programs that matched my interests and geographic preferences. I figured that if I was going to move overseas, away from my family and friends, then I had better go somewhere excellent. If I had a weaker technical background or less research experience then I might have aimed lower.

Beyond this “aim high” strategy, I have two recommendations.

First, apply to as many programs as you can afford and would attend. The marginal effort cost of applying to each program falls quickly after preparing your first set of application materials. Moreover, although the application fees can sting, they are small compared to the expected gain in life satisfaction from being admitted.

Second, apply to programs at business schools as well as economics departments. Chicago, Harvard, Northwestern, NYU, and Stanford’s business schools all offer excellent economics-focused PhD programs. They provide similar technical training and faculty access to “traditional” programs. However, business schools tend to offer larger stipends and require less teaching than economics departments. Business schools tend to make fewer offers, but they also tend to receive fewer applications.

Application materials

All of the programs I applied to required the following materials:

An application form, submitted online;
Copies of my academic transcripts;
Official GRE score reports;
Recommendation letters;
A CV;
A statement of purpose.

Most programs required a writing sample. Some required a (short) diversity statement. All required payment of a 75–125 USD application fee.

Overall, it took about a month to prepare my application materials and about a day to tailor them to each program. To track my progress and help manage my time, I maintained a checklist of form sections to complete and materials to upload.

Transcripts

Stanford asked for official copies of my academic transcripts. All other programs accepted “unofficial” copies. I ordered a digital copy from UC, which set up a My eQuals account with my transcript uploaded as a PDF and certified by the UC registrar. I shared this certified version with Stanford, saving me about 190 USD worth of third-party certification fees. I downloaded the PDF version from My eQuals and used it as the unofficial copy for my other applications.

In addition to transcripts, some schools asked for more information about my prior coursework. Harvard and MIT asked for comprehensive lists of course codes and titles, dates completed, grades obtained, and textbooks used. Other programs asked for similar information but only for the handful of “most advanced” courses I’d taken in economics, mathematics, and statistics. Stanford asked me to match the courses I’d taken with courses offered at Stanford. The matching took a while because the courses I took at UC often matched Stanford courses in different subject areas and at different degree levels.

New Zealand universities use a nine-point GPA system, whereas the universities I applied to use a four-point system. Some programs asked me to report my GPA on its original scale, some asked me to convert it to the four-point scale, and some asked me to leave the GPA field blank. Overall, the difference in systems didn’t seem to be problematic.

GRE score reports

All programs asked for official GRE score reports. The testing fee (205 USD) covers the cost of sending scores to up to four institutions, nominated on test day. Sending scores to additional institutions costs 27 USD per institution. I didn’t nominate any schools on test day because I wasn’t sure whether I would need to resit the test, or whether sending low scores would hurt my admissions chances even if I resat the test and performed better. Once I sent my score reports, most programs confirmed receipt after about a week.

Recommendation letters

All programs asked me to nominate three recommendation letter writers. I arranged my recommenders about two months in advance. I gave each a list of programs I was applying to, a description of each program, and the due date for their letters. I also provided copies of my CV, transcript, and draft statements of purpose.

Whenever I nominated a recommender, I was asked whether I wanted to waive my FERPA right to view their letter upon admission. I always waived. I wasn’t concerned that my recommenders would change what they wrote if they knew I could read their letters. Instead, I was concerned that admissions committees would observe that I chose not to waive access, assume that my recommenders responded by providing stronger-than-truthful recommendations, and subsequently discount the quality of those recommendations.

Statements of purpose

All programs asked for a statement describing my preparation for graduate study, my research experience and interests, and my career goals. The statement I submitted to Stanford contained

a brief introduction,
a paragraph describing my educational background,
five paragraphs describing my research experience,
a paragraph stating my research interests, and
a paragraph stating my career goals.

I focused on my research experience because I felt that it was my comparative advantage over other applicants, who I assumed were well-trained technically and had more prestigious alma maters.

Writing samples

Most programs asked for a writing sample. Some programs required at least 15 pages; some required at most 10 pages. In both cases, I used an excerpt from my most recent journal submission. For long samples, I excluded figures and tables, which happened to leave 15 pages. For short samples, I included only the first eight pages, which contained the introduction, literature review, method, and data sections. I always included a cover page describing the excerpt and stating the full paper’s abstract.

I could have submitted my honours thesis, which analysed a theoretical model of insurance and saving. However, I felt that my academic transcript signalled my technical skills adequately. Instead, I wanted my writing sample to demonstrate skills not demonstrated by other application materials: identifying interesting and important research questions, and synthesising literature.

Diversity statements

Stanford and Yale asked me to explain how I would contribute to diversity on campus. My response to Stanford read as follows:

I grew up in Wakefield, a small rural town in New Zealand. I have been fortunate to attend university, to discover my passion for research, and to collaborate on research projects with economists from Europe and North America. These projects have benefited from the diverse ideas and experiences of my collaborators, which have increased the quality of our work.

I am excited to continue engaging with ideas in an inclusive research environment as a graduate student at Stanford. I am also excited to share my cultural experiences in New Zealand with my Stanford classmates, and to learn about their experiences in other countries. Doing so will increase our understanding of how different cultural values shape economic and social outcomes. This understanding will enhance our ability to conduct globally relevant economic research that considers a range of perspectives.

After applying

Clicking “submit” on the online application forms began the long (about three-month) wait for responses. In two cases, those responses were invitations for interviews; in most cases, they were admissions decisions.

Waiting for responses

On waiting for responses, I offer three pieces of advice.

First, take a break. Applying to PhD programs takes many years of effort earning a degree, gaining research experience, building relationships with recommendation letter writers, completing the GRE, and preparing your applications. Make time to acknowledge and celebrate that effort.

Second, realise that there is nothing you can do (except, if invited, prepare for interviews) to change your admissions decisions. Worrying is futile. Instead, try to find fun and engaging ways to spend your time that take your mind off your applications. I ran a lot and worked on some blog posts.

Third, try to stay off Urch and TheGradCafe. In late January, people will start using those fora to share their anxiety and admissions results. You will, after months of waiting, be hungry for news. However, if you’re going to get good news then you will receive it from the program first. Programs generally send all acceptances at the same time (or, at least, on the same day). Thus, online fora can only deliver bad news: others received acceptance notifications but you did not.

Interviews

As far as I know, only business schools conduct interviews. I interviewed for the business programs at Harvard and MIT, in late January and early February. Both interviews comprised discussing my research experience and interests, and why those interests are best pursued at a business school. The interviews lasted about fifteen minutes each and took place over Zoom.

Admissions decisions

Most programs sent admissions decisions in late February or early March. Each decision was an acceptance, a rejection, or a place on a wait list. The program that wait-listed me was weaker than my best offer at the time, so I declined my place promptly to help the market clear.

Estimating research field similarities

Sat, 30 May 2020 00:00:00 +0000

Research often draws on multiple fields, each contributing field-specific ideas and techniques to the production of new knowledge. The more similar are two fields, the easier it is to combine their ideas and techniques, the more frequently such combination occurs, and the more demand there is for ways to publish the consequent research. Likewise, the more similar are two fields, the easier it is to attract (subscription fee-paying) readers to journals covering those fields, and so the more willing publishers are to supply such journals. Thus, in equilibrium, the frequency with which journals cover pairs of research fields rises with the similarity between those fields.

This argument suggests that we can estimate research field similarities from data on journals and the fields they cover. One source of such data is the Scopus source list, which matches journals to fields within Scopus’ All Science Journal Classification (ASJC) system. The Scopus source list covers 24,039 active journals, each assigned to one or more of 26 ASJC fields.¹ Each of these fields belongs to one of four subject areas: Health, Life, Physical, and Social Sciences. The bar chart below presents the distribution of journals across fields, with bars coloured by subject area.²

I estimate the similarity between ASJC fields as follows. First, I count the number of journals assigned to each pair of fields. I then divide these co-assignment counts by the number of journals assigned to at least one of the paired fields. This normalisation delivers the Jaccard similarities between the sets of journals assigned to each field.
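The normalisation described above can be sketched in R as follows. The journal-by-field matrix `assignments` is a made-up toy example (five journals, three fields), not the Scopus data, and the function name `jaccard` is an assumption introduced for illustration.

```r
# Hypothetical assignment matrix: rows are journals, columns are fields,
# and a 1 means the journal is assigned to the field
assignments <- matrix(
  c(1, 1, 0,
    1, 0, 0,
    0, 1, 1,
    1, 1, 0,
    0, 0, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("J", 1:5), c("A", "B", "C"))
)

jaccard <- function(a) {
  co <- crossprod(a)                 # co-assignment counts for each field pair
  n <- diag(co)                      # number of journals assigned to each field
  either <- outer(n, n, "+") - co    # journals assigned to at least one field in the pair
  co / either
}

sim <- jaccard(assignments)
sim
```

In this toy example, fields A and B share two of the four journals assigned to either field, so their Jaccard similarity is 0.5. A field with no journals at all would produce a division by zero, which this sketch does not handle.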

On average, each ASJC field pair shares 62.43 co-assignments and a Jaccard similarity of 0.02. About 83% of pairs share at least one journal co-assignment. The table below presents the ten field pairs with the greatest Jaccard similarities.

Field 1	Field 2	Co-assignments	Jaccard similarity
Arts and Humanities	Social Sciences	2,247	0.29
Materials Science	Physics and Astronomy	403	0.23
Chemical Engineering	Chemistry	222	0.19
Engineering	Materials Science	564	0.18
Business, Management and Accounting	Economics, Econometrics and Finance	359	0.18
Agricultural and Biological Sciences	Environmental Science	459	0.16
Computer Science	Mathematics	395	0.15
Chemistry	Materials Science	252	0.15
Computer Science	Engineering	492	0.14
Engineering	Physics and Astronomy	383	0.12

We can visualise the similarities between ASJC fields by constructing a network in which (i) nodes represent fields and (ii) edges have weights proportional to incident nodes’ similarities. I present this network below, restricting my visualisation to the sub-network induced by the 50 edges of largest weight. To improve readability, I label some nodes using the field abbreviations given in parentheses in the bar chart above. I draw fields with greater similarities closer together.

Overall, fields tend to be most similar to other fields in the same subject area. The proximities among nodes, reflecting fields’ pairwise similarities, seem intuitive: Chemistry (Chem) and Chemical Engineering (ChemEng) are obviously similar, the biological sciences are clustered together, and Astronomy researchers probably don’t read many Nursing journals—indeed, there are no journal co-assignments between Physics and Astronomy (PhysAstr) and Nursing.

The paths between fields also make sense. For example, Social Sciences (SocSci) relies on Neuroscience (Neur) to the extent that it helps explain how people think and behave, which suggests that the fields should be connected via Psychology (Psyc). Likewise, Business, Management and Accounting (BusMgtAcc) relies on Mathematics (Math) to the extent that it helps model how people make decisions, which suggests that the fields should be connected via Decision Sciences (DscnSci).

I exclude the 27th field, “Multidisciplinary,” from my analysis. ↩︎
I count journals “fractionally” so that, for example, journals assigned to four fields contribute a quarter to each field’s count. ↩︎

Policymaking under uncertainty

Sun, 17 May 2020 00:00:00 +0000

The COVID-19 pandemic exposes policymakers to several sources of uncertainty:

How dangerous is the virus?
What will be the consequences of policies introduced to combat the virus?
How willing and able are people to tolerate those consequences?

Under this uncertainty, policymakers must determine which policies to introduce and when to introduce them. On the one hand, acting early may prevent the virus from spreading uncontrollably; on the other, delaying action allows policymakers to collect more information about the virus and the policies best suited to combat it.

This post compares two approaches to policymaking under uncertainty: “commit early,” and “wait and learn.” These approaches highlight the trade-off between acting decisively and waiting for more information. I discuss this trade-off in the context of the New Zealand (NZ) government’s response to COVID-19. However, my discussion also applies in other (non-NZ and non-pandemic) contexts.

To frame my discussion, consider a policymaker (PM) in a world with three time periods. Each period, the world moves into one of many “states” that partition the set of possible futures (e.g., “recession” and “no recession”). Moving between the first and second periods provides more, but not complete, information about the probability distribution of period three states. The PM can influence this distribution by implementing policies in the first or second period.

Under the “commit early” approach, the PM implements policies in the first period based solely on information available in that period. To the extent that these policies are expensive and irreversible, their implementation signals the PM’s confidence in the policies’ necessity and efficacy. This signal can help prevent public dissent. For example, NZ’s relatively early transition to COVID-19 Alert Level 4 signalled our government’s strong belief that the cost of letting the virus spread outweighed the cost of entering a nationwide lockdown. This signal probably made New Zealanders more willing to tolerate the consequences of staying home—they were told, in no uncertain terms, that doing so would “save lives.”

Committing early also provides information about future conditions to households and firms, who may be less informed than the PM about the distribution of future states. For example, one of the NZ government’s earliest actions during the COVID-19 pandemic was to implement a wage subsidy scheme that provided 12 weeks of financial support to firms and their employees. The scheme signalled that our government expected the economic downturn caused by the virus to last at least 12 weeks. This signal allowed households and firms to calibrate their expectations about, and adjust their behaviour to prepare for, the future.

One problem with the “commit early” approach is that the PM may commit to policies that appear optimal ex ante but turn out to be sub-optimal ex post. As time passes, the PM gains more information about the distribution of future states and about which policies are most likely to deliver favourable states. Consequently, the PM may end up regretting committing to, and paying for, policies they wouldn’t have chosen if they had waited for more information.

This regret can be avoided by adopting a “wait and learn” approach. Under this approach, the PM delays implementing policies until more information about their relative merits arrives in the second period. This delay allows the PM to preserve the real options that would otherwise be destroyed by implementing irreversible policies in the first period. For example, delaying wage subsidy payouts until the economic impacts of COVID-19 were clearer may have allowed the NZ government to avoid giving money to businesses that didn’t need it.

However, delaying policy decisions may also delay decisions made by households and firms, who rely on policies as coordination devices. For example, delaying the decision to allow inter-regional transport may delay freight companies’ decisions to schedule inter-regional shipments, which, in turn, delays production decisions by firms with inter-regional supply chains. These supply-side delays may induce undesirable demand-side responses, such as “panic buying” food and homeware. The more people’s decisions depend on others’ decisions, the more the PM’s initial delay spreads throughout the economy and the more disruptive that delay becomes.

Delaying policy decisions also allows the costs of indecision (e.g., deaths from uncontrolled exposure to COVID-19) to accumulate. The PM must trade these costs off against the benefits of waiting for more information. These benefits may appear large ex ante but turn out to be small ex post. Moving into the second period may not change the PM’s preferences over policy options if the new information merely confirms what was already known in the first period. This confirmation may make the PM believe that they accrued the costs of delaying action unnecessarily. Thus, whereas the “commit early” approach induces regret when initial estimates turn out to be wrong, the “wait and learn” approach induces regret when initial estimates turn out to be right.

Although committing early destroys some real options, it may create other real options in their place. For example, NZ’s early lockdown likely prevented the virus from overburdening our healthcare providers, giving them time to plan and prepare for a range of future COVID-19 scenarios. Similarly, our government’s agreement to let the Reserve Bank buy government bonds gave the Bank more flexibility to provide monetary stimulus if the need arises. In both cases, taking early, decisive action opened paths that may have been unavailable if our government had chosen to wait and learn.

Ultimately, the PM should adopt the “commit early” approach whenever the net benefits of acting early exceed the net benefits of waiting. However, valuing these net benefits requires

estimating the likelihood and quality of future states,
choosing a (politically defensible) rate at which to discount future payoffs, and
valuing changes in net optionality.

These three requirements raise the complexity of the PM’s choice problem and the cognitive cost of finding its solution. Such bounds on the PM’s rationality may lead to sub-optimal policy decisions, both ex ante and ex post.

Thanks to my dad for inspiring this post and to Arthur Grimes for his comments.

Transitivity in positive correlations

Fri, 24 Apr 2020 00:00:00 +0000

Let $X$, $Y$ and $Z$ be random variables. Suppose that both $X$ and $Y$ are positively correlated with $Z$. Are $X$ and $Y$ positively correlated?

The answer to this question is “not necessarily.” To see why, let $\rho\in[-1,1]$ be a constant, and define the random variables $$X=\rho Z+W \tag{1}$$ and $$Y=\rho Z-W \tag{2}$$ with independent $Z\sim N(0,1)$ and $W\sim N(0,1-\rho^2)$.¹ Then $W$, $X$, $Y$ and $Z$ have zero means, while $X$, $Y$ and $Z$ have unit variances. It follows that $$\begin{align} \newcommand{\E}{\mathrm{E}} \newcommand{\Corr}{\mathrm{Corr}} \newcommand{\Cov}{\mathrm{Cov}} \newcommand{\Var}{\mathrm{Var}} \Corr(X,Y) &= \frac{\Cov(X,Y)}{\sqrt{\Var(X)}\sqrt{\Var(Y)}} \\ &= \Cov(X,Y) \\ &= \E[XY]-\E[X]\E[Y] \\ &= \E[XY], \end{align}$$ and similarly $\Corr(X,Z)=\E[XZ]$ and $\Corr(Y,Z)=\E[YZ]$. Now $$\begin{align} \E[XZ] &= \E[(\rho Z+W)Z] \\ &= \rho\E[Z^2]+\E[WZ] \\ &= \rho\Var(Z)+\rho\E[Z]^2+\Cov(W,Z)+\E[W]\E[Z] \\ &= \rho \end{align}$$ because $W$ and $Z$ are independent. A similar argument yields $\E[YZ]=\rho$. Finally, substituting $(1)$ into $(2)$ so as to eliminate $W$ gives $$Y=2\rho Z-X,$$ from which we obtain $$\begin{align} \Corr(X,Y) &= \E[XY] \\ &= \E[X(2\rho Z-X)] \\ &= 2\rho\E[XZ]-\E[X^2] \\ &= 2\rho\E[XZ]-\Var(X)-\E[X]^2 \\ &= 2\rho^2-1. \end{align}$$ Thus, if $\rho\in(0,1/\sqrt{2})$ then $X$ and $Y$ share a negative correlation even though both are correlated positively with $Z$. Intuitively, if $\rho$ is sufficiently small then the negative correlation between the error terms $W$ and $-W$ dominates the positive correlations between $X$ and $Z$, and $Y$ and $Z$.
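A quick simulation can illustrate this result. The sketch below draws from the construction in $(1)$ and $(2)$ and computes the three sample correlations. The value $\rho=0.5$, the sample size, and the seed are arbitrary choices made for illustration.

```r
set.seed(0)
rho <- 0.5
n <- 1e5

# Independent draws of Z ~ N(0, 1) and W ~ N(0, 1 - rho^2)
z <- rnorm(n)
w <- rnorm(n, sd = sqrt(1 - rho^2))

# Construct X and Y as in equations (1) and (2)
x <- rho * z + w
y <- rho * z - w

# Sample correlations, to compare with rho, rho, and 2 * rho^2 - 1
c(xz = cor(x, z), yz = cor(y, z), xy = cor(x, y))
```

With $\rho=0.5$ the first two correlations should be close to 0.5 and the third close to $2\rho^2-1=-0.5$, up to sampling error.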

Here $N(\mu,\sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$. ↩︎

Greedy Pig strategies

Sat, 18 Apr 2020 00:00:00 +0000

Greedy Pig is a game used to teach probability and statistics to primary school students. The game comprises a set of rounds in which players roll a fair six-sided die until they choose to stop, in which case their score for that round is equal to the sum of rolled values, or they roll a one, in which case their score for that round is zero. The player with the highest total score across all rounds wins.

Since the scores obtained in each round are independent, players can maximise their total score across all rounds by maximising their score in each round independently. One strategy is to commit to rolling the die $n$ times in each round, where $n$ is chosen to maximise the expected resulting score. We can make this choice as follows. First, let $X_k$ be the outcome of the $k^\text{th}$ die roll. This outcome has probability distribution $$\Pr(X_k=x)=\begin{cases}1/6&\text{if}\ x\in\{1,2,3,4,5,6\}\\ 0&\text{otherwise}.\end{cases}$$ Next, let $1_{X_k>1}$ be the indicator variable for the event in which $X_k>1$. Then the score after $n$ rolls is given by $$S_n=\sum_{k=1}^nX_k\pi_n,$$ where $$\pi_n=\prod_{k=1}^n1_{X_k>1}$$ is the indicator variable for the event in which all of the first $n$ rolls exceed unity. Now, by the linearity of the expectation operator and the law of total expectation, the score $S_n$ has expected value $$\begin{align} \mathrm{E}[S_n] &= \sum_{k=1}^n\mathrm{E}[X_k\pi_n] \\ &= \sum_{k=1}^n\left(\mathrm{E}[X_k\pi_n\,\vert\,\pi_n=1]\Pr(\pi_n=1)+\mathrm{E}[X_k\pi_n\,\vert\,\pi_n=0]\Pr(\pi_n=0)\right) \\ &= \sum_{k=1}^n\mathrm{E}[X_k\,\vert\,\pi_n=1]\Pr(\pi_n=1). \end{align}$$ Since $\pi_n=1$ if and only if $X_k>1$ for each $k\in\{1,2,\ldots,n\}$, and since die rolls are independent, we have $$\begin{align} \mathrm{E}[X_k\,\vert\,\pi_n=1] &= \mathrm{E}[X_k\,\vert\,X_k>1] \\ &= \frac{2+3+4+5+6}{5} \\ &= 4 \end{align}$$ and $$\Pr(\pi_n=1)=\left(\frac{5}{6}\right)^n.$$ Hence $$\mathrm{E}[S_n]=4n\left(\frac{5}{6}\right)^n$$ for each $n$, which obtains its maximum value of 8.04 when $n\in\{5,6\}$. Therefore, players who commit to a fixed number of rolls should commit to five or six rolls in each round to maximise their expected score.
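The expected score $4n(5/6)^n$ is easy to tabulate directly. The sketch below evaluates it for one to ten rolls and picks out the maximisers. The function name `expected_score_fixed` and the range of $n$ are assumptions made for illustration.

```r
# Expected round score from committing to n rolls: 4 * n * (5 / 6)^n
expected_score_fixed <- function(n) 4 * n * (5 / 6)^n

n <- 1:10
scores <- expected_score_fixed(n)
round(scores, 2)

# Values of n that attain the maximum (up to floating-point tolerance)
best <- n[abs(scores - max(scores)) < 1e-9]
best
```

Comparing against the maximum with a small tolerance, rather than with exact equality, guards against rounding differences between the values at $n=5$ and $n=6$.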

Another strategy is to continue rolling until reaching some target score $S^*$. This strategy allows players to respond to their realised sequence of rolls. Intuitively, players who realise a run of high-value rolls have more to lose by rolling again and so may be less willing to do so. To determine the value of $S^*$, let $Y_k$ denote the payoff from the $k^\text{th}$ roll and notice that this payoff has expected value $$\begin{align} \mathrm{E}[Y_k] &= \mathrm{E}[X_k\,\vert\,X_k>1]\Pr(X_k>1)-S_{k-1}\Pr(X_k=1) \\ &= \frac{20-S_{k-1}}{6}. \end{align}$$ Thus, rolling again delivers a positive expected payoff if and only if $S_{k-1}<20$, and so players seeking to maximise their expected score should stop rolling when their score reaches $S^*=20$. This argument also clarifies why both $n=5$ and $n=6$ maximise $\mathrm{E}[S_n]$: players with a non-zero score after five rolls have a conditional expected score of $\mathrm{E}[S_5\,\vert\,\pi_5=1]=20$, so the expected gain in score for such players from a sixth roll is zero.
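The expected round score under a target-based rule can also be computed exactly, without simulation, by working backwards from the stopping scores. The sketch below is one way to do so. The function name `expected_score_target` is an assumption, and the recursion treats each possible current score as a state.

```r
# Exact expected round score from rolling until the score reaches `target`
expected_score_target <- function(target) {
  max_score <- target + 5              # largest reachable score: (target - 1) + 6
  v <- numeric(max_score + 1)          # v[s + 1] = expected final score given current score s
  v[(target:max_score) + 1] <- target:max_score  # the player stops once score >= target
  for (s in (target - 1):0) {
    # A one (probability 1/6) zeroes the score; rolls of 2 to 6 move to score s + x
    v[s + 1] <- sum(v[s + (2:6) + 1]) / 6
  }
  v[1]
}

expected_score_target(20)
```

Because the stopping rule adapts to realised rolls, `expected_score_target(20)` should be slightly larger than the expected score of $4\times5\times(5/6)^5$ from committing to five rolls.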

We can compare the “roll five times” and “stop at 20” strategies via simulation. First, define a function simulate_strategy that takes as arguments either a fixed number of rolls n or a target score t, and simulates the player’s score from adopting their chosen strategy:

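simulate_strategy <- function(n = NULL, t = NULL) {
  # At least one stopping rule is needed: a fixed number of rolls `n` or a target score `t`
  if (is.null(n) && is.null(t)) stop('`n` or `t` must be non-NULL')
  score <- 0
  k <- 0  # Number of rolls so far that were not ones
  done <- FALSE
  while (!done) {
    x <- sample(1:6, 1)
    if (x == 1) {
      # Rolling a one ends the round with a score of zero
      score <- 0
      done <- TRUE
    } else {
      score <- score + x
      k <- k + 1
      # `n` takes precedence if both arguments are supplied
      done <- if (!is.null(n)) k >= n else score >= t
    }
  }
  score
}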

Next, define two wrapper functions for simulating each strategy separately:

simulate_n <- function(n) simulate_strategy(n = n)
simulate_t <- function(t) simulate_strategy(t = t)

Finally, we can simulate 10,000 games using each strategy and store the realised scores:

set.seed(0)
scores_n <- sapply(rep(5, 1e4), simulate_n)
scores_t <- sapply(rep(20, 1e4), simulate_t)

The “stop at 20” strategy delivers a mean score of 8.13, which is 1.69% higher than the mean score delivered by the “roll five times” strategy. However, the “stop at 20” strategy is also 4.12% more likely to deliver a score of zero than the “roll five times” strategy. We can see this by plotting the distributions of simulated scores delivered by the two strategies:

library(dplyr)
library(ggplot2)
library(tidyr)

tibble(`Roll five times` = scores_n, `Stop at 20` = scores_t) %>%
  gather(Strategy, Score) %>%
  count(Strategy, Score) %>%
  ggplot(aes(Score, n / 1e4)) +
  geom_col(aes(fill = Strategy), alpha = 0.75, position = 'dodge') +
  labs(y = 'Relative frequency',
       title = 'Comparing Greedy Pig strategies',
       subtitle = 'Distribution of scores across 10,000 simulated games')

The distribution of non-zero scores under the “stop at 20” strategy is asymmetric about its conditional mean of 21.69, and is bounded below by 20 and above by 25. In contrast, the distribution of non-zero scores under the “roll five times” strategy is symmetric about its conditional mean of 20, and is bounded below by 10 and above by 30.

The “roll five times” and “stop at 20” strategies are heuristics for maximising players’ scores in each round. These heuristics may be sub-optimal in some situations. For example, if one player remains in the last round and has accumulated enough total score to win the game then they should always stop rolling.

Generating random graphs with communities

Tue, 07 Apr 2020 00:00:00 +0000

Suppose I want to generate some random graphs that exhibit community structure. For example, I might be interested in simulating how information or diseases spread in social networks, and I suspect—but lack data to confirm—that people sort into communities based on their personal and professional interests.

One approach is to use the stochastic block model. In this model, each vertex belongs to one of $r$ disjoint communities $C_1,C_2,\ldots,C_r$, and vertices $u\in C_i$ and $v\in C_j$ are adjacent with probability $p_{ij}$. Varying $p_{ij}$ across $(i,j)$ pairs varies the level of connectivity within and between communities. For example, choosing $$p_{ij}=\begin{cases} p & \text{if}\ i=j \\ q & \text{otherwise} \end{cases}$$ for some probabilities $p$ and $q<p$ delivers random graphs that tend to contain more edges within communities than between communities. We can simulate this special case—known as the “planted partition model” (PPM)—in R using the sample_ppm function defined below.

library(igraph)

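sample_ppm <- function(memb, p, q) {
  # Enumerate all unordered vertex pairs (one pair per row)
  mat <- t(combn(seq_along(memb), 2))
  # Adjacency probability: p within communities, q between communities
  prob <- c(q, p)[1 + (memb[mat[, 1]] == memb[mat[, 2]])]
  # Keep each pair independently (drop = FALSE keeps el a matrix)
  el <- mat[which(runif(nrow(mat)) < prob), , drop = FALSE]
  graph_from_edgelist(el, directed = FALSE)
}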

sample_ppm takes as arguments a vector memb of community memberships, and the edge probabilities $p$ and $q$. The function constructs a matrix mat of vertex pairs, determines the probabilities that these pairs are adjacent, and uses these probabilities to create a random edge list el and corresponding random graph.

For example, let’s simulate a PPM random graph with 50 vertices and four communities, and with edge probabilities $p=1/3$ and $q=0.01$:

set.seed(0)
memb <- sample(1:4, 50, replace = TRUE)
G <- sample_ppm(memb, 1/3, 0.01)
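
As a rough sanity check on these parameter choices, we can compute how many edges the model should produce on average within and between communities. This is a minimal base-R sketch; `expected_edges` is an illustrative helper rather than part of igraph.

```r
# Expected edge counts in a planted partition model
expected_edges <- function(memb, p, q) {
  sizes <- as.vector(table(memb))     # community sizes
  n_within <- sum(choose(sizes, 2))   # vertex pairs inside a community
  n_between <- choose(length(memb), 2) - n_within
  c(within = n_within * p, between = n_between * q)
}

# Same memberships and probabilities as in the example above
set.seed(0)
memb <- sample(1:4, 50, replace = TRUE)
expected_edges(memb, 1/3, 0.01)
```

Comparing the two expected counts shows how strongly the choice of $p$ and $q$ favours edges within communities.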

We can visualise G using ggraph:

library(ggraph)

G %>%
  ggraph() +
  geom_edge_link0(colour = 'grey75') +
  geom_node_point(aes(col = factor(memb)), show.legend = FALSE) +
  scale_colour_brewer(palette = 'Set1') +
  theme_void()

The communities in G—identified by vertices’ colours—contain many internal edges but few external edges. Thus, if informed or infected vertices spread information or disease among their neighbours with equal probabilities, then we would expect faster diffusion within communities than between communities.

Insurance and saving

Fri, 03 Apr 2020 00:00:00 +0000

The seminal model of insurance demand (Arrow, 1963; Mossin, 1968) describes a consumer who chooses the level of coverage $I^*$ that maximises their expected utility $$\phi(I)=(1-p)u(Y-\pi I)+pu(Y-\pi I-L+I),$$ where $p$ is the probability of suffering a binary loss of fixed size $L$, $Y$ is the consumer’s riskless income, $u$ is their increasing and concave utility function, and $\pi$ is the per-unit price of insurance. In this model, the consumer buys full insurance (i.e., chooses $I^*=L$) if and only if the premium is actuarially fair (i.e., if $\pi=p$), and their demand for insurance decreases with income if their absolute risk aversion decreases with wealth.
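
The full-insurance result can be illustrated numerically. The sketch below assumes logarithmic utility and made-up parameter values ($Y=100$, $L=50$, $p=0.1$); none of these come from Arrow or Mossin. The price is called `prem` to avoid masking R's built-in constant `pi`.

```r
# Illustrative (assumed) parameters: income, loss size, loss probability
Y <- 100; L <- 50; p <- 0.1

# Expected utility from coverage I at unit price prem, with u = log
phi <- function(I, prem) {
  (1 - p) * log(Y - prem * I) + p * log(Y - prem * I - L + I)
}

# Coverage that maximises expected utility at a given unit price
optimal_coverage <- function(prem) {
  optimize(phi, interval = c(0, 2 * L), prem = prem,
           maximum = TRUE, tol = 1e-8)$maximum
}

I_fair <- optimal_coverage(p)          # actuarially fair premium
I_loaded <- optimal_coverage(1.5 * p)  # premium with a 50% loading
c(fair = I_fair, loaded = I_loaded)
```

Under these assumptions, the fair premium should deliver coverage equal to $L$, while the loaded premium should deliver partial coverage.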

A more realistic model would contain at least two periods: one in which the consumer buys insurance and one in which they might suffer an insurable loss. However, in a two-period model, the consumer suffers a form of market incompleteness: they can buy insurance to shift income into the future, but they cannot do the opposite nor vary their net income in the future no-loss state.

This market incompleteness can be resolved by allowing the consumer to save or borrow at the riskless interest rate. Then they can save or borrow to smooth income across time, and buy insurance to smooth income across future states of nature. In particular, they can choose the level of coverage $I^*$ and savings commitment $S^*$ that maximise their expected utility $$\begin{align} \psi(I,S) &= u(Y_1-\pi I-S) \\ &\quad+\delta[(1-p)u(Y_2+(1+R)S)+pu(Y_2+(1+R)S-L+I)], \end{align}$$ where $Y_1$ and $Y_2$ are the consumer’s riskless incomes in the first and second periods, $\delta\in(0,1]$ is their intertemporal discount factor, and $R$ is the riskless interest rate. In this two-period model, the consumer buys full insurance if and only if $$\pi=\frac{p}{1+R},$$ which is the two-period equivalent of the actuarially fair premium rate. One can also show that if the consumer cannot save then $I^*$ is increasing in $Y_1$ and decreasing in $Y_2$, but if they can save then increases in $Y_1$ and $Y_2$ shift $I^*$ in the same direction as they shift the consumer’s absolute risk aversion. Intuitively, if the consumer cannot save and they want to shift income into the future then the only way to do so is to buy more insurance. In contrast, if the consumer can save then they can use their savings commitment to smooth increases in income across time, and adjust their insurance demand according to whether such increases make them more or less absolute risk averse.
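
The full-insurance condition can be derived as follows. This is a sketch that assumes $u$ is differentiable and strictly concave, that $p<1$, and that the optimum is interior. The symbols $c_1$, $c_N$ and $c_L$ are shorthand introduced here: $c_1=Y_1-\pi I-S$ is first-period consumption, and $c_N=Y_2+(1+R)S$ and $c_L=c_N-L+I$ are second-period consumption in the no-loss and loss states. The first-order conditions for $I$ and $S$ are $$\begin{align} \pi u'(c_1) &= \delta p\,u'(c_L) \\ u'(c_1) &= \delta(1+R)\left[(1-p)\,u'(c_N)+p\,u'(c_L)\right]. \end{align}$$ Substituting the second condition into the first gives $$\pi(1+R)\left[(1-p)\,u'(c_N)+p\,u'(c_L)\right]=p\,u'(c_L).$$ If $\pi=p/(1+R)$ then this equation simplifies to $u'(c_N)=u'(c_L)$, which implies $c_N=c_L$ and hence $I^*=L$. Conversely, if $I^*=L$ then $c_N=c_L$ and the equation reduces to $\pi(1+R)=p$.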

Optimal training loads

Mon, 30 Mar 2020 00:00:00 +0000

Suppose I’m training for an upcoming race. I want to choose the training load that maximises my expected performance on race day. The harder I train, the better my performance will be but the more likely I am to injure myself. How should I balance this trade-off between better performance and greater risk of injury?

We can model this choice problem as follows. Let $t\in[0,1]$ represent my training load and $a\in\mathbb{R}$ my natural ability.¹ My performance on race day is some function $f(t,a)$ of $t$ and $a$. I assume that this function is increasing and concave in $t$ (so that there are positive but diminishing returns to training), and increasing in $a$.

I can’t compete if I get injured, which occurs with some probability $p(t,r)$ that depends on my training load and my natural resistance to injury $r\in\mathbb{R}$. I assume that $p$ is increasing and convex in $t$ (so that training increases my likelihood of injury at an increasing rate), and decreasing in $r$.

My objective is to choose the training load $t^*$ that maximises my expected performance² $$\psi(t)=(1-p(t,r))\,f(t,a).$$ My assumptions on the shapes of $f$ and $p$ imply that $\psi$ is concave in $t$. Therefore, the unique optimal training load $t^*$ satisfies the first-order condition (FOC) $$\begin{align} 0 &= \psi'(t^*) \\ &= -p_t(t^*,r)\,f(t^*,a)+(1-p(t^*,r))\,f_t(t^*,a), \end{align}$$ where $\psi'$ denotes the derivative of $\psi$ with respect to $t$, and where $p_t$ and $f_t$ denote the partial derivatives of $p$ and $f$ with respect to $t$. The FOC can be rewritten as $$(1-p(t^*,r))\,f_t(t^*,a)=p_t(t^*,r)f(t^*,a),$$ which shows that I should keep training until the marginal benefit of improved performance (the left-hand side) equals the marginal cost of injury becoming more probable (the right-hand side).

I can’t determine the value of $t^*$ without further assumptions on $f$ and $p$. However, I can determine the relationship between $t^*$ and the parameters $a$ and $r$. Since $\psi''(t)<0$ for all feasible $t$, the implicit function theorem (IFT) implies that $$\mathrm{sign}\frac{\partial t^*}{\partial \theta}=\mathrm{sign}\frac{\partial \psi'(t^*)}{\partial \theta}$$ for each element $\theta$ of the symbol set $\{a,r\}$. Now $$\frac{\partial \psi'(t^*)}{\partial a}=-p_t(t^*,r)\,f_a(t^*,a)+(1-p(t^*,r))\,f_{ta}(t^*,a),$$ where $f_a$ and $f_{ta}$ denote the partial derivatives of $f$ and $f_t$ with respect to $a$, and $$\frac{\partial \psi'(t^*)}{\partial r}=-p_{tr}(t^*,r)\,f(t^*,a)-p_r(t^*,r)\,f_t(t^*,a),$$ where $p_{tr}$ and $p_r$ denote the partial derivatives of $p_t$ and $p$ with respect to $r$. By Young’s theorem, the mixed partials $f_{ta}$ and $p_{tr}$ satisfy $$f_{ta}(t,a)=\frac{\partial}{\partial t}\left(\frac{\partial f(t,a)}{\partial a}\right)$$ and $$p_{tr}(t,r)=\frac{\partial}{\partial t}\left(\frac{\partial p(t,r)}{\partial r}\right)$$ for all feasible $t$, $a$ and $r$. Thus, it seems reasonable to assume that $f_{ta}(t,a)\le0$ and $p_{tr}(t,r)\le0$, which mean that training washes out the benefits of natural ability and resistance to injury. These assumptions, together with the IFT, imply that $t^*$ is decreasing in $a$ and increasing in $r$—that is, I should train harder if I become less naturally able or more resistant to injury.
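
These comparative statics can be illustrated numerically. The sketch below assumes particular functional forms, $f(t,a)=a+\sqrt{t}$ and $p(t,r)=t^2e^{-r}$ with $a>0$ and $r\ge0$, which satisfy the shape assumptions above (including $f_{ta}\le0$ and $p_{tr}\le0$) but are otherwise arbitrary.

```r
# Assumed functional forms (illustrative only)
f <- function(t, a) a + sqrt(t)     # performance
p <- function(t, r) t^2 * exp(-r)   # injury probability

# Expected performance
psi <- function(t, a, r) (1 - p(t, r)) * f(t, a)

# Optimal training load on [0, 1]
t_star <- function(a, r) {
  optimize(psi, interval = c(0, 1), a = a, r = r,
           maximum = TRUE, tol = 1e-8)$maximum
}

c(baseline = t_star(1, 0),
  more_able = t_star(2, 0),
  more_resistant = t_star(1, 1))
```

Under these assumptions, raising $a$ should lower the optimal training load and raising $r$ should raise it, in line with the signs derived above.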

For example, $t$ could represent the proportion of time before the race that I spend training. ↩︎
I assume that $f$ and $p$ are twice continuously differentiable so that $\psi$ is too. ↩︎

Voting along party lines

Thu, 26 Mar 2020 00:00:00 +0000

Later this year, New Zealanders will vote in a referendum on whether to legalise voluntary euthanasia under the conditions specified in the End of Life Choice Bill (hereafter “the Bill”). Members of Parliament (MPs) read the Bill three times, each time holding a conscience vote on whether to progress the Bill towards becoming legislation. The table below presents the percentage and fraction of MPs who voted in favour of the Bill, separated by political party and reading.¹

Party	First reading	Second reading	Third reading
Act	100% (1/1)	100% (1/1)	100% (1/1)
Green	100% (8/8)	100% (8/8)	100% (8/8)
Independent	100% (1/1)	100% (1/1)	100% (1/1)
Labour	80% (37/46)	72% (33/46)	72% (33/46)
National	36% (20/55)	33% (18/55)	29% (16/55)
NZ First	100% (9/9)	100% (9/9)	100% (9/9)

Most MPs in the coalition government voted in favour, including all MPs from the Green Party and NZ First. In the Bill’s final reading, 72% of Labour MPs followed party leader Jacinda Ardern’s vote in favour, while 71% of National MPs followed party leader Simon Bridges’ vote to oppose. Overall, about a third of Labour and National MPs voted against their party lines.²

New Zealand uses a mixed member proportional electoral system: voters submit votes for a political party and for a representative of their local constituency. Consequently, some “list” MPs enter parliament because they are ranked highly within a party that received many votes rather than because they were the preferred candidate among their local constituents. The table below shows that Labour and National list MPs were more likely to vote along party lines than non-list MPs in the Bill’s third reading.

Party	List MP adherence	Non-list MP adherence
Labour	88% (15/17)	62% (18/29)
National	80% (12/15)	68% (27/40)
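
The percentages in this table can be reproduced from the vote counts. The sketch below also pools the two parties, which is an extra calculation not reported above; the data frame and its column names are illustrative.

```r
# Third-reading votes along party lines, copied from the table above
votes <- data.frame(
  party = c('Labour', 'Labour', 'National', 'National'),
  type = c('List', 'Non-list', 'List', 'Non-list'),
  adhered = c(15, 18, 12, 27),
  total = c(17, 29, 15, 40)
)

# Adherence rate (%) by party and MP type
votes$rate <- round(100 * votes$adhered / votes$total)

# Adherence rate (%) by MP type, pooling Labour and National
pooled <- aggregate(cbind(adhered, total) ~ type, data = votes, FUN = sum)
pooled$rate <- round(100 * pooled$adhered / pooled$total)

votes
pooled
```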

The difference in list and non-list MPs’ adherence to party lines has at least two explanations. First, non-list MPs have non-party reasons to be in government—namely, to serve their local constituents—and so may accept weaker ideological matches than list MPs when self-selecting into party affiliations. This weaker matching would reduce the ideological polarisation and inertia among non-list MPs relative to list MPs. Indeed, all of the MPs who changed their votes between the Bill’s first and third readings were non-list MPs.

Second, list MPs have stronger incentives to signal loyalty to their party because they cannot rely on support from local constituents to get elected. If list MPs consistently oppose their leaders then they may be demoted within their parties and, consequently, become less likely to re-enter parliament at the next election. Thus, to the extent that MPs want to maximise their chances of re-election, list MPs may be more willing than non-list MPs to ignore their conscience and vote along party lines.

It would be interesting to separate the ideological sorting and signalling motives that drive greater adherence among list MPs. One strategy could be to track individual MPs across votes and governments, and analyse whether their propensity to vote along party lines is greater when they are list MPs than when they are non-list MPs. However, I can’t find any up-to-date vote data online and don’t particularly want to create them by trawling through decades’ worth of Hansard documents.³ Perhaps one of my readers is up for the challenge?

The data used in this post are available here. ↩︎
This overlap in preferences among Labour and National MPs reflects the ideological overlap between the two parties at the centre of the political spectrum. ↩︎
There was an online database of conscience votes among New Zealand MPs, but the database was shut down in late 2019 and hadn’t been updated since 2012. ↩︎

Matching runners

Sat, 21 Mar 2020 00:00:00 +0000

Running in pairs (and, more generally, in groups) can be more rewarding than running alone. Running buddies can motivate each other, share the mental load of maintaining pace, and provide competition and accountability.

The main problem with running buddies is that they can be hard to find. Not everyone is a runner, and runners vary in their abilities and training goals. Moreover, these abilities and goals are mostly unobservable by other runners searching for a buddy. This unobservable variation creates “matching frictions” that prevent runners from sorting into “optimal” pairs.

If prospective running buddies could observe each other’s abilities and training goals then they could form preferences over whom they want to be paired with.¹ Runners could rank potential buddies from most to least preferred, and submit these rankings to a central match-maker (e.g., a team coach) whose task would be to partition the (assumedly even-sized) set of $2n$ runners into $n$ pairs. The socially optimal partition $\mathcal{P}^*$ would minimise the sum $$S(\mathcal{P})=\sum_{\{i,j\}\in\mathcal{P}}(x_{ij}+x_{ji}),$$ where $x_{ij}$ is the rank that runner $i$ assigns to potential buddy $j$. Minimising $S(\mathcal{P})$ would ensure that, on average, runners are paired with their most preferred buddies.

Let $X=(x_{ij})$ be the matrix of preference rankings and let $Y=X+X^T$. One way to find $\mathcal{P}^*$ would be to choose $2n$ entries of $Y$ such that (a) the sum of the chosen entries is minimised, and (b) each row and column of $Y$ contains exactly one chosen entry.² This choice problem is equivalent to the (balanced) assignment problem, and can be solved using the Hungarian algorithm or via linear programming.

The socially optimal partition $\mathcal{P}^*$ of the set of runners into pairs may be “unstable:” there may exist two runners who would rather run with each other than with their assigned buddies.³ For example, suppose there is a set $\{a, b, c, d\}$ of four runners with preference rankings captured by the matrix $$X=\begin{bmatrix} & 1 & 2 & 3 \\ 2 & & 1 & 3 \\ 1 & 2 & & 3 \\ 1 & 2 & 3 & \end{bmatrix}.$$ The corresponding matrix $Y=X+X^T$ of bidirectional sums is $$Y=\begin{bmatrix} & 3 & 3 & \underline{4} \\ 3 & & \underline{3} & 5 \\ 3 & \underline{3} & & 6 \\ \underline{4} & 5 & 6 & \end{bmatrix},$$ where the underlined entries correspond to the socially optimal partition $\mathcal{P}^*=\{\{a,d\},\{b,c\}\}$ with $S(\mathcal{P}^*)=7$ (the underlined entries sum to 14 because each pair appears twice in $Y$). This partition is unstable because runner $a$ would prefer to run with $c$ than $d$, and runner $c$ would prefer to run with $a$ than $b$. Runners $a$ and $c$ would ignore the match-maker and become buddies, resulting in a socially inferior partition $\mathcal{P}_*=\{\{a,c\},\{b,d\}\}$ with $S(\mathcal{P}_*)=8$.
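
With only four runners, the example can be checked by brute force. The sketch below enumerates the three possible partitions, computes $S$ for each, and lists the pairs of runners who strictly prefer each other to their assigned buddies. The helper names (`S`, `buddy`, `blocking_pairs`) are illustrative, and the enumeration would not scale to large groups.

```r
runners <- c('a', 'b', 'c', 'd')

# Preference rankings: X[i, j] is the rank runner i assigns to runner j
X <- matrix(c(NA, 1, 2, 3,
              2, NA, 1, 3,
              1, 2, NA, 3,
              1, 2, 3, NA),
            nrow = 4, byrow = TRUE, dimnames = list(runners, runners))
Y <- X + t(X)

# The three ways to split four runners into two pairs
partitions <- list(
  list(c('a', 'b'), c('c', 'd')),
  list(c('a', 'c'), c('b', 'd')),
  list(c('a', 'd'), c('b', 'c'))
)

# S(P): sum of bidirectional ranks, counting each pair once
S <- function(P) sum(sapply(P, function(pair) Y[pair[1], pair[2]]))

# Buddy assigned to runner i under partition P
buddy <- function(P, i) {
  pair <- Filter(function(pair) i %in% pair, P)[[1]]
  setdiff(pair, i)
}

# Pairs whose members both strictly prefer each other to their buddies
blocking_pairs <- function(P) {
  Filter(function(pr) {
    i <- pr[1]
    j <- pr[2]
    buddy(P, i) != j &&
      X[i, j] < X[i, buddy(P, i)] &&
      X[j, i] < X[j, buddy(P, j)]
  }, combn(runners, 2, simplify = FALSE))
}

sapply(partitions, S)
blocking_pairs(partitions[[3]])
```

Here `S` counts each pair once, as in the definition of $S(\mathcal{P})$, so its values are half the corresponding sums of symmetric entries in $Y$.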

If the socially optimal partition of runners into pairs is unstable then the match-maker would need to prevent, or at least discourage, so-called “blocking pairs” from deviating from the optimum. For example, the match-maker could restrict runners’ access to training areas (e.g., running tracks and trails) so that no blocking pairs have concurrent access. However, such restrictions may be detrimental to runners’ training and camaraderie, and, consequently, reduce social welfare.

For example, if I wanted to improve my pace then I might prefer to run with someone slightly faster than me so that I can try to match their speed without them racing ahead of me. ↩︎
Equivalently, one could choose $n$ entries in the upper-right triangle of $Y$ that satisfy criteria (a) and (b). This works because $Y$ is symmetric. If $y_{ij}$ is chosen when minimising sums over $Y$ then $y_{ji}$ is chosen when minimising sums over $Y^T$. But $Y=Y^T$, so the sets of chosen entries when minimising over $Y$ and $Y^T$ must be equal. Thus, $y_{ij}$ and $y_{ji}$ must belong to both sets, and so the lower-left triangle can be ignored. ↩︎
On the other hand, I conjecture that every stable partition is socially optimal. ↩︎

Spotify Premium pricing

Tue, 17 Mar 2020 00:00:00 +0000

Spotify offers two music and podcast streaming services: a free, online-only service, and a paid “Premium” service with extra features like unlimited skips and offline playback. Spotify earns some revenue from serving ads to free users, but most of its revenue (about 88% in 2019Q4) comes from Premium subscriptions. This revenue needs to cover Spotify’s fixed and variable costs, which include the costs of maintaining its servers and of paying royalties for streaming artists’ music.

Spotify’s profit function looks something like $$\pi(p)=n\theta(p)(p-v_1)+n(1-\theta(p))(a-v_2)-f,$$ where $p$ is the price of subscribing to Spotify Premium, $n$ is the number of Spotify users, $\theta(p)$ is the price-dependent proportion of these users who pay for Premium, $a$ is the revenue from serving ads to each free user, $v_1$ and $v_2$ are Spotify’s variable costs per Premium and free user, and $f$ is Spotify’s fixed costs. I assume that $\theta(p)$ decreases with $p$ so that Spotify Premium is an ordinary good.

The profit-maximising price $p^*$ satisfies the first-order condition (FOC) $$\begin{align} 0 &= \pi'(p^*) \\ &= n\theta'(p^*)(p^*-v_1)+n\theta(p^*)-n\theta'(p^*)(a-v_2), \end{align}$$ where $\pi'$ and $\theta'$ denote the derivatives of $\pi$ and $\theta$ with respect to $p$. If $a=0$ and $v_1=v_2$ then the FOC can be rewritten as $$\frac{p^*\theta'(p^*)}{\theta(p^*)}=-1,$$ which means that, at $p=p^*$, the demand for Spotify Premium is unit elastic with respect to its price. If free users generate no ad revenue and have the same variable costs per user as Premium subscribers, then Spotify should raise its Premium price until the increased revenue per Premium subscriber exactly offsets the decrease in such subscribers. In contrast, if $a>0$ or if $v_1>v_2$ then Spotify must raise $p^*$ further to decrease $\theta(p^*)$ and avoid the lost ad revenue or increased variable costs from converting too many free users.

Notice that $n$ multiplies every term in the FOC and so cancels out; hence $p^*$ does not change when $n$ changes. In contrast, assuming that the second derivative of $\pi$ with respect to $p$ is negative at $p^*$ (so that $p^*$ is profit-maximising rather than profit-minimising), the implicit function theorem implies that $$\frac{\partial p^*}{\partial a}=\frac{\partial p^*}{\partial v_1}>0>\frac{\partial p^*}{\partial v_2}.$$ In words, the profit-maximising price is increasing in $a$ and $v_1$, and decreasing in $v_2$. Intuitively, if Spotify collects more ad revenue from free users then it can afford to lose some Premium subscribers by raising the Premium price. Likewise, the greater is the difference between $v_1$ and $v_2$, the more expensive it is to serve Premium subscribers relative to free users and so the fewer Premium subscriptions Spotify would prefer to sell.
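
These comparative statics can be illustrated numerically. The sketch below assumes a hypothetical demand curve $\theta(p)=\exp(-p/10)$ and made-up parameter values; nothing here is calibrated to Spotify's actual costs or prices. Under this demand curve the FOC has the closed-form solution $p^*=10+v_1+a-v_2$, which gives a benchmark for the numerical optimum.

```r
# Assumed demand curve (illustrative): share of users who pay for Premium
theta <- function(p) exp(-p / 10)

# Profit at Premium price p; all default parameter values are hypothetical
profit <- function(p, n = 1e6, a = 0, v1 = 1, v2 = 1, f = 0) {
  n * theta(p) * (p - v1) + n * (1 - theta(p)) * (a - v2) - f
}

# Profit-maximising price, found numerically
p_star <- function(...) {
  optimize(profit, interval = c(0, 100), ...,
           maximum = TRUE, tol = 1e-8)$maximum
}

c(baseline = p_star(),
  more_users = p_star(n = 5e6),
  more_ad_revenue = p_star(a = 2),
  higher_v1 = p_star(v1 = 3),
  higher_v2 = p_star(v2 = 3))
```

In the baseline case ($a=0$ and $v_1=v_2$), the optimum sits where demand is unit elastic, which for this demand curve is $p=10$.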

Stanford

Fri, 13 Mar 2020 00:00:00 +0000

I am excited to announce that I will be moving to the United States later this year to pursue a PhD in economics at Stanford University.

Stanford’s economics PhD program ranks among the best in the world. It begins with two years of advanced coursework on microeconomics, macroeconomics, econometrics, and field courses relevant to my academic interests. This coursework will strengthen my technical and research skills, and prepare me for writing a PhD thesis that contributes substantively to the economic research literature.

One topic that interests me is how people overcome uncertainty when forming teams. For example, the students in my cohort face uncertainty about who among the Stanford faculty will be the best supervisor(s) for their eventual theses. Likewise, faculty members face uncertainty about which students will be the best candidates to supervise. Participating in lectures and seminars will help students and faculty estimate their match qualities, leading to more informed and productive matches.

Another topic that interests me is how people share information in networks. For example, my blog posts on information gerrymandering and DeGroot learning use mathematical models to analyse how inter-personal connections influence people’s decisions and beliefs. I am looking forward to learning more about these and related models, and their application to “real-world” social and economic systems.

Uniform sums and Euler's number

Mon, 09 Mar 2020 00:00:00 +0000

Suppose I sample values uniformly at random from the unit interval. How many samples should I expect to take before the sum of my sampled values exceeds unity?¹

Let $N$ be the (random) number of samples taken when the sum first exceeds unity. Then $N$ has expected value $E[N]$ equal to Euler’s number $e\approx2.718282$. This can be verified approximately via simulation:

simulate <- function(run) {
  tot <- 0
  N <- 0
  while (tot < 1) {
    tot <- tot + runif(1)
    N <- N + 1
  }
  N
}

set.seed(0)
mean(sapply(1:1e5, simulate))

## [1] 2.7183

To see why $E[N]=e$, let $(X_i)_{i=1}^\infty$ be an infinite sequence of random variables with uniform distributions over the unit interval. Then the probability that $N$ exceeds any non-negative integer $n$ is $$\Pr(N>n)=\Pr(X_1+X_2+\cdots+X_n<1).$$ Consider the unit (hyper)cube in $\mathbb{R}^n$. Its vertices comprise the origin, the standard basis vectors $e_1,e_2,\ldots,e_n$, and the sums of two or more of these basis vectors. The convex hull of $\{0,e_1,e_2,\ldots,e_n\}$ forms an $n$-simplex with volume $1/n!$. The interior of this simplex is precisely the set $$\{X_1,X_2,\ldots,X_n\in[0,1]:X_1+X_2+\cdots+X_n<1\}.$$ It follows that $\Pr(X_1+X_2+\cdots+X_n<1)=1/n!$ and therefore $\Pr(N>n)=1/n!$ from above. Now $$\Pr(N=n)=\Pr(N>n-1)-\Pr(N>n)$$ for each $n\ge1$. Thus, since $\Pr(N>0)=1$ (and, by convention, $0!=1$), we have $$\begin{align} E[N] &= \sum_{n=1}^\infty n\Pr(N=n) \\ &= \sum_{n=1}^\infty n\left(\Pr(N>n-1)-\Pr(N>n)\right) \\ &= \Pr(N>0)+\sum_{n=1}^\infty\Pr(N>n) \\ &= 1+\sum_{n=1}^\infty\frac{1}{n!} \\ &= \sum_{n=0}^\infty\frac{1}{n!} \\ &= e. \end{align}$$ The final equality comes from evaluating the Maclaurin series expansion of $e^x$ at $x=1$.
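
The key step in this argument, $\Pr(X_1+X_2+\cdots+X_n<1)=1/n!$, can also be checked by simulation. The sketch below compares empirical frequencies with $1/n!$ for $n\le5$; the number of draws and the cut-off at five are arbitrary choices.

```r
set.seed(0)
n_max <- 5
n_sims <- 1e5

# Each row holds n_max independent uniform draws
draws <- matrix(runif(n_sims * n_max), ncol = n_max)

# Estimate Pr(X_1 + ... + X_n < 1) for n = 1, ..., n_max
running_sum <- numeric(n_sims)
empirical <- numeric(n_max)
for (n in seq_len(n_max)) {
  running_sum <- running_sum + draws[, n]
  empirical[n] <- mean(running_sum < 1)
}

theoretical <- 1 / factorial(seq_len(n_max))
round(rbind(empirical, theoretical), 4)

# Partial sums of 1/n! converge quickly to e
sum(1 / factorial(0:10))
```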

Grant Sanderson mentions this problem in this Numberphile video. ↩︎

Triadic closure at the NBER

Wed, 04 Mar 2020 00:00:00 +0000

Fafchamps et al. (2010) describe a model of team formation in which people learn about potential collaborators via existing collaborators. These “referrals” provide information about potential collaborators’ match qualities, allowing people to screen each other and sort into more productive teams. Fafchamps et al. argue, and demonstrate empirically, that this referral mechanism leads to more teams being formed among people who are closer in the collaboration network.

Fafchamps et al.’s referral model implies that triads in collaboration networks should tend to close over time; that is, people should tend to collaborate with others with whom they share common collaborators. One way to measure such closure is via the (global) clustering coefficient, which measures the rate at which pairs of nodes with a common neighbour are also adjacent. For example, in the NBER working paper co-authorship network, about 15% of the pairs of authors who share common co-authors are co-authors themselves. In contrast, we would expect this to happen 0.27% of the time in a random network with the same degree distribution, and 0.04% of the time in a random network with the same number of nodes and edges. Thus, the NBER co-authorship network is much more clustered than would be expected if authors chose co-authors randomly.
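
To make the definition concrete, here is a minimal base-R sketch that computes the global clustering coefficient from an adjacency matrix. The toy network is made up for illustration, and the function name is an assumption rather than the code used for the NBER network.

```r
# Global clustering coefficient from a symmetric 0/1 adjacency matrix
clustering_coefficient <- function(A) {
  A2 <- A %*% A
  closed <- sum(diag(A2 %*% A))     # ordered two-step paths with adjacent endpoints
  total <- sum(A2) - sum(diag(A2))  # ordered two-step paths between distinct nodes
  closed / total
}

# Toy network: a triangle (nodes 1, 2, 3) with node 4 attached to node 3
A <- matrix(0, 4, 4)
A[rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4))] <- 1
A <- A + t(A)

clustering_coefficient(A)
```

In this toy network, three of the five connected triples are closed, so the coefficient is 0.6.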

Another way to measure triadic closure is by computing the rate at which pairs of nodes with common neighbours become adjacent. This method makes sense whenever the network’s density grows over time. Such growth occurs in the NBER co-authorship network through co-authorships of new working papers. The network contains 32,034 pairs of eventual co-authors, 1,861 of which share common co-authors at an earlier stage of the network’s evolution. However, 340,235 of the 342,096 pairs of authors with common co-authors never become co-authors themselves. Thus, only 0.54% of the unclosed triads in the NBER co-authorship network ever become closed.
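
The closure rate can be sketched for two snapshots of a network as follows. This is an illustrative simplification with a made-up toy network: it compares one earlier snapshot with one later snapshot, whereas the NBER calculation tracks the network as it evolves.

```r
# Share of pairs with common neighbours in snapshot A0 (but no edge)
# that are adjacent in the later snapshot A1
closure_rate <- function(A0, A1) {
  common <- (A0 %*% A0) > 0                   # at least one common neighbour
  open <- common & (A0 == 0) & upper.tri(A0)  # not yet adjacent; count pairs once
  sum(open & (A1 == 1)) / sum(open)
}

# Toy example: the path 1-2-3-4, to which edge {1, 3} is later added
A0 <- matrix(0, 4, 4)
A0[rbind(c(1, 2), c(2, 3), c(3, 4))] <- 1
A0 <- A0 + t(A0)
A1 <- A0
A1[1, 3] <- A1[3, 1] <- 1

closure_rate(A0, A1)

# The rate implied by the counts reported above, in percent
round(100 * 1861 / 342096, 2)
```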

How can we reconcile the NBER co-authorship network’s high clustering coefficient with its low triad closure rate?¹ One explanation could be that referrals primarily attract collaborators on current projects rather than potential future projects. Suppose I’m writing a paper with Alice, who suggests that Bob may have some valuable insights on our research, and that Bob and I might work well together. It turns out that Bob does have valuable insights and that we do work well together, and Alice and I decide to make him a co-author on our paper.² We publish our research as an NBER working paper, and Alice, Bob and I appear as a closed triad in the NBER co-authorship network (but never as an unclosed triad).

If intra-project closure is common then we would expect a high clustering coefficient and low triad closure rate in the NBER co-authorship network. The open triads in the network would be the triads for which successful referrals did not occur during co-authorship, and the factors that prevented such referrals may persist after the paper is published.

Researchers in the NBER co-authorship network may collaborate in ways not captured by the network. For example, working papers published in the NBER series must have at least one NBER-affiliated author, so papers written exclusively by non-affiliates are not observed in my data. If co-author referrals primarily lead non-affiliates to collaborate, and if such collaboration does not culminate in NBER working paper publications, then we would expect to observe a low triad closure rate. However, we would also expect a low (perhaps lower than 0.15) clustering coefficient because the triads containing non-affiliates would remain mostly open. ↩︎
Barnett et al. (1988) and Hamermesh (2013) suggest that co-authorship is increasingly used as compensation for colleagues’ research assistance. ↩︎

Degree-preserving randomisation

Mon, 17 Feb 2020 00:00:00 +0000

My previous post used degree-preserving randomisation (DPR) to control for network structure when estimating the effect of edge noise on nodes’ centrality rankings. The idea was that nodes may be connected in ways that amplify or suppress the effects of noise, and randomising nodes’ connections helps to balance these effects by averaging over the network’s possible structures.

DPR can also be used to test whether a network’s structure is significantly different than would be expected for a random network with the same degree distribution. For example, comparing a network’s clustering coefficient to the mean clustering coefficient among a sample of degree-preserving random networks reveals whether the original network is significantly more or less clustered than it would be, on average, if nodes’ connections were random. In contrast to Erdös-Rényi randomisation (ERR)—that is, generating a random network with the same number of nodes and edges—DPR separates variation in degree distributions from variation in other properties observed across sampled random networks.
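
One common way to implement DPR is through repeated “double edge swaps”: pick two edges at random and exchange their endpoints, rejecting swaps that would create self-loops or duplicate edges. The base-R sketch below illustrates the idea on a made-up edge list. It is not necessarily the implementation behind the results in this post, and igraph's `rewire()` with `keep_degseq()` offers a ready-made alternative.

```r
# Degree-preserving randomisation of a simple undirected network
# el: two-column matrix with one row per edge; n_swaps: attempted swaps
randomise_edges <- function(el, n_swaps) {
  key <- function(u, v) paste(pmin(u, v), pmax(u, v))
  keys <- key(el[, 1], el[, 2])
  for (s in seq_len(n_swaps)) {
    rows <- sample(nrow(el), 2)
    e1 <- el[rows[1], ]
    e2 <- el[rows[2], ]
    if (runif(1) < 0.5) e2 <- rev(e2)  # pick one of the two possible rewirings
    # Propose replacing {e1[1], e1[2]} and {e2[1], e2[2]}
    # with {e1[1], e2[2]} and {e2[1], e1[2]}
    if (e1[1] == e2[2] || e2[1] == e1[2]) next  # would create a self-loop
    new_keys <- key(c(e1[1], e2[1]), c(e2[2], e1[2]))
    if (any(new_keys %in% keys)) next           # would duplicate an edge
    el[rows[1], ] <- c(e1[1], e2[2])
    el[rows[2], ] <- c(e2[1], e1[2])
    keys[rows] <- new_keys
  }
  el
}

# Toy edge list on eight nodes (hypothetical)
set.seed(0)
el <- rbind(c(1, 2), c(1, 3), c(1, 4), c(2, 3),
            c(4, 5), c(5, 6), c(6, 7), c(7, 8))
new_el <- randomise_edges(el, 100)

# Node degrees before and after randomisation
rbind(before = tabulate(el, nbins = 8), after = tabulate(new_el, nbins = 8))
```

Each accepted swap removes one edge from, and adds one edge to, each of the four nodes involved, which is why the degree sequence is unchanged.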

Consider, as an example, the Motu working paper co-authorship network. The table below presents the network’s median node degree, global clustering coefficient, and mean geodesic distance. The table also presents the sample means and standard deviations of these properties across 50 degree-preserving and Erdös-Rényi randomisations of the co-authorship network.

Property	Actual value	DPR sample mean (sd)	ERR sample mean (sd)
Median degree	3.00	3.00 (0.00)	7.88 (0.33)
Clustering coefficient	0.52	0.16 (0.01)	0.04 (0.00)
Mean distance	2.72	2.83 (0.03)	2.74 (0.01)

By definition, DPR preserves the degree distribution and, consequently, always delivers the same median degree as the co-authorship network. In contrast, ERR removes the inequality in node degrees (arising, for example, from preferential attachment) and, consequently, delivers median degrees centred on the co-authorship network’s mean degree.

The co-authorship network is about 13 times more clustered than would be expected for an Erdös-Rényi random network with the same number of nodes and edges. Controlling for the degree distribution drops this factor to just over three. In contrast, the mean distance between nodes in the co-authorship network is closer to what we would expect in a comparable Erdös-Rényi random network than in a degree-preserving random network.

Centrality rankings with noisy edge sets

Fri, 14 Feb 2020 00:00:00 +0000

Suppose I want to rank the centralities of nodes in a network. The network’s node set is correct, but its edge set is “noisy” in that it includes some false edges and excludes some true edges. How sensitive to this noise are the rankings of nodes from most to least central?

One way to answer this question empirically is to perturb an observable “true” network by adding and deleting edges randomly. This can be achieved by generating an Erdös-Rényi (ER) random network with the same node set as the true network, and defining a “noisy” network with edge set equal to the symmetric difference of the true and ER networks’ edge sets. This method “swaps” the states (from “present” to “not present”, or vice versa) of the true network’s edges at random. Varying the edge creation probability in the ER network varies the amount of noise in the noisy network’s edge set.
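
The symmetric-difference construction can be sketched in base R using adjacency matrices. The toy “true” network and the function name `add_noise` are assumptions made for illustration, and the Spearman correlation of degree rankings is just one way to compare centrality rankings with and without noise.

```r
# Flip the state of each potential edge independently with probability prob
# A: symmetric 0/1 adjacency matrix of the "true" network
add_noise <- function(A, prob) {
  n <- nrow(A)
  E <- matrix(0, n, n)
  upper <- upper.tri(E)
  E[upper] <- runif(sum(upper)) < prob  # ER network on the same node set
  E <- E + t(E)
  (A != E) * 1                          # symmetric difference of the edge sets
}

# Toy "true" network (hypothetical): a ring of 20 nodes plus a hub at node 1
set.seed(0)
n <- 20
A <- matrix(0, n, n)
A[cbind(1:n, c(2:n, 1))] <- 1
A[1, 3:(n - 1)] <- 1
A <- 1 * ((A + t(A)) > 0)

# Rank correlation between true and noisy degree centralities
noisy <- add_noise(A, 0.05)
cor(rowSums(A), rowSums(noisy), method = 'spearman')
```

Setting `prob = 0` returns the true network unchanged, while `prob = 1` returns its complement.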

I demonstrate this “random edge swapping” method by applying it to the Motu working paper co-authorship network. First, I generate 30 ER networks and 30 corresponding noisy networks for a range of edge swap probabilities. I then compute nodes’ betweenness, degree and PageRank centralities in the co-authorship networks with and without noise, and calculate the Spearman rank correlation between the true and noisy centralities using each of the three measures. Finally, I compute the sample means and 95% confidence intervals for the measure-specific rank correlations across the simulation runs associated with each edge swap probability. I present these means and confidence intervals in the left panel of the plot below. The right panel presents similar information, but with a degree-preserving randomisation of the co-authorship network within each simulation run before introducing noise. This randomisation allows me to control for the effect of the co-authorship network’s structure on my rank correlation estimates.

Increasing the edge swap probability decreases the consistency between the true and noisy centrality rankings for each of the three centrality measures I analyse. Intuitively, the more noise there is in the edge set, the less similar are the true and noisy co-authorship networks, and so the less correlated are the centrality rankings of the nodes in these networks.

Degree centrality rankings are the least sensitive to edge noise. Adding or deleting edges moves the incident nodes up or down the degree rank order, but leaves the relative ranks among non-incident nodes intact. Degree-preserving randomisation, by definition, does not affect nodes’ degree centrality rankings and so does not change the sensitivity of those rankings to noise.

PageRank centrality rankings are more sensitive to edge noise. Since nodes’ PageRank centralities depend on the PageRank centralities of their neighbours, the effect of adding or deleting edges spills over to some non-incident nodes and, consequently, disrupts the PageRank rank order more than the degree rank order. Controlling for network structure increases the influence that degree has on PageRank centrality and, consequently, decreases the sensitivity of PageRank centrality rankings to errant edges.

Betweenness centrality rankings are the most sensitive to edge noise. Adding or deleting edges can create or destroy short(est) paths between nodes, leading to radical changes in betweenness centrality for nodes on these paths. Controlling for network structure suppresses these changes by reducing the initial inequality in betweenness centralities. About 71% of nodes in the true co-authorship network have betweenness centralities equal to zero, whereas 20% of nodes in the randomised networks have betweenness centralities equal to zero. Consequently, nodes in the randomised networks typically have “less betweenness to gain or lose” than nodes in the true network, diminishing the effect of errant edges on betweenness centrality rankings.
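The claim that degree-preserving randomisation leaves degree rankings untouched can be checked on a toy example. The sketch below is not the code behind the plots above: it assumes an undirected network stored as an adjacency matrix and uses double edge swaps, one common way to randomise a network while preserving degrees. The helper name rewire_preserving_degree and the six-node edge list are made up for illustration.

```r
# Hypothetical helper: degree-preserving randomisation via double edge swaps
rewire_preserving_degree <- function(adj, n_swaps) {
  for (s in seq_len(n_swaps)) {
    # List each undirected edge once, using the upper triangle
    edges <- which(adj == 1 & upper.tri(adj), arr.ind = TRUE)
    picked <- edges[sample(nrow(edges), 2), , drop = FALSE]
    u1 <- picked[1, 1]; v1 <- picked[1, 2]
    u2 <- picked[2, 1]; v2 <- picked[2, 2]
    # Skip swaps that would create self-loops or duplicate edges
    if (length(unique(c(u1, v1, u2, v2))) < 4) next
    if (adj[u1, v2] == 1 || adj[u2, v1] == 1) next
    # Replace edges u1-v1 and u2-v2 with u1-v2 and u2-v1
    adj[u1, v1] <- adj[v1, u1] <- 0
    adj[u2, v2] <- adj[v2, u2] <- 0
    adj[u1, v2] <- adj[v2, u1] <- 1
    adj[u2, v1] <- adj[v1, u2] <- 1
  }
  adj
}

# Toy undirected network on six nodes (made-up edges)
el <- rbind(c(1, 2), c(1, 3), c(1, 4), c(2, 3), c(4, 5), c(5, 6))
adj <- matrix(0, 6, 6)
adj[el] <- 1
adj[el[, 2:1]] <- 1

set.seed(0)
adj_rand <- rewire_preserving_degree(adj, n_swaps = 100)

# Degrees, and hence degree rankings, are unchanged by the swaps
deg <- rowSums(adj)
deg_rand <- rowSums(adj_rand)
rank_cor <- cor(deg, deg_rand, method = "spearman")
```

Each accepted swap removes one edge from, and adds one edge to, each of the four nodes involved, so every node keeps its degree.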

motuwp is now an R package

Sat, 08 Feb 2020 00:00:00 +0000

My current project at Motu involves analysing co-authorship networks. It is helpful for me to have a small example network that I can use to, for example, compare sampling techniques. The Motu working paper co-authorship network is my go-to. Since I work mostly in R, I have converted the repository containing the underlying authorship data to an R package. This package can be installed from GitHub via remotes:

library(remotes)

install_github('bldavies/motuwp')

motuwp provides two data frames: papers, containing working paper attributes, and authors, containing author-paper pairs. These pairs can be used to construct a co-authorship network as follows:

library(igraph)
library(motuwp)

# Method 1: Project bipartite author-paper network onto author set
bip <- graph_from_data_frame(authors, directed = FALSE)
V(bip)$type <- V(bip)$name %in% authors$author
net <- bipartite_projection(bip, which = 'true', multiplicity = FALSE)

# Method 2: Use convenience function that returns same network
net <- coauthorship_network()

The co-authorship network net contains 185 nodes and 729 edges. These values are larger than the corresponding values of 82 and 218 reported in my mid-2019 blog post on the network. The increases are due to me adding (i) the remaining working papers from 2019, (ii) some papers with missing landing pages, and (iii) authors with no hyperlinked profile page on Motu’s website.

NBER (co-)authorships

Fri, 07 Feb 2020 00:00:00 +0000

I recently updated the R package nberwp to include data on NBER working paper authorships. These data describe a bipartite author-paper network containing 13,571 authors and 26,586 papers. On average, each author has 4.35 papers and each paper has 2.22 authors.

The co-authorship network among NBER authors—that is, the bipartite projection of the author-paper network onto the set of authors—contains 0.03% of the possible edges among the 13,571 authors in the network. On average, each author has 4.72 unique co-authors across the working paper series. About 95% of authors belong to a single connected component of the co-authorship network, while 139 authors have no co-authors.
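The edge share quoted above can be recovered from the other figures in this post. The sketch below is back-of-envelope arithmetic on the rounded values reported here, not a calculation on the underlying data, so the implied edge count is approximate.

```r
n_authors <- 13571
mean_coauthors <- 4.72

# Each edge adds one co-author to each of its two endpoints
n_edges <- n_authors * mean_coauthors / 2

# Number of possible edges among the authors
n_possible <- choose(n_authors, 2)

# Share of possible edges that are present, in percent
density_pct <- 100 * n_edges / n_possible
round(density_pct, 2)
```

Equivalently, the density equals the mean number of co-authors divided by the number of other authors, 4.72 / 13,570.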

One challenge that arises when constructing co-authorship networks is disambiguating authors’ names. Slight misspellings may split a single author into many nodes, while many authors with the same name may be merged into a single node. These false splits and merges inhibit one’s ability to draw robust inferences about collaborative behaviour from the co-authorship network’s structure.

It is easiest to disambiguate author names when they can be cross-referenced against other data. The NBER RePEc index, from which I extract the authorship data, links some authors to their RePEc author IDs. These IDs allow me to merge some authors who publish under varying names. I also merge authors with (i) sufficiently similar names and (ii) overlapping neighbourhoods in the co-authorship network. Criterion (i) assumes that authors’ names tend to vary from their true values by a few characters only, while criterion (ii) assumes that authors tend to write multiple papers with the same set of co-authors. Combined, these criteria form a computationally feasible heuristic for identifying and resolving false splits.
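A minimal sketch of the two criteria follows. It is not the code used to build nberwp: the helper flag_false_splits, the two-edit threshold, and the example names are all assumptions made for illustration. It measures name similarity with the edit distance from base R's adist and treats neighbourhoods as overlapping when two names share at least one co-author.

```r
# Hypothetical helper: flag pairs of author names that may be false splits
flag_false_splits <- function(author_names, coauthors, max_dist = 2) {
  # Criterion (i): names differ by at most max_dist character edits
  dists <- adist(author_names)
  pairs <- which(dists <= max_dist & upper.tri(dists), arr.ind = TRUE)
  # Criterion (ii): the two names share at least one co-author
  shares <- vapply(seq_len(nrow(pairs)), function(r) {
    length(intersect(coauthors[[pairs[r, 1]]], coauthors[[pairs[r, 2]]])) > 0
  }, logical(1))
  pairs <- pairs[shares, , drop = FALSE]
  data.frame(name_1 = author_names[pairs[, 1]],
             name_2 = author_names[pairs[, 2]])
}

# Made-up names and co-author lists
author_names <- c("Jane Smith", "Jane Smyth", "John Smith")
coauthors <- list(c("A. Lee", "B. Chen"), c("A. Lee"), c("C. Diaz"))
flagged <- flag_false_splits(author_names, coauthors)
```

Here only the pair "Jane Smith" and "Jane Smyth" is flagged: the names differ by one character and share a co-author, whereas "John Smith" is both further away in edit distance and has no co-authors in common.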

In contrast, I do not attempt to identify false merges. One method could be to look for authors who bridge otherwise distant parts of the co-authorship network. This method assumes that authors tend to sort into clusters (e.g., by research interest) and that links between clusters are uncommon. However, this assumption defies the empirical evidence that the co-authorship network among economists has a small-world structure (Goyal et al., 2006).

DeGroot learning in social networks

Mon, 27 Jan 2020 00:00:00 +0000

The first book on my reading list for 2020 was Matthew Jackson’s The Human Network. Its seventh chapter discusses DeGroot learning as a process for building consensus among members of a social network.

Consider a (strongly) connected social network among $n$ people. These people have private information that they use to form independent initial beliefs $b_1^{(0)},\ldots,b_n^{(0)}$ about the value of some parameter $\theta$. Recognising that their information sets may be incomplete, everyone updates their beliefs in discrete time steps by iteratively adopting the mean belief among their friends. This process spreads the information available to each individual throughout the network, allowing people’s beliefs to converge to a consensus estimate $\hat\theta$ of $\theta$.¹

The figure below presents an example of this setup. It shows the social network among eight people after zero, one, two, and three time steps. Nodes represent people, and are coloured according to the deviation of people’s beliefs above (orange) or below (purple) $\theta$'s true value (white). Edges represent mutual friendships. Over time, the information embedded in people’s initial beliefs diffuses throughout the network and the variation in beliefs around $\hat\theta$ collapses to zero.

People with more friends have more influence on the consensus estimate because they have more avenues through which to spread information. One can formalise this claim as follows. Let $b^{(t)}=(b_1^{(t)},\ldots,b_n^{(t)})$ be the $n\times 1$ vector of time $t$ beliefs. This vector evolves according to $$b^{(t+1)}=Wb^{(t)},$$ where $W=(W_{ij})$ is a row-stochastic $n\times n$ matrix with entries $W_{ij}$ equal to the (time-invariant) weight that person $i$ assigns to the beliefs of person $j$ at each time step. Notice that $b^{(t)}=W^tb^{(0)}$ and so the $n\times1$ vector $b^{(\infty)}=(\hat\theta,\ldots,\hat\theta)$ of consensus estimates is given by $$b^{(\infty)}=\lim_{t\to\infty}W^tb^{(0)}.$$

In the context of DeGroot learning in social networks, we have $$W_{ij}=\frac{A_{ij}+I_{ij}}{d_i+1},$$ where $A=(A_{ij})$ is the adjacency matrix for the social network, $d_i=\sum_{j=1}^nA_{ij}$ is person $i$'s degree in that network, and $I=(I_{ij})$ is the $n\times n$ identity matrix. Adding one in the numerator (if $i=j$) and denominator reflects person $i$ including their own beliefs when computing the mean among their friends.

The matrix $W$ describes a Markov chain $\mathcal{M}$ on the set of $n$ people. Assuming that the social network is (strongly) connected implies that $\mathcal{M}$ is irreducible and aperiodic. It follows from the Perron-Frobenius theorem that $$\lim_{t\to\infty}W^t=1_n\pi,$$ where $1_n$ is the $n\times1$ vector of ones and $\pi$ is a $1\times n$ row vector corresponding to the unique stationary distribution of $\mathcal{M}$; that is, $\pi$ uniquely solves $$\pi W=\pi$$ subject to the constraints that $\pi_j\ge0$ for each $j$ and $\sum_{j=1}^n\pi_j=1$.

Now, let $v$ be the $1\times n$ row vector with entries $v_j=(d_j+1)/\sum_{k=1}^n(d_k+1)$. Then $v_j\ge0$ for each $j$ and $\sum_{j=1}^nv_j=1$. Moreover, since $A$ is symmetric (and so $d_j=\sum_{i=1}^nA_{ij}$), $$\begin{align} (v W)_j &=\sum_{i=1}^nv_iW_{ij}\\ &=\sum_{i=1}^n\frac{d_i+1}{\sum_{k=1}^n(d_k+1)}\frac{A_{ij}+I_{ij}}{{d_i+1}}\\ &=\frac{d_j+1}{\sum_{k=1}^n(d_k+1)}\\ &=v_j \end{align}$$ for each $j$ so that $vW=v$ and therefore $\pi=v$ by uniqueness. Thus, the consensus estimate is given by $$\hat\theta=\frac{\sum_{k=1}^n(d_k+1)b_k^{(0)}}{\sum_{k=1}^n(d_k+1)}.$$ Finally, the influence that person $i$ has on $\hat\theta$ is captured by the partial derivative $$\frac{\partial\hat\theta}{\partial b_i^{(0)}}=\frac{d_i+1}{\sum_{k=1}^n(d_k+1)},$$ which is an increasing linear function of person $i$'s degree $d_i$ in the social network.
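The closed-form expression for $\hat\theta$ can be checked numerically. The sketch below assumes a small, made-up friendship network and arbitrary initial beliefs; it builds $W$ from the adjacency matrix, iterates the update rule, and compares the limit with the degree-weighted mean derived above.

```r
# Adjacency matrix for a small, made-up friendship network
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(3, 4), c(4, 5))
n <- 5
A <- matrix(0, n, n)
A[edges] <- 1
A[edges[, 2:1]] <- 1

# Row-stochastic updating matrix with entries (A_ij + I_ij) / (d_i + 1)
d <- rowSums(A)
W <- (A + diag(n)) / (d + 1)

# Initial beliefs (arbitrary values for illustration)
b0 <- c(0.2, 0.9, 0.4, 0.7, 0.1)

# Iterate b(t + 1) = W b(t) until beliefs have effectively converged
b <- b0
for (t in 1:500) b <- as.numeric(W %*% b)

# Closed-form consensus: mean of initial beliefs weighted by degree plus one
theta_hat <- sum((d + 1) * b0) / sum(d + 1)
```

After enough iterations every entry of b should agree with theta_hat up to floating-point error.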

Convergence is guaranteed if the social network is strongly connected (Golub and Jackson, 2005). ↩︎

White elephant gift exchanges

Wed, 11 Dec 2019 00:00:00 +0000

Motu’s staff Christmas party is this Friday. We’re planning a white elephant gift exchange: everyone contributes a wrapped gift to a common pool and sequentially chooses to either (i) unwrap a gift or (ii) steal a previously unwrapped gift. “Victims” of theft make the same choice, but previously stolen gifts cannot be re-stolen until a new gift is unwrapped. The exchange ends when the last gift is unwrapped.

Suppose I want to maximise the subjective value of the gift in my possession when the exchange ends. I must overcome two strategic challenges: I don’t know the subjective values of wrapped gifts, and I don’t know other players’ subjective values of wrapped or unwrapped gifts. Therefore, any strategy I adopt must account for uncertainty both in wrapped gifts’ subjective values and in the propensity of other players to steal unwrapped gifts I covet.

One strategy could be to always steal the unwrapped gift with the highest subjective value. This strategy is risky because my subjective valuations might correlate with those of other players, making it more likely I will become a victim of theft. I could hedge this risk by instead always stealing the unwrapped gift with the second highest subjective value (unless I’m the last player, in which case I would be better off stealing the most subjectively valuable gift because it can’t be re-stolen). Alternatively, I could play as a pacifist and never steal (unless I’m the last player).

I compare these three strategies—greediness, hedged greediness, and pacifism—via simulation. I assume gifts’ subjective values are determined as the mean of two standard uniform random variables: one describing an underlying value common to all players, and one describing an idiosyncratic component unique to each player. I simulate 1000 games among 30 players, randomising the strategies adopted by each player in each game.
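The value-generating assumption can be sketched as follows. This is an illustration rather than my simulation code; the variable names are made up, and I assume one gift per player.

```r
set.seed(0)
n_players <- 30
n_gifts <- 30

# Common component: one draw per gift, shared by all players
common <- runif(n_gifts)

# Idiosyncratic component: one draw per player-gift pair
idiosyncratic <- matrix(runif(n_players * n_gifts), n_players, n_gifts)

# Subjective values: mean of the two components
# (rows index players, columns index gifts)
common_mat <- matrix(common, n_players, n_gifts, byrow = TRUE)
values <- (common_mat + idiosyncratic) / 2

# Correlation between two players' valuations of the same gifts
cor(values[1, ], values[2, ])
```

The shared component is what makes players' valuations correlated, which is why always stealing the most valuable gift invites retaliation.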

For each simulated game, I compute the subjective value of the gift in each player’s possession when the exchange ends. I also compute the allocation that maximises aggregate (i.e., the sum of) subjective values. I refer to the subjective values in this allocation as “efficiency baselines,” and use them to compare strategies’ tendencies to deliver socially optimal allocations. I summarise my simulation results in the plot below.

Across all strategies, players whose turns arrive later in the game tend to be better off. Such players have more choices of gifts to steal and fewer opportunities to become victims of theft. Greedier players tend to end up with more subjectively valuable gifts, while pacifists—who never use victimisation as an opportunity to “trade up”—typically possess the least subjectively valuable gifts when the exchange ends. Only late and/or greedy players tend to do better than under the socially optimal allocation.

Choosing not to steal is risky because it may result in unwrapping a low-value gift that no other players want to steal. The first player, who cannot steal, is particularly exposed to this risk. The game could be made fairer by allowing the first player (and subsequent victims) to unilaterally swap gifts when everyone else has had their turn. This adjustment shifts the disadvantage to the second player, who, in the game’s pre-swap phase, has only two choices: steal from the first player or unwrap a new gift. Giving more players a second turn could improve the final gift allocation by giving early players a larger choice set.

The table below shows how the efficiency and equity of the final gift allocation vary with the number of early players given a second turn. I measure efficiency by the ratio of aggregate subjective values to aggregate efficiency baselines. I define equity as one minus the Gini coefficient for the distribution of subjective values. The table reports 95% confidence intervals across 1000 simulated games.

Players given second turn	Efficiency (%)	Equity (%)
0	83.9 ± 0.2	80.3 ± 0.2
1	88.8 ± 0.2	83.0 ± 0.2
2	88.7 ± 0.2	82.8 ± 0.2
3	88.6 ± 0.2	82.7 ± 0.2
4	88.6 ± 0.2	82.6 ± 0.2
5	88.3 ± 0.2	82.4 ± 0.2

Giving the first player a second turn makes the final allocation more efficient and more equitable. That player gets a chance to improve upon their initial endowment, and subsequent victims get a chance to reconsider their choices with more information about the distribution of gifts’ subjective values. However, on average, giving further players a second turn appears to push efficiency and equity back down.
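The efficiency and equity measures defined above can be sketched as follows. The Gini coefficient has several variants; this sketch assumes the version based on mean absolute differences with no small-sample correction, and the input values are made up.

```r
# Gini coefficient via mean absolute differences (one common definition)
gini <- function(x) {
  sum(abs(outer(x, x, "-"))) / (2 * length(x)^2 * mean(x))
}

# Equity as defined in the text: one minus the Gini coefficient
equity <- function(values) 1 - gini(values)

# Efficiency: aggregate realised values relative to aggregate baselines
efficiency <- function(values, baselines) sum(values) / sum(baselines)

# Made-up final values and efficiency baselines for four players
final_values <- c(0.4, 0.6, 0.7, 0.9)
baselines <- c(0.8, 0.7, 0.9, 0.8)
equity(final_values)
efficiency(final_values, baselines)
```

Under this definition, equity equals one when everyone ends up with equally valuable gifts and falls as final values become more unequal.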

Birds, voting, and Russian interference

Sun, 17 Nov 2019 00:00:00 +0000

Since 2005, Forest and Bird has run annual elections for New Zealand’s Bird of the Year. This week Radio New Zealand announced the yellow-eyed penguin as 2019’s winner. A follow-up tweet by Forest and Bird raised suspicions about possible Russian interference in the vote’s outcome.

Forest and Bird’s tweet includes a world map with countries coloured by voter turnout. The bar chart below presents the same information in a less exciting format.¹

Russian votes account for 193 of the 15,044 votes with known country of origin. New Zealand contributed 12,651 such votes. Fully 28,416 votes had unknown origin and were excluded from the set of votes used to determine the winning bird.

This year’s election used an instant-runoff system. Voters reported up to five of their favorite birds, ranked in order of preference. Beginning with voters’ first preferences, birds with the fewest votes were eliminated sequentially and their votes reallocated to voters’ next favorites. This process continued until one bird remained.
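The counting process described above can be sketched as follows. This is a minimal illustration, not Forest and Bird's procedure: the helper instant_runoff, the tie-breaking rule, and the five ballots are assumptions. Ballots that rank no remaining candidate are simply ignored in that round.

```r
# Hypothetical helper: run an instant-runoff count
# ballots: list of character vectors ranking candidates from most to least preferred
instant_runoff <- function(ballots) {
  remaining <- sort(unique(unlist(ballots)))
  eliminated <- character(0)
  while (length(remaining) > 1) {
    # Each ballot counts towards its highest-ranked remaining candidate
    top_choices <- vapply(ballots, function(ballot) {
      ballot <- ballot[ballot %in% remaining]
      if (length(ballot) > 0) ballot[1] else NA_character_
    }, character(1))
    tallies <- table(factor(top_choices, levels = remaining))
    # Eliminate the candidate with the fewest votes
    # (assumption: ties go against the alphabetically first candidate)
    loser <- remaining[which.min(tallies)]
    eliminated <- c(eliminated, loser)
    remaining <- setdiff(remaining, loser)
  }
  list(winner = remaining, eliminated = eliminated)
}

# Made-up ballots over three candidate labels
ballots <- list(
  c("kakapo", "penguin"),
  c("kakapo", "penguin"),
  c("penguin", "kakapo"),
  c("penguin"),
  c("robin", "penguin")
)
result <- instant_runoff(ballots)
```

In this example "robin" is eliminated first with one vote, its ballot transfers to "penguin", and "penguin" then beats "kakapo" by three votes to two.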

The table below reports the last five birds eliminated by the instant-runoff process among the votes cast from anywhere, from known countries, from New Zealand, from Russia, and from known countries excluding Russia. The bracketed percentages represent the share of voters in each group who preferred each of the top two candidates in the final round. For example, 61.6% of New Zealanders with preferences over the yellow-eyed penguin and the kākāpō preferred the former.

Place	All countries	Known countries	New Zealand	Russia	Known countries ex. Russia
1	Yellow-eyed penguin (52.4%)	Yellow-eyed penguin (58.7%)	Yellow-eyed penguin (61.6%)	Kākāpō (52.0%)	Yellow-eyed penguin (59.0%)
2	Kākāpō (47.6%)	Kākāpō (41.3%)	Kākāpō (38.4%)	Black Robin (48.0%)	Kākāpō (41.0%)
3	Black Robin	Banded Dotterel	Banded Dotterel	Barn Owl	Banded Dotterel
4	Banded Dotterel	Black Robin	Black Robin	Antipodean Albatross	Black Robin
5	Fantail	Kākā	Fantail	Southern Brown Kiwi	Kākā

Excluding votes from unknown countries did not affect which bird won. New Zealand voters got the outcome for which they voted, whereas Russian voters would have crowned the kākāpō. Removing Russian votes wouldn’t have changed the election outcome—to the extent that Russians did interfere with the vote, their interference was not successful.

The data used in this post are copyright Forest and Bird, and are released under a CC BY 4.0 license. They are available here. ↩︎

How central is Grand Central Terminal?

Thu, 14 Nov 2019 00:00:00 +0000

I spent most of October travelling in the United States. I visited a range of large cities with correspondingly large subway systems. New York City’s is the most extensive, containing more stops than any other subway system in the world. Its crown jewel, Grand Central Terminal, provides access to many cultural and commercial attractions in Midtown Manhattan.

But just how central is Grand Central?

To help me answer this question, I created an R package nyctrains that provides data on the NYC subway network. These data include scheduled travel times between subway stops. I use these times to construct a travel-time-weighted directed network in which stops are adjacent if they occur consecutively along any route. I exclude stops along the Staten Island Railway, which is disconnected from the rest of the system. The plot below maps the resulting network, with nodes positioned by latitude/longitude and with edges coloured by route. (Some routes overlap.)

Estimating Grand Central’s centrality requires choosing a measure. One candidate is betweenness centrality. Stops are more betweenness-central if trains are more likely to pass through them when taking the fastest route between other stops.

Another candidate measure is closeness centrality. Stops are more (out-)closeness-central if they have shorter mean fastest travel times to all other stops. In the NYC subway network, some of these times are infinite because the network is not strongly connected. For example, it is not possible to get from Grand Central to Aqueduct Racetrack without exiting the subway system.

Closeness centrality measures the extent to which stops provide fast access to other stops. Another way to measure such access is to count the number of stops that can be reached within a specified time. For example, the chart below shows the number of stops that can be reached from Grand Central and Broadway Junction within an hour.

The number of stops reachable from Grand Central dominates the corresponding number from Broadway Junction for all but the smallest travel time allowances. One way to operationalise this fact is to observe that the area below the red curve exceeds the area below the blue curve. In general, the area below the cumulative reach curve is larger for stops that provide access to more stops in less time. I compute this area for each stop as a measure of what I call “reach” centrality.¹
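The area below a cumulative reach curve has a simple closed form once the fastest travel times from a stop are known. The sketch below is an illustration under assumptions: the helper reach_area, the 60-minute horizon, and the travel times are made up, and nyctrains is not used.

```r
# Area under the cumulative reach curve over [0, horizon]
# times: fastest travel times (in minutes) from one stop to every other stop
reach_area <- function(times, horizon = 60) {
  # A stop first reached after `time` minutes stays counted for the
  # remaining (horizon - time) minutes; unreachable stops contribute zero
  sum(pmax(horizon - times, 0))
}

# Made-up travel times from two hypothetical stops; Inf marks unreachable stops
times_a <- c(5, 12, 20, 45, Inf)
times_b <- c(15, 30, 50, 70, Inf)
reach_area(times_a)
reach_area(times_b)
```

The first stop has the larger area because it reaches more stops, and reaches them sooner, within the hour.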

The table below reports betweenness and reach centralities for the ten most betweenness-central stops in the NYC subway network, excluding stops on Staten Island. I normalise centralities to have maximum values equal to unity.

Stop	Borough	Betweenness rank (value)	Reach rank (value)
Lexington Av / 59 St	Manhattan	1 (1.000)	23 (0.973)
125 St	Manhattan	2 (0.975)	118 (0.870)
Jay St - MetroTech	Brooklyn	3 (0.959)	46 (0.951)
86 St	Manhattan	4 (0.952)	81 (0.926)
Atlantic Av-Barclays Ctr	Brooklyn	5 (0.851)	92 (0.914)
149 St - Grand Concourse	Bronx	6 (0.794)	158 (0.814)
Grand Central - 42 St	Manhattan	7 (0.777)	3 (0.991)
14 St - Union Sq	Manhattan	8 (0.774)	1 (1.000)
Court Sq - 23 St	Queens	9 (0.763)	42 (0.953)
Broadway Junction	Brooklyn	10 (0.747)	172 (0.802)

Grand Central is the third most reach-central stop but only the seventh most betweenness-central, contributing to 22% fewer shortest paths than Lexington Avenue/59th Street station. Broadway Junction is less reach-central than Grand Central, consistent with the chart above, but almost as betweenness-central. The figure below shows the distribution of betweenness and reach centrality across the 424 stops in the network.

Betweenness-central nodes belong to many shortest paths, and so tend to congregate along bottlenecks and highways. For example, seven of the ten most betweenness-central stops in the NYC subway network provide access to the Lexington Avenue Express (routes 4, 5 and 5X), which is the fastest—but not only—route between Brooklyn and the Bronx. In contrast, reach centrality emanates from mid/lower Manhattan, which (i) is geographically dense with mutually nearby subway stops and (ii) contains the fastest inter-borough connections.

This approach could be improved by adjusting for variation in stops’ access to unique amenities so that some stops are more valuable to reach than others. However, this variation is not observable in my data. ↩︎

Climate change and transport planning

Wed, 06 Nov 2019 00:00:00 +0000

Last year I organised a dialogue on climate change adaptation within New Zealand’s transport sector. The purpose of the dialogue was to facilitate discussions between researchers, stakeholders and government on adaptation issues relevant to the sector. Motu Note #40, released today, summarises those discussions.

Climate change has uncertain supply-side impacts on transport because we don’t know when, where, or how large the events that threaten to damage our infrastructure (e.g., storms, floods and landslides) will be. These impacts affect other parts of the transport network by diverting flows away from damaged areas and putting pressure on alternative routes.

Climate change also has uncertain demand-side impacts on transport through impacts on sectors that use the network. For example, climate change may trigger land use changes by altering the yields of different crops or the attractiveness of different settlement areas. These changes shift the spatial allocation of human activity and, consequently, shift users’ derived demand for transport infrastructure. However, it is unclear how people will vary their land use in response to climate change because such responses involve complex tradeoffs between economic, social and cultural factors.

The uncertainty around climate change impacts creates challenges for transport planners, who must forecast climate change itself, how people will respond to the change, and how those responses translate into spatial shifts in the derived demand for transport.

One solution is to apply real options analysis (ROA) to transport planning and investment decisions. ROA extends traditional cost-benefit analyses by accounting for managerial flexibility in response to the realisation of future uncertainties, such as the time and place of climate change impacts. For example, ROA provides tools for valuing the ability to abandon roads that get flooded during storms. These tools help planners identify investments that meet users’ needs across a range of climate change scenarios.

However, real options provide an incentive to delay investments in order to draw more samples from, and thereby learn more about, the temporal and spatial distributions of climate change impacts. Such delays halt investment decisions made by transport network users, who rely on the network to conduct economic and social activities, and by utility providers, who provide services that co-locate with transport infrastructure.

My coauthors and I discuss these issues further in Motu Note #40.

Computing epicycles

Sun, 03 Nov 2019 00:00:00 +0000

Earlier this year Grant Sanderson, creator of the YouTube channel 3Blue1Brown, posted a video explaining how Fourier series approximate periodic functions using sums of sines and cosines. In the video and its companion, Grant animates sets of vectors that rotate on circular orbits and, when summed together, reproduce a range of images defined by closed curves.

Consider, for example, the boundary of GitHub’s logo:

Let $\gamma:[0,1]\to\mathbb{R}^2$ be the closed curve in $\mathbb{R}^2$ defining the logo’s boundary. Suppose there is an integer $n$ such that $$\gamma(t) = \sum_{k=-n}^n \gamma_k(t)$$ for some set of circular orbits $\gamma_{-n},\ldots,\gamma_n:[0,1]\to\mathbb{R}^2$ and for all times $t\in[0,1]$. (Negative and positive subscripts correspond to clockwise and anti-clockwise orbits. Both may be necessary to reconstruct $\gamma$.) Each orbit $\gamma_k$ has time $t$ position defined by the vector $$\gamma_k(t) = \begin{bmatrix} r_k \cos(2\pi k t + \theta_k) \\ r_k \sin(2\pi k t + \theta_k) \end{bmatrix}$$ for some radius $r_k$, angular speed $2\pi k$ rad/s and initial phase $\theta_k$. Consequently, the curves $x,y:[0,1]\to\mathbb{R}$ defining the horizontal and vertical components of $\gamma$ must satisfy the system $$\begin{align} x(t) &= \sum_{k=-n}^n r_k\cos(2\pi k t + \theta_k) \\ y(t) &= \sum_{k=-n}^n r_k\sin(2\pi k t + \theta_k) \end{align}$$ of identities. Let $z:[0,1]\to\mathbb{C}$ be the curve with $z(t)=x(t)+iy(t)$ for all $t\in[0,1]$. Euler’s formula gives $$\begin{align} z(t) &= \sum_{k=-n}^n r_k(\cos(2\pi k t + \theta_k) + i \sin(2\pi k t + \theta_k)) \\ &= \sum_{k=-n}^n r_k \exp(2\pi i k t + i\theta_k) \\ &= \sum_{k=-n}^n c_k \exp(2\pi i k t), \end{align}$$ where each Fourier coefficient $c_k=r_k\exp(i\theta_k)$ has modulus $\lvert c_k\rvert=r_k$ and (principal) argument $\mathrm{Arg}(c_k)=\theta_k$. Now, notice that $$\begin{align} \int_0^1 z(t) \exp(-2\pi i k t)\, \mathrm{d}\,t &= \int_0^1\left(\sum_{j=-n}^n c_j \exp(2\pi i j t)\right)\exp(-2\pi i k t)\, \mathrm{d}\,t \\ &= \int_0^1c_k\, \mathrm{d}\,t + \sum_{j\not=k} c_j \int_0^1 \exp(2\pi i (j - k)t)\, \mathrm{d}\,t \\ &= c_k \end{align}$$ for each $k$ because $$\int_0^1 \exp(2\pi i (j - k)t)\, \mathrm{d}\,t = 0$$ for all integers $j\not=k$ by the $2\pi i$-periodicity of the complex exponential function. 
Thus $$c_k = \int_0^1 z(t) \exp(-2\pi i k t)\, \mathrm{d}\, t,$$ which can be calculated using Riemann sums given sample points along the component curves $x$ and $y$. Doing this calculation for each $k$, and computing the corresponding moduli $r_{-n},\ldots,r_n$ and arguments $\theta_{-n},\ldots,\theta_n$, provides enough information to generate the animation below.
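The Riemann-sum calculation can be sketched as follows. This is an illustration rather than the code behind the animation: the helper fourier_coefs is hypothetical, and instead of the logo's boundary it uses a made-up curve built from two known orbits so that the recovered coefficients can be checked.

```r
# Approximate c_k for k = -n, ..., n using a left Riemann sum
# z: complex samples z(t) taken at N equally spaced times t = 0, 1/N, ..., (N - 1)/N
fourier_coefs <- function(z, n) {
  N <- length(z)
  t <- (seq_len(N) - 1) / N
  k <- -n:n
  coefs <- vapply(k, function(kk) mean(z * exp(-2i * pi * kk * t)), complex(1))
  names(coefs) <- k
  coefs
}

# Made-up curve: an anti-clockwise orbit (k = 1) with radius 2 and phase 0,
# plus a clockwise orbit (k = -3) with radius 0.5 and phase 1
N <- 200
t <- (seq_len(N) - 1) / N
z <- 2 * exp(2i * pi * t) + 0.5 * exp(1i) * exp(-2i * pi * 3 * t)

n <- 5
coefs <- fourier_coefs(z, n)
radii <- Mod(coefs)
phases <- Arg(coefs)

# Reconstruct the curve by summing the orbits
z_hat <- complex(N)
for (k in -n:n) z_hat <- z_hat + coefs[[as.character(k)]] * exp(2i * pi * k * t)
```

The recovered radii and phases should match the two orbits used to build the curve, with all other radii numerically zero.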

Introducing nberwp

Tue, 24 Sep 2019 00:00:00 +0000

Today I published nberwp, an R package providing data on NBER working papers published between 1973 and 2018. It can be installed from GitHub via remotes:

library(remotes)
install_github('bldavies/nberwp')

nberwp provides a data frame papers, each row describing a unique working paper:

papers

## # A tibble: 25,413 x 4
##    number  year month title                                                     
##     <int> <int> <int> <chr>                                                     
##  1      1  1973     6 Education, Information, and Efficiency                    
##  2      2  1973     6 Hospital Utilization: An Analysis of SMSA Differences in …
##  3      3  1973     6 Error Components Regression Models and Their Applications 
##  4      4  1973     7 Human Capital Life Cycle of Earnings Models: A Specific S…
##  5      5  1973     7 A Life Cycle Family Model                                 
##  6      6  1973     7 A Review of Cyclical Indicators for the United States: Pr…
##  7      7  1973     8 The Definition and Impact of College Quality              
##  8      8  1973     9 Multinational Firms and the Factor Intensity of Trade     
##  9      9  1973     9 From Age-Earnings Profiles to the Distribution of Earning…
## 10     10  1973     9 Monte Carlo for Robust Regression: The Swindle Unmasked   
## # … with 25,403 more rows

number uniquely identifies working papers by their positions in the series, while year and month capture papers’ publication dates. The chart below uses these dates to show the NBER catalogue’s expansion.

title facilitates simple text mining, such as determining which words are used in working paper titles most frequently:

library(dplyr)
library(tidytext)

words <- papers %>%
  unnest_tokens(word, title) %>%
  anti_join(get_stopwords()) %>%
  filter(nchar(gsub('[a-z.]', '', word)) == 0) %>%
  distinct(number, word)

words %>%
  count(word, sort = TRUE)

## # A tibble: 11,636 x 2
##    word         n
##    <chr>    <int>
##  1 evidence  2615
##  2 policy    1350
##  3 market    1322
##  4 effects   1193
##  5 trade     1052
##  6 capital    979
##  7 labor      940
##  8 economic   910
##  9 u.s        882
## 10 health     875
## # … with 11,626 more rows

Many papers discuss capital and labour markets, and the effects of public policies. The word “evidence” appears in almost twice as many titles as the next most common (non-stop) word, which I suspect reflects the growing use of the “<Issue>: Evidence from <context>” title format:

The NBER’s RePEc index, from which I derive papers, also contains data linking papers to their authors. I plan to include these data in a future version of nberwp once I’ve disambiguated authors’ names.

Information gerrymandering

Sat, 14 Sep 2019 00:00:00 +0000

Last week Nature published “Information Gerrymandering and Undemocratic Decisions,” an article analysing the effect of peer influences on the outcome of collective decisions.

Suppose, for example, that a 24-member committee must collectively decide whether to adopt a new policy. The committee agrees to make the decision by vote, and will action whichever choice—accept or reject—wins a two-thirds majority. One week before the vote, half of the committee members support the policy and half want it rejected. Fearing stagnation, each member updates their position daily to match the majority among their six most trusted colleagues. This update process allows committee members to influence each others’ positions, potentially shifting the split vote to a decisive majority.

Assuming trust is pairwise mutual, the “influence network” among committee members can be modelled as a 6-regular graph on 24 vertices, with edges connecting influencers. The function below uses this regular graph model to simulate the outcome of many votes:

simulate_votes <- function(n_votes, committee_size, n_influences, n_days) {
  
  # Create regular graph and identify neighbourhoods
  # (each neighbourhood returned by ego() includes the member themselves)
  net <- igraph::sample_k_regular(committee_size, n_influences)
  nb <- igraph::ego(net, order = 1)
  
  # Define function for simulating one vote
  simulate_one <- function(vote) {
    accepts <- vector('double', n_days)
    init_positions <- sample(rep(c(0, 1), committee_size %/% 2), replace = FALSE)
    positions <- init_positions
    for (day in seq_len(n_days)) {
      positions <- purrr::map_dbl(nb, ~(1 * (mean(positions[.]) >= 0.5)))
      accepts[day] <- committee_size * mean(positions)
    }
    list(init_positions = init_positions, accepts = accepts)
  }
  
  # Simulate many votes
  votes <- lapply(seq_len(n_votes), simulate_one)
  
  # Return results
  list(network = net, results = votes)
}

simulate_one randomises committee members’ initial positions (encoding “accept” as one and “reject” as zero) before updating these positions based on neighbouring majorities. Running simulate_one many times allows me to simulate the committee’s decision for an ensemble of random initial positions on the influence network, which simulate_votes generates once and reuses across votes. The last few lines of simulate_votes generate this ensemble and output the simulation results.

Let’s simulate the committee’s vote 1000 times, including one week of daily position updates, and tabulate the simulated decision frequencies:

# Load dplyr for tibble(), mutate(), count() and %>%
library(dplyr)

# Run simulations
committee_size <- 24
set.seed(0)
votes <- simulate_votes(1000, committee_size, 6, 7)

# Define function for converting vote counts to committee decisions
get_decision <- function(accepts) {
  dplyr::case_when(
    accepts >= committee_size * 2 / 3 ~ 'Accept',
    accepts <= committee_size / 3 ~ 'Reject',
    TRUE ~ 'Deadlock'
  )
}

# Tabulate decision frequencies
tibble(accepts = purrr::map_dbl(votes$results, ~tail(.$accepts, 1))) %>%
  mutate(Decision = get_decision(accepts)) %>%
  count(Decision, name = 'Frequency') %>%
  knitr::kable(align = 'c')

Decision	Frequency
Accept	292
Deadlock	429
Reject	279

Variation in decisions comes from variation in the influence network’s structure. To see how, let $\Delta_i$ denote the proportion of committee member $i$'s influencers with the same initial position on the policy as member $i$, and define $$a_i = \begin{cases} \Delta_i & \text{if}\ \Delta_i\ge 1/2\\ -(1 - \Delta_i) & \text{otherwise}. \end{cases}$$ The variable $a_i$ captures the “influence assortment” of committee member $i$. Positive influence assortment means that they mainly agree with their influencers; negative influence assortment means that they mainly disagree.

Now let $\mathcal{A}$ and $\mathcal{R}$ be the sets of committee members whose initial positions are to accept and reject the policy, and consider the difference $$G = \frac{1}{\lvert\mathcal{A}\rvert}\sum_{i\in\mathcal{A}} a_i - \frac{1}{\lvert\mathcal{R}\rvert}\sum_{j\in\mathcal{R}} a_j$$ in mean influence assortments between these sets. The “influence gap” $G$ is greater than zero precisely when committee members in $\mathcal{A}$ are, on average, more positively influence assorted than committee members in $\mathcal{R}$.
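The definitions of $\Delta_i$, $a_i$ and $G$ can be sketched as follows. This is an illustration with a made-up four-member network, not the code behind the scatter plot; the helper influence_gap is hypothetical and follows the prose definition, in which a member's influencers exclude the member themselves.

```r
# Hypothetical helper: influence gap for an undirected influence network
# adj: symmetric 0/1 adjacency matrix; positions: 1 = accept, 0 = reject
influence_gap <- function(adj, positions) {
  # same[i, j] is TRUE when members i and j share an initial position
  same <- outer(positions, positions, `==`)
  # Share of each member's influencers who agree with them
  delta <- rowSums(adj * same) / rowSums(adj)
  # Influence assortment
  a <- ifelse(delta >= 0.5, delta, -(1 - delta))
  mean(a[positions == 1]) - mean(a[positions == 0])
}

# Made-up network: member 1 is linked to everyone, and members 2 and 3 are linked
edges <- rbind(c(1, 2), c(1, 3), c(1, 4), c(2, 3))
adj <- matrix(0, 4, 4)
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

# Members 1 and 2 initially accept; members 3 and 4 initially reject
positions <- c(1, 1, 0, 0)
G <- influence_gap(adj, positions)
```

Here the accepting members have influence assortments of -2/3 and 1/2, while both rejecting members have assortments of -1, so $G=11/12$: the rejecting members are surrounded by members who disagree with them.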

The scatter plot below shows that $G$ correlates positively with the probability that the committee accepts the policy. Intuitively, positive influence gaps characterise influence networks with disproportionately many neighbouring majorities in favour of acceptance, which, consequently, makes voting to accept the policy more likely.

The relationship between influence gaps and vote outcomes creates an incentive to gerrymander the influence network to make preferred outcomes more likely. For example, a subset of committee members wanting to accept the policy could cooperate to gain the trust of specific members so as to construct a positive influence gap. In political and legal contexts (e.g. elections and jury votes), bad actors may act on the incentive to gerrymander voters’ influences and, in doing so, pervert the democratic process.

The Nature article extends my model in three ways:

it generalises to directed influence networks by relaxing the assumption of pairwise mutual trust;
it uses a more elaborate rule for updating positions;
it introduces stubborn committee members (“zealots”) who never change their position.

However, none of these extensions change the model’s prediction: gerrymandering influence networks can lead to undemocratic decision-making by biasing the outcome of otherwise-split votes.

Sampling the Motu coauthorship network

Tue, 30 Jul 2019 00:00:00 +0000

Suppose I have some data that describe a bipartite author-publication network. I want to analyse the underlying coauthorship network—that is, the bipartite projection onto the set of authors—but I can’t compute that network because the data are too large to fit into memory. Instead, I estimate properties of the full coauthorship network by sampling the author-publication incidence data before computing the bipartite projection.

If the incidence data are stored as a matrix then I can sample its rows or columns, which corresponds to sampling the author or publication sets. If the incidence data are stored as a list of author-publication pairs then I can sample these pairs, which corresponds to sampling edges in the bipartite network.
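As a rough sketch of the three sampling schemes, assuming the incidence data fit in a small binary matrix `inc` with authors in rows and publications in columns: the matrix below is randomly generated, and the helper names (`project`, `sample_authors`, `sample_pubs`, `sample_edges`) are hypothetical rather than the functions used for the results that follow.

```r
set.seed(0)

# Hypothetical incidence matrix: rows are authors, columns are publications
inc <- matrix(rbinom(8 * 12, 1, 0.3), nrow = 8,
              dimnames = list(paste0('a', 1:8), paste0('p', 1:12)))

# Project onto authors: two authors are adjacent if they share a publication
project <- function(inc) {
  # Drop authors with no publications and publications with no authors
  inc <- inc[rowSums(inc) > 0, colSums(inc) > 0, drop = FALSE]
  adj <- (tcrossprod(inc) > 0) * 1
  diag(adj) <- 0
  adj
}

# Author sampling: keep a random subset of rows
sample_authors <- function(inc, rate) {
  inc[sample(nrow(inc), round(rate * nrow(inc))), , drop = FALSE]
}

# Publication sampling: keep a random subset of columns
sample_pubs <- function(inc, rate) {
  inc[, sample(ncol(inc), round(rate * ncol(inc))), drop = FALSE]
}

# Edge sampling: keep a random subset of author-publication pairs
sample_edges <- function(inc, rate) {
  edges <- which(inc == 1)
  keep  <- edges[sample(length(edges), round(rate * length(edges)))]
  out   <- inc * 0
  out[keep] <- 1
  out
}

adj_full    <- project(inc)
adj_authors <- project(sample_authors(inc, 0.5))
adj_pubs    <- project(sample_pubs(inc, 0.5))
adj_edges   <- project(sample_edges(inc, 0.5))
```

Each sampled adjacency matrix can then be passed to the same routines used to measure the full network.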

Which of these three methods—author, publication and edge sampling—most reliably estimates the full coauthorship network’s properties?

To develop some intuition, I apply each sampling method to the coauthorship network among Motu researchers. The data describing this network are small enough that I can compute the true values of various network properties, which I compare with the sampling distributions of such values generated by each sampling method.

The table below reports the 95% confidence intervals for each property under each method, in all cases sampling (uniformly at random and without replacement) about half of the corresponding entities (i.e., authors, publications or edges) before computing the bipartite projection onto the set of authors.¹

Property	True value	Author sampling	Pub. sampling	Edge sampling
Order	82.00	42.00 ± 0.00	64.40 ± 0.60	64.90 ± 0.60
Size	218.00	56.10 ± 2.60	137.40 ± 3.40	81.80 ± 1.40
Density (%)	6.56	6.50 ± 0.30	6.70 ± 0.20	4.00 ± 0.10
Mean distance	2.52	2.60 ± 0.10	2.70 ± 0.00	3.10 ± 0.00
Transitivity (%)	30.91	30.80 ± 2.00	31.90 ± 1.30	24.80 ± 1.20

All three methods under-estimate the order and size of the full coauthorship network. However, this is partly by construction: sampling any proportion of authors will always deliver that proportion of nodes in the coauthorship network, and taking a strict subset of publications or edges will generally omit some inter-author connections.

Author and publication sampling deliver accurate density and transitivity estimates. Edge sampling is less accurate: it produces relatively sparse networks in which authors are more distant, and less likely to share common coauthors, than in the full network.

The chart below plots the sample means and 95% confidence intervals generated by each sampling method for varying sampling rates. (A sampling rate of p% means that I randomly select p% of the corresponding entities before computing the coauthorship network.) As the sampling rate rises, the sample means converge to the true value. I vertically nudge the plotted points to prevent overlaps and make it easier to compare methods at each sampling rate.

Publication sampling over-estimates the coauthorship network’s density at low sampling rates. This could be because most working papers are written by authors in the densely connected core of the coauthorship network, so publication sampling is more likely to recover this core than the less connected and less productive periphery.

Edge sampling appears to generate biased density and transitivity estimates. Intuitively, pairs of sampled edges are unlikely to be incident with the same publication and thus unlikely to form an edge in the bipartite projection.
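A back-of-the-envelope calculation makes this intuition concrete; it is a sketch under simplifying assumptions rather than a formal result. Suppose two authors share exactly one publication, so that their coauthorship tie survives only if both of the corresponding author-publication pairs are sampled. Sampling $k$ of the $m$ pairs uniformly at random and without replacement, at sampling rate $p=k/m$, gives $$\Pr(\text{tie survives under edge sampling}) = \frac{k(k-1)}{m(m-1)} \approx p^2,$$ whereas sampling the same share $p$ of publications retains the shared publication, and hence the tie, with probability $p$. At $p=1/2$ this is roughly one quarter versus one half, which is in line with edge sampling retaining far fewer coauthorship ties than publication sampling in the table above.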

All three methods under-estimate the mean distance between authors at low sampling rates but over-estimate this distance at high sampling rates. This pattern arises because the distance calculation considers connected nodes only. At low sampling rates, most connected components are dyads or triads, and so the distances between connected nodes are small. The number of nodes in each component rises with the sampling rate, which leads to mean distance over-estimates until the number of edges within each component catches up.
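The toy example below illustrates the mechanism, assuming distances are computed with igraph's `mean_distance` and `unconnected = TRUE` (whether this matches the calculation behind the chart is an assumption). Both graphs are hypothetical.

```r
library(igraph)

# Sparse sample: two disconnected dyads
dyads <- make_graph(c(1, 2, 3, 4), directed = FALSE)

# Denser sample: a single path through the same four nodes
path <- make_graph(c(1, 2, 2, 3, 3, 4), directed = FALSE)

# unconnected = TRUE averages over connected pairs only
d_dyads <- mean_distance(dyads, unconnected = TRUE)
d_path  <- mean_distance(path, unconnected = TRUE)
```

The dyads have a mean distance of one because only adjacent pairs are connected, while joining them into a path adds longer routes and raises the mean distance to 10/6.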

Within each sample, I delete authors with no publications and publications with no authors. ↩︎

Updating motuwp

Sun, 28 Jul 2019 00:00:00 +0000

Today I updated the motuwp GitHub repository, which stores data on Motu working papers and their authors. I made three main changes:

First, I switched from BeautifulSoup to rvest for scraping the working paper directory. My original Python script used a bunch of regex commands to build the list of working paper URLs, despite warnings that regular expressions and HTML generally don’t cooperate. I should have just used CSS selectors, which I now do using data.R.

Second, I implemented a caching mechanism for passing information between runs of data.R. The script queries only papers released since the last run, so adding new papers is faster and requires fewer HTTP requests.

Third, I added working paper titles to the information collected. This allows me to, for example, use tf-idf scores to characterise research areas:
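A minimal sketch of the tf-idf calculation in base R, using a handful of made-up titles grouped into hypothetical research areas (the real titles live in the motuwp repository). It treats each area as a document and uses the common definition in which the inverse document frequency is the log of the number of areas divided by the number of areas containing the word.

```r
# Made-up titles grouped into hypothetical research areas
titles <- data.frame(
  area  = c('environment', 'environment', 'housing', 'housing'),
  title = c('emissions pricing and productivity',
            'agricultural emissions and water quality',
            'housing supply and land prices',
            'regional housing affordability'),
  stringsAsFactors = FALSE
)

# Count words within each area
words  <- strsplit(tolower(titles$title), ' ')
counts <- table(area = rep(titles$area, lengths(words)), word = unlist(words))

# Term frequency: share of an area's words accounted for by each word
tf <- counts / rowSums(counts)

# Inverse document frequency: log of (areas / areas containing the word)
idf <- log(nrow(counts) / colSums(counts > 0))

# tf-idf: high for words that are frequent in one area but rare in others
tf_idf    <- sweep(tf, 2, idf, `*`)
top_words <- apply(tf_idf, 1, function(x) names(which.max(x)))
```

Words that appear in every area, such as "and" here, receive a score of zero, so the highest-scoring words are those that distinguish one area from the others.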

College degrees in the US: Community detection

Sat, 27 Jul 2019 00:00:00 +0000

In my last post, I compared measures of similarity among college degree fields. My goal in this post is to partition the set of fields such that each field has greater within-part similarities than between-part similarities. One approach is to hierarchically cluster fields based on their similarities, producing a dendrogram that can be cut at different heights to obtain different partitions. Generating the dendrogram restricts my choice set but, ultimately, I still have to choose which partition is “best.”

The intellectually honest way forward is to define an objective function on the set of partitions and choose the partition that obtains the function’s maximum. One such function is network modularity, which captures the extent to which groups of nodes are intra-connected densely but inter-connected sparsely. Ranking partitions by modularity removes the need for supervision: rather than making a subjective, potentially biased judgment on which partition is “best,” I simply choose the partition that maximises modularity.

Unfortunately, maximising modularity is hard. In most cases, finding the globally optimal partition is infeasible and a heuristic algorithm must be used to find an approximate solution. Clauset et al. (2004) suggest a greedy algorithm:

1. Assign every node to a unique “community.”
2. Find the pair of communities whose union delivers the greatest increase in modularity. Replace these communities with their union.
3. Repeat step 2 until the modularity gain is negative or only one community remains.

The term “community” refers to a set of nodes and stems from the use of network science to probe the community structure of social interactions.
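The steps above can be tried on a small network using igraph's `cluster_fast_greedy`, which its documentation attributes to Clauset et al. (2004). This is a sketch on Zachary's karate club network, not the degree field data, and it is an assumption that this function matches the implementation behind the table below. When a graph has a `weight` edge attribute, `cluster_fast_greedy` uses it by default.

```r
library(igraph)

# Zachary's karate club network, a standard example bundled with igraph
g <- make_graph('Zachary')

# Greedy agglomeration in the style of Clauset et al. (2004)
comm <- cluster_fast_greedy(g)

n_communities <- length(comm)   # Number of communities detected
sizes_by_comm <- sizes(comm)    # Number of nodes in each community
q <- modularity(comm)           # Modularity of the chosen partition
```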

I apply Clauset et al.’s algorithm to the networks defined using the co-occurrence, Dice, Jaccard, Ochiai and overlap measures discussed in my previous post, as well as the unweighted network in which fields are adjacent if at least one graduate studied them both. The table below presents the number and size of communities detected in each network, and the corresponding maximised modularity values.

Network	Communities	Fields	Community sizes (millions of graduates)	Modularity
Co-occurrences	6	19–51	6.4–19.3	0.380
Dice	8	11–50	1.7–16.7	0.456
Jaccard	8	11–50	1.7–19.8	0.457
Ochiai	8	13–40	1.4–15.8	0.433
Overlap	8	9–30	0.9–17.6	0.423
Unweighted	3	11–84	3.5–42.2	0.118

Clauset et al.’s algorithm detects eight communities in the Dice, Jaccard, Ochiai and overlap similarity networks, with each community containing at least nine fields and at most 50 fields. The Jaccard measure delivers the greatest maximum modularity. Ignoring edge weights makes within- and between-part connections harder to separate, leading to few communities being detected.

I identify the “representatives” of each community as the fields with the largest ratios of mean within-community similarity to mean between-community similarity. I transform these ratios by taking their natural logarithm in order to rein in the extreme values caused by near-zero divisors. The following bar chart presents the representatives of each community detected in the Jaccard similarity network.
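A minimal sketch of this calculation on a toy example. The similarity matrix `sim`, the community labels `membership`, and the function name `get_log_ratios` are hypothetical, and excluding each field's similarity with itself from the within-community mean is an assumption.

```r
# Hypothetical similarity matrix for five fields and their communities
sim <- matrix(c(1.0, 0.6, 0.5, 0.1, 0.1,
                0.6, 1.0, 0.4, 0.2, 0.1,
                0.5, 0.4, 1.0, 0.3, 0.3,
                0.1, 0.2, 0.3, 1.0, 0.7,
                0.1, 0.1, 0.3, 0.7, 1.0), nrow = 5, byrow = TRUE)
membership <- c(1, 1, 1, 2, 2)

# Log ratio of mean within- to mean between-community similarity
get_log_ratios <- function(sim, membership) {
  sapply(seq_along(membership), function(i) {
    same <- membership == membership[i]
    same[i] <- FALSE  # Exclude self-similarity (an assumption)
    within  <- mean(sim[i, same])
    between <- mean(sim[i, membership != membership[i]])
    log(within / between)
  })
}

log_ratios <- get_log_ratios(sim, membership)

# Representative of each community: the field with the largest log ratio
representatives <- tapply(seq_along(membership), membership,
                          function(idx) idx[which.max(log_ratios[idx])])
```

Here field 1 represents the first community (ratio 0.55/0.10) and field 5 represents the second (ratio 0.70 divided by a mean between-community similarity of about 0.17).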

Communities 2, 3, 4, 5, 7 and 8 appear to capture business, engineering, media, education, agriculture and biology-related fields. Communities 1 and 6 are less clearly classifiable.

The table below presents the demographic compositions of the eight communities detected in the Jaccard similarity network. Community 3 contains nearly 30% of degree fields but only about 20% of graduates, and is the most male-dominated among the eight communities detected. Community 5 is the most female-dominated and has the highest mean age. Educational attainment is lowest in communities 2 and 4, and highest in community 8.

Community	Fields	Total graduates (millions)	Mean graduate age	% of graduates female	% of graduates with post-graduate degree
1	28	19.8	48.4	64.6	39.7
2	11	14.4	47.8	41.6	26.5
3	50	14.1	47.6	28.4	42.9
4	18	9.6	45.1	60.8	27.6
5	16	6.5	54.7	76.7	49.0
6	18	3.6	43.4	70.8	33.6
7	17	2.1	47.3	35.4	33.8
8	15	1.7	45.6	54.6	51.7
Overall	173	71.8	47.9	52.7	36.7

College degrees in the US: Similarity measures

Sun, 14 Jul 2019 00:00:00 +0000

In my last post, I used the 2016 ACS PUMS data to analyse how educational attainment and degree field choices vary between demographic groups. I commented that the rates at which graduates pair fields together “provide insight into the intellectual connections between fields.” This post compares different ways of estimating the strength of such connections.

Field pair co-occurrences

The repository for this post contains the files observations.csv and fields.csv, which I import as follows.

library(readr)

data_url     <- 'https://raw.githubusercontent.com/bldavies/college-degrees/master/data/'
observations <- read_csv(paste0(data_url, 'observations.csv'))
fields       <- read_csv(paste0(data_url, 'fields.csv'))

observations aggregates the sample weights in the PUMS data by age, sex, and degree level and fields. I use these weights to construct a field pair co-occurrence matrix C:

library(dplyr)

C <- observations %>%
  # Aggregate sample weights by field pair
  filter(level > 0) %>%
  mutate(field2 = ifelse(is.na(field2), field1, field2)) %>%
  count(field1, field2, wt = weight) %>%
  mutate(n = n / 2) %>%
  # Identify weighted field-respondent pairs
  mutate(respondent = row_number()) %>%
  tidyr::gather(key, field, field1, field2) %>%
  count(field, respondent, wt = n) %>%
  # Count field pair co-occurrences
  widyr::pairwise_count(field, respondent, wt = n, diag = TRUE) %>%
  # Cast to matrix
  reshape2::acast(item1 ~ item2, value.var = 'n', fill = 0)

The diagonal elements of C estimate the total number of graduates with degrees in each field, while the off-diagonal elements estimate the number of graduates that chose each degree field pair. For example, the elements of the leading submatrix

C[1:5, 1:5]

##          1100     1101    1102     1103    1104
## 1100 181555.0    128.0     0.0    163.5     0.0
## 1101    128.0 124979.0   647.5    971.0   196.5
## 1102      0.0    647.5 47352.5    521.5     0.0
## 1103    163.5    971.0   521.5 173097.0   261.5
## 1104      0.0    196.5     0.0    261.5 46670.0

provide estimates for the degree fields listed in the first five rows of fields:

head(fields, 5)

## # A tibble: 5 x 2
##   field field_desc                           
##   <dbl> <chr>                                
## 1  1100 General Agriculture                  
## 2  1101 Agriculture Production And Management
## 3  1102 Agricultural Economics               
## 4  1103 Animal Sciences                      
## 5  1104 Food Science

About 125,000 graduates hold degrees in Agriculture Production And Management, nearly 1,000 of whom also hold degrees in Animal Sciences. Agricultural Economics attracts about as many graduates as Food Science, but no respondents in the PUMS data reported studying both.

Similarity measures

The diagonal elements of C estimate the “size,” in units of graduates, of each degree field. The distribution of field sizes is positively skewed, with the largest field being more than 30 times the size of the median field:

summary(diag(C))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10696   54731  142088  415163  407828 4275723

Using the elements of C to measure the strength of connections between fields may lead to biased inferences by, for example, making large fields with proportionally few graduates in common appear to have stronger connections than small fields with proportionally many graduates in common. One way to avoid such bias is to normalise each element $c_{ij}$ of C by the corresponding field sizes $s_i=c_{ii}$ and $s_j=c_{jj}$, thereby producing a scale-invariant “similarity” measure between pairs of degree fields.

Dividing $c_{ij}$ by the arithmetic mean $(s_i+s_j)/2$ yields the Dice coefficient $$\mathrm{Dice}(i,j) = \frac{2c_{ij}}{s_i+s_j},$$ while dividing $c_{ij}$ by the geometric mean $\sqrt{s_is_j}$ yields the Ochiai coefficient $$\mathrm{Ochiai}(i,j) = \frac{c_{ij}}{\sqrt{s_i\,s_j}}.$$ The Dice coefficient can be used to define the Jaccard index $$\begin{align} \mathrm{Jaccard}(i,j) &= \frac{c_{ij}}{s_i + s_j - c_{ij}} \\ &= \frac{\mathrm{Dice}(i,j)}{2 - \mathrm{Dice}(i,j)}, \end{align}$$ which is conceptually related to the overlap coefficient $$\mathrm{Overlap}(i,j) = \frac{c_{ij}}{\min(s_i, s_j)}$$ in that both capture the relative size of set intersections. These four similarity measures take values on the closed unit interval $[0,1]$, with more “similar” fields achieving values closer to unity. Indeed, one can show that $$\mathrm{Jaccard}(i,j) \le \mathrm{Dice}(i,j) \le \mathrm{Ochiai}(i,j) \le \mathrm{Overlap}(i,j) \le 1,$$ with the two inner inequalities holding with equality (when $c_{ij}>0$) if and only if $s_i=s_j$, and with all four inequalities holding with equality if and only if $s_i=s_j=c_{ij}$. Thus, two fields have unit similarity precisely when the sets of graduates with degrees in each field coincide.

I compute matrices of Dice, Jaccard, Ochiai and overlap similarities by defining

S <- matrix(rep(diag(C), nrow(C)), nrow = nrow(C))

and exploiting element-wise matrix operations:

dice_mat    <- 2 * C / (S + t(S))
jaccard_mat <- C / (S + t(S) - C)
ochiai_mat  <- C / sqrt(S * t(S))
overlap_mat <- C / pmin(S, t(S))
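As a quick check of these formulas, the sketch below applies them to the leading 5 × 5 submatrix of C printed earlier. Each similarity depends only on $c_{ij}$, $s_i$ and $s_j$, so the values agree with the corresponding entries of the full matrices. The names `C5`, `S5` and so on are introduced here for illustration only.

```r
# Leading 5 x 5 submatrix of C, copied from the output above
C5 <- matrix(c(181555.0,    128.0,     0.0,    163.5,     0.0,
                  128.0, 124979.0,   647.5,    971.0,   196.5,
                    0.0,    647.5, 47352.5,    521.5,     0.0,
                  163.5,    971.0,   521.5, 173097.0,   261.5,
                    0.0,    196.5,     0.0,    261.5, 46670.0),
             nrow = 5, byrow = TRUE)

# S5[i, j] holds the size of field i; t(S5)[i, j] holds the size of field j
S5 <- matrix(rep(diag(C5), nrow(C5)), nrow = nrow(C5))

dice5    <- 2 * C5 / (S5 + t(S5))
jaccard5 <- C5 / (S5 + t(S5) - C5)
ochiai5  <- C5 / sqrt(S5 * t(S5))
overlap5 <- C5 / pmin(S5, t(S5))

# Check the chain Jaccard <= Dice <= Ochiai <= Overlap <= 1 entry by entry
tol <- 1e-12
chain_holds <- all(jaccard5 <= dice5 + tol,
                   dice5 <= ochiai5 + tol,
                   ochiai5 <= overlap5 + tol,
                   overlap5 <= 1 + tol)

# Check the identity Jaccard = Dice / (2 - Dice)
identity_holds <- isTRUE(all.equal(jaccard5, dice5 / (2 - dice5)))
```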

Ordinal properties

One way to compare similarity measures is to compare how they rank fields from most to least similar. I do so using Kendall’s tau coefficient, which captures the extent to which two rankings agree on the relative positions of ranked entities. Kendall’s tau is defined as $$\tau(r_1,r_2) = \frac{2\times\text{Number of concordant pairs}}{\text{Number of pairs}} - 1,$$ where $r_1$ and $r_2$ are ranking functions, and where a pair $(x,y)$ of entities is “concordant” if $(r_1(x)-r_1(y))$ and $(r_2(x)-r_2(y))$ share the same sign. If every pair is concordant then $\tau(r_1,r_2)=1$ and if none are concordant then $\tau(r_1,r_2)=-1$. The more $r_1$ and $r_2$ agree on the relative positions of ranked entities, the greater is the number of concordant pairs and hence the larger is $\tau(r_1,r_2)$.

Rearranging the definition of $\tau(r_1,r_2)$ gives $$\Pr(\text{Pair is concordant}) = \frac{\tau(r_1, r_2) + 1}{2}.$$ Thus, computing Kendall’s tau for the rankings produced by each similarity measure, and mapping the results linearly to the unit interval, allows me to estimate the rates of agreement between different measures. I compute these rates as follows, excluding zero and unit similarities, and report the results as a matrix.

similarities <- tibble(
  Dice      = as.vector(dice_mat),
  Jaccard   = as.vector(jaccard_mat),
  Ochiai    = as.vector(ochiai_mat),
  Overlap   = as.vector(overlap_mat),
  `Co-occ.` = as.vector(C)  # Include for comparison
) %>%
  filter(as.vector(upper.tri(C) & C > 0))

similarities %>%
  cor(method = 'kendall') %>%
  {(. + 1) / 2} %>%  # Map to unit interval
  round(3)

##          Dice Jaccard Ochiai Overlap Co-occ.
## Dice    1.000   1.000  0.914   0.778   0.778
## Jaccard 1.000   1.000  0.914   0.778   0.778
## Ochiai  0.914   0.914  1.000   0.864   0.798
## Overlap 0.778   0.778  0.864   1.000   0.765
## Co-occ. 0.778   0.778  0.798   0.765   1.000

The Dice and Jaccard measures produce identical rankings, and both reach about 91% and 78% agreement with the rankings produced using the Ochiai and overlap measures, respectively. All four measures produce rankings that reach less than 80% agreement with the ranking produced using co-occurrence counts.
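The toy example below, with hypothetical rankings and no ties, confirms that mapping Kendall’s tau to the unit interval recovers the share of concordant pairs. It also shows why the Dice and Jaccard rankings coincide: the Jaccard index is a strictly increasing function of the Dice coefficient, so the two always order pairs identically.

```r
# Two hypothetical rankings of five entities
r1 <- c(1, 2, 3, 4, 5)
r2 <- c(1, 3, 2, 5, 4)

# Count concordant pairs directly
pairs <- combn(length(r1), 2)
concordant <- sign(r1[pairs[1, ]] - r1[pairs[2, ]]) ==
  sign(r2[pairs[1, ]] - r2[pairs[2, ]])
rate_direct <- mean(concordant)

# Recover the same rate from Kendall's tau
tau <- cor(r1, r2, method = 'kendall')
rate_tau <- (tau + 1) / 2

# Hypothetical Dice values and the Jaccard values they imply
dice <- c(0.05, 0.20, 0.10, 0.60, 0.35)
tau_dice_jaccard <- cor(dice, dice / (2 - dice), method = 'kendall')
```

Eight of the ten pairs are concordant, so both calculations give an agreement rate of 0.8, and the Dice and Jaccard values have a tau of one.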

The following table presents the 10 most similar field pairs using the Dice and Jaccard measures, and those pairs’ ranks using the Ochiai, overlap and co-occurrence measures.

Field 1	Field 2	Dice/Jacc. rank	Ochiai rank	Overlap rank	Co-occ. rank
Plant Science And Agronomy	Soil Science	1	1	1	127
Mathematics Teacher Education	Science And Computer Teacher Education	2	3	15	66
Biochemical Sciences	Molecular Biology	3	2	5	56
Ecology	Miscellaneous Biology	4	4	21	146
Mathematics	Physics	5	5	8	11
Political Science And Government	History	6	8	48	2
Journalism	Mass Media	7	9	30	26
Social Science Or History Teacher Education	Language And Drama Education	8	10	43	53
Accounting	Finance	9	12	32	1
Soil Science	Geosciences	10	14	53	1048

Plant Science And Agronomy and Soil Science top the rankings for all four similarity measures, despite being only the 127th most common field pair. Biochemical Sciences and Molecular Biology, and Mathematics and Physics, are the only other field pairs that rank in the top 10 most similar across all four measures. Accounting and Finance, the most common field pair, ranks among the 10 most similar pairs under the Dice and Jaccard measures only.

Network properties

Another way to compare similarity measures is to compare properties of the networks they define. Each similarity matrix defines a network in which nodes represent degree fields and in which edges have weight equal to the similarity between incident nodes.

library(igraph)

get_network <- function(adj_mat) {
  adj_mat %>%
    graph_from_adjacency_matrix(mode = 'undirected', weighted = TRUE) %>%
    simplify()  # Ignore self-similarities
}

coocc_net   <- get_network(C)
dice_net    <- get_network(dice_mat)
jaccard_net <- get_network(jaccard_mat)
ochiai_net  <- get_network(ochiai_mat)
overlap_net <- get_network(overlap_mat)

I compare similarity measures by comparing fields’ centralities in each network. I base my analysis on PageRank centrality for a variety of reasons:

Unlike degree-based centrality measures (e.g., degree and strength), PageRank considers the “importance” of each neighbour as well as neighbourhood size;
Unlike distance-based centrality measures (e.g., betweenness and closeness), PageRank doesn’t require solving a bunch of shortest path problems;
Unlike eigenvector centrality, PageRank doesn’t require the underlying network to be strongly connected.¹
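To illustrate the last point, the sketch below computes PageRank centralities on a small hypothetical network with two components, using igraph's `page_rank` with its default settings.

```r
library(igraph)

# Hypothetical disconnected network: a triangle plus a separate dyad
g <- disjoint_union(make_full_graph(3), make_full_graph(2))

connected <- is_connected(g)   # FALSE: there are two components
pr <- page_rank(g)$vector      # PageRank is still defined for every node
```

Every node receives a positive score and the scores sum to one, even though no path joins the two components.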

I store degree fields’ PageRank centralities as a tibble

pageranks <- tibble(
  Dice      = page_rank(dice_net)$vector,
  Jaccard   = page_rank(jaccard_net)$vector,
  Ochiai    = page_rank(ochiai_net)$vector,
  Overlap   = page_rank(overlap_net)$vector,
  `Co-occ.` = page_rank(coocc_net)$vector
)

and compute the corresponding matrix of Kendall’s tau coefficients, each mapped linearly to the unit interval:

pageranks %>%
  cor(method = 'kendall') %>%
  {(. + 1) / 2} %>%
  round(3)

##          Dice Jaccard Ochiai Overlap Co-occ.
## Dice    1.000   0.999  0.949   0.819   0.824
## Jaccard 0.999   1.000  0.949   0.819   0.823
## Ochiai  0.949   0.949  1.000   0.869   0.839
## Overlap 0.819   0.819  0.869   1.000   0.791
## Co-occ. 0.824   0.823  0.839   0.791   1.000

The rankings of fields from most to least PageRank-central under the Dice and Jaccard measures are almost identical, and reach just over 82% agreement with the ranking produced using co-occurrence counts.

The table below presents the 10 most PageRank-central fields using the Dice measure, and the corresponding ranks using the Jaccard, Ochiai, overlap and co-occurrence measures. The column “Size rank” orders each field from largest to smallest.

Field	Dice rank	Jaccard rank	Ochiai rank	Overlap rank	Co-occ. rank	Size rank
French German Latin And Other Common Foreign Language Studies	1	1	1	9	15	35
Mathematics	2	2	2	6	10	22
Political Science And Government	3	3	3	5	5	10
Mass Media	4	5	11	23	28	50
Molecular Biology	5	4	13	26	53	113
English Language And Literature	6	6	4	4	3	9
History	7	7	9	10	9	15
Economics	8	8	7	7	8	14
Psychology	9	9	5	3	1	3
Sociology	10	10	10	13	12	19

Languages, Mathematics, and Political Science And Government are the most PageRank-central fields under the Dice, Jaccard and Ochiai measures. The Ochiai and overlap measures rank Mass Media and Molecular Biology relatively low on PageRank centrality, possibly due to those fields’ relatively small size. The PageRank centralities produced using co-occurrence counts appear to correlate positively with field size, consistent with my worry that such counts may bias the measurement of intellectual connectedness in favour of larger fields.

Ryan Tibshirani provides excellent notes on how PageRank handles disconnected components and “dangling” nodes. ↩︎

College degrees in the US: Demographics

Mon, 01 Jul 2019 00:00:00 +0000

Each year, the US Census Bureau publishes a set of Public Use Microdata Sample (PUMS) files containing responses to the American Community Survey (ACS). In this post, I use the 2016 ACS PUMS data to explore the variation in educational attainment and degree field choices between demographic groups. The source data are available on GitHub.

Educational attainment

The table below reports educational attainment rates for each sex, pooled across all ages and degree fields. Overall, a randomly selected female is more likely to have a college degree than a randomly selected male. However, fewer females pursue doctoral degrees than males; based on the table, male graduates are about 1.6 times as likely to have a doctorate as female graduates (1.10/21.38 ≈ 5.1% versus 0.75/23.05 ≈ 3.3%).

Degree level	% of females	% of males
No college degree	76.95	78.62
Bachelor’s degree	14.66	13.46
Professional or Master’s degree	7.64	6.82
Doctoral degree	0.75	1.10

Pooling across all ages masks variation in educational attainment rates between age groups. I present this variation in the line chart below, which compares educational attainment by age and sex. The chart presents mean age group shares over a rolling five-year window, muting some of the noise in attainment rates caused by random fluctuations between consecutive years of age.
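A minimal sketch of the smoothing step, assuming a centred window that averages each age with the two ages on either side (the exact alignment behind the chart is an assumption). The rates are made up.

```r
# Hypothetical attainment rates (%) at seven consecutive ages
rates <- c(30, 34, 29, 33, 31, 35, 32)

# Centred five-year rolling mean; NA where the window is incomplete.
# stats::filter is called explicitly because dplyr masks filter().
rolling <- as.numeric(stats::filter(rates, rep(1 / 5, 5), sides = 2))
```

The first and last two values are missing because their windows extend beyond the data, and the middle three values are 31.4, 32.4 and 32.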

Young females have higher educational attainment rates than young males, but the decline in such rates with age is steeper among females than males. Both sexes experience a spike in attainment between the ages of 60 and 70, corresponding to graduation dates during the late 1960s and early 1970s. This spike could be due to the Higher Education Act of 1965, which “strengthen[ed] the educational resources of [US] colleges and universities” and “provide[d] financial assistance for students in post-secondary and higher education.” The spike is most apparent among males.

Differences in educational attainment could reflect differences in degree field choices. For example, to the extent that (i) there are more male science graduates than female science graduates, and (ii) science graduates tend to pursue doctoral degrees more often than non-science graduates, we would expect to see more doctorates among males than females. If field selection is the only source of differences in educational attainment then there should be no difference in the within-field shares of male and female graduates with post-graduate degrees. I compare such shares in the scatterplots below, in which points correspond to degree fields and have radii proportional to the number of graduates in each field.

The gap between the OLS fitted lines and 45-degree reference lines implies that, on average, male graduates are more likely to hold post-graduate degrees than female graduates in the same field. This discrepancy appears to be larger for doctorates than for other post-graduate degrees.

Degree fields

The bar chart below plots the eight most common degree fields among male and female graduates. Both business and accounting rank among the most common fields for graduates of each sex. Nursing and education are more common among females, while computer science and engineering are more common among males.

The frequency at which people graduate with degrees in different fields may vary over time due to changes in social preferences or labour market conditions. The line chart below plots the shares of graduates who studied electrical engineering or psychology, stratified by age and sex. The chart presents mean age group shares over a rolling five-year window.

The trough in male electrical engineering graduates and spike in psychology graduates between the ages of 60 and 70 both coincide with the spike in educational attainment following the Higher Education Act of 1965. The Act may have encouraged males to substitute from electrical engineering (or from not studying) to psychology by changing the relative benefits and costs of becoming qualified in each field. For example, increasing access to federal loans may have encouraged students to pursue degrees with less certain job prospects by delaying the private burden of paying tuition.

The PUMS data report up to two degree fields for each respondent, allowing me to estimate the frequency of field pairings within the US population. For example, the bar chart below shows the fields most frequently paired with economics and mathematics among graduates of each sex. Male economics graduates appear to make similar pairing choices to female economics graduates. Males pair mathematics with physics about as often as with computer science, while females do so only about half as often.

Field pair frequencies provide insight into the intellectual connections between fields. Such connections may reflect fields using similar techniques (e.g., economics and finance) or providing complementary skills (e.g., mathematics and computer science). I explore those connections here and here.

Reading the ministerial diaries

Wed, 12 Jun 2019 00:00:00 +0000

In December 2018, the New Zealand Government announced that its ministers “will for the first time release details of their internal and external meetings.” The Government has since published these “ministerial diaries” as a series of PDFs. In this post, I analyse the ministerial diary of David Parker, a “pivotal cabinet minister” who wears a range of politically and economically significant hats:

Attorney-General;
Minister of Economic Development;
Minister for the Environment;
Minister of Trade and Export Growth;
Associate Minister of Finance.

These roles, coupled with his scheduled activities for the 2018 calendar year being available in a single, consistently formatted table, make Minister Parker’s diary (hereafter “the diary”) an interesting and relatively painless document to analyse.

Parsing the data

I read the diary into R using the pdf_data function from pdftools:

library(pdftools)

path <- "https://www.beehive.govt.nz/sites/default/files/2019-05/October%202017%20-%20December%202018_0.pdf"
pages <- pdf_data(path)

pdf_data scans each page for distinct words, encloses these words in bounding boxes, and stores the coordinates and content of each box as a list of tibbles. For example, the diary’s first page contains the following data:

library(dplyr)

pages[[1]]

## # A tibble: 336 x 6
##    width height     x     y space text    
##    <int>  <int> <int> <int> <lgl> <chr>   
##  1    46     20    72    75 TRUE  David   
##  2    52     20   122    75 TRUE  Parker  
##  3    42     20   179    75 TRUE  Diary   
##  4    77     20   226    75 FALSE Summary 
##  5    11     11    72   102 TRUE  26      
##  6    36     11    85   102 TRUE  October 
##  7    22     11   124   102 TRUE  2017    
##  8     3     11   149   102 TRUE  -       
##  9    11     11   155   102 TRUE  31      
## 10    46     11   168   102 TRUE  December
## # … with 326 more rows

The x and y columns provide the horizontal and vertical displacement, in pixels, of each bounding box from the top-left corner of the page. The left-most boxes sit 72 pixels from the left page boundary, allowing me to identify table rows by the cumulative number of boxes for which x equals 72.

pages[[1]] %>%
  arrange(y, x) %>%
  mutate(row = cumsum(x == 72)) %>%
  filter(cumsum(x == 72 & text == "Date") > 0)  # Remove preamble

## # A tibble: 91 x 7
##    width height     x     y space text         row
##    <int>  <int> <int> <int> <lgl> <chr>      <int>
##  1    21     11    72   355 FALSE Date          14
##  2    46     11   149   355 TRUE  Scheduled     14
##  3    22     11   198   355 FALSE Time          14
##  4    37     11   235   355 FALSE Meeting       14
##  5    38     11   390   355 FALSE Location      14
##  6    21     11   504   355 FALSE With          14
##  7    39     11   630   355 FALSE Portfolio     14
##  8    53     11    72   382 FALSE 26/10/2017    15
##  9    25     11   149   382 TRUE  11:00         15
## 10     3     11   177   382 TRUE  -             15
## # … with 81 more rows

The x values for which row equals 14 provide the left alignment points for the text in each of the diary’s six columns. These points remain unchanged across all 84 pages, allowing me to identify rows and columns throughout the diary within a single pipe:

library(tidyr)

# Define column names and left alignment points
columns <- tibble(
  left_x = c(72, 149, 235, 390, 504, 630),
  name = c("date", "scheduled_time", "meeting", "location", "with", "portfolio")
)

# Identify page numbers
for (i in seq_along(pages)) pages[[i]]$page <- i

# Process data
diary <- bind_rows(pages) %>%
  # Identify table rows
  arrange(page, y, x) %>%
  mutate(row = cumsum(x == columns$left_x[1])) %>%
  filter(cumsum(x == columns$left_x[1] & text == "Date") == 1) %>%
  filter(row > min(row)) %>%  # Remove header row
  # Identify table columns
  mutate(column = sapply(x, function(x){max(which(columns$left_x <= x))}),
         column = columns$name[column]) %>%
  # Concatenate text within table cells
  group_by(row, column) %>%
  summarise(text = paste(text, collapse = " ")) %>%
  ungroup() %>%
  # Clean data
  clean_data() %>%
  # Convert to wide format
  mutate(column = factor(column, levels = columns$name)) %>%
  spread(column, text) %>%
  select(-row)

I define the clean_data function in the appendix below.
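The row and column logic in the pipe above can be checked on a handful of words. The sketch below uses base R only and a made-up words data frame: the x positions echo the pdf_data output shown earlier, but the second table row is invented for illustration. It swaps max(which(...)) for findInterval, which returns the same index whenever x is at least the first left alignment point.

```r
# Left alignment points and names, as in the `columns` tibble above
left_x <- c(72, 149, 235, 390, 504, 630)
col_names <- c("date", "scheduled_time", "meeting", "location", "with", "portfolio")

# Toy input (hypothetical): a few words with their x/y positions, out of order
words <- data.frame(
  x = c(149, 72, 177, 235, 72, 149),
  y = c(382, 382, 382, 382, 409, 409),
  text = c("11:00", "26/10/2017", "-", "Meeting", "27/10/2017", "09:00"),
  stringsAsFactors = FALSE
)

# Sort top-to-bottom, left-to-right; a new row starts at each word with x == 72
words <- words[order(words$y, words$x), ]
words$row <- cumsum(words$x == left_x[1])

# Assign each word to the right-most column whose left edge is at or before x
words$column <- col_names[findInterval(words$x, left_x)]

# Concatenate the words within each table cell (empty cells become NA)
cells <- tapply(words$text, list(words$row, words$column), paste, collapse = " ")
cells
```

Words at x = 149 and x = 177 both fall in the scheduled_time column, so they end up pasted into the same cell.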

The resulting tibble diary contains 1,553 rows, each of which describes a unique entry scheduled between October 2017 and December 2018. I select entries scheduled during the 2018 calendar year:

(data <- filter(diary, grepl("2018", date)))

## # A tibble: 1,347 x 6
##    date    scheduled_time meeting          location    with         portfolio   
##    <chr>   <chr>          <chr>            <chr>       <chr>        <chr>       
##  1 15/01/… 10:00 - 11:00  Meeting with Fi… Beehive     Treasury of… Associate F…
##  2 15/01/… 14:00 - 14:30  Meeting with MF… Beehive     MFAT offici… Trade and E…
##  3 15/01/… 15:00 - 15:30  Meeting with MB… Beehive     MBIE offici… Economic De…
##  4 16/01/… 09:30 - 10:15  Meeting with En… Selwyn      Environment… Environment 
##  5 16/01/… 10:40 - 11:40  Meeting with Ng… Springston  Ngai Tahu r… Environment 
##  6 16/01/… 12:00 - 12:30  Meeting with fa… Canterbury  Farm owners… Environment 
##  7 16/01/… 12:40 - 13:40  Working Lunch w… Canterbury  Te Waihora … Environment 
##  8 16/01/… 13:50 - 14:45  Meeting with fa… Leeston     Farm owners… Environment 
##  9 16/01/… 16:30 - 17:30  Meeting with Sy… Middleton,… Syft Techon… Economic De…
## 10 17/01/… 09:30 - 10:00  Meeting with Ca… Beehive     Cabinet Off… All         
## # … with 1,337 more rows

According to the official disclaimer, the diary excludes personal and party political meetings, along with details published elsewhere such as time spent in the House of Representatives. Moreover, some details are withheld under various sections of the Official Information Act. I assume that the remaining entries provide a representative sample of Minister Parker’s ministerial activities.

Analysing word frequencies

I analyse the frequency of words used in the with column of data. These frequencies provide insight into Minister Parker’s interactions with different organisations. I use the unnest_tokens function from tidytext to identify unique words and the count function from dplyr to count word frequencies.

library(tidytext)

data %>%
  unnest_tokens(word, with) %>%
  anti_join(get_stopwords()) %>%  # Remove stop words
  count(word, sort = TRUE)

## # A tibble: 674 x 2
##    word          n
##    <chr>     <int>
##  1 attending   290
##  2 officials   272
##  3 minister    198
##  4 ministers   108
##  5 mfe          89
##  6 mbie         82
##  7 jones        76
##  8 sage         58
##  9 twyford      56
## 10 mfat         53
## # … with 664 more rows

The most frequent word, “attending,” reflects cabinet meetings, media briefings and other general ministerial duties. The next most frequent word, “officials,” reflects Minister Parker’s meetings with the Ministry for the Environment (MfE), the Ministry of Business, Innovation and Employment (MBIE), and the Ministry of Foreign Affairs and Trade (MFAT), along with other government departments. Both “minister” and “ministers” reflect meetings with Ministers Jones, Sage, Twyford and others.

Computing tf-idf scores

Counting word frequencies across all portfolios masks portfolio-specific interactions. I infer such interactions from the term frequency-inverse document frequency (tf-idf) scores of word-portfolio pairs. I identify these pairs as follows.

word_portfolio_pairs <- data %>%
  # Disambiguate portfolio names
  mutate(portfolio = gsub("Att.*?ral|AG", "Attorney-General", portfolio)) %>%
  # Split entries with multiple portfolios
  mutate(portfolio = gsub("[^[:alpha:] -]", "&", portfolio),
         portfolio = strsplit(portfolio, "&")) %>%
  unnest(portfolio) %>%
  mutate(portfolio = trimws(portfolio)) %>%
  # Identify word-portfolio pairs
  filter(!is.na(portfolio)) %>%
  unnest_tokens(word, with) %>%
  select(word, portfolio)

tf-idf scores measure the “importance” of words in each document in a corpus. The term frequency

$$\mathrm{tf}(w, d)=\frac{\text{Number of occurrences of word}\ w\ \text{in document}\ d}{\text{Number of words in document}\ d}$$

measures the rate at which word $w$ occurs in a document $d$, while the inverse document frequency

$$\mathrm{idf}(w) = -\ln\left(\frac{\text{Number of documents containing word}\ w}{\text{Number of documents}}\right)$$

provides a normalisation factor that penalises ubiquitous words. The tf-idf score

$$\text{tf-idf}(w,d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)$$

thus measures the prevalence of word $w$ in document $d$, normalised by that word’s prevalence in other documents. I interpret the set of entries associated with each portfolio as a document and use the bind_tf_idf function from tidytext to compute word-portfolio tf-idf scores:

word_portfolio_pairs %>%
  count(word, portfolio) %>%
  bind_tf_idf(word, portfolio, n)

## # A tibble: 1,066 x 6
##    word        portfolio                   n       tf   idf   tf_idf
##    <chr>       <chr>                   <int>    <dbl> <dbl>    <dbl>
##  1 a           Associate Finance           1 0.00285  0.693 0.00197 
##  2 a           Environment                 1 0.000739 0.693 0.000512
##  3 a           Trade and Export Growth     2 0.00277  0.693 0.00192 
##  4 accelerator Economic Development        1 0.00137  1.79  0.00245 
##  5 acting      Attorney-General            1 0.00215  1.79  0.00385 
##  6 action      Trade and Export Growth     1 0.00139  1.79  0.00248 
##  7 adrian      Economic Development        1 0.00137  1.79  0.00245 
##  8 advisory    Economic Development        2 0.00273  0.693 0.00189 
##  9 advisory    Environment                 1 0.000739 0.693 0.000512
## 10 advisory    Trade and Export Growth     1 0.00139  0.693 0.000960
## # … with 1,056 more rows

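Low values in the idf column identify both language-specific stop words (e.g., “a”) and context-specific stop words (e.g., “advisory”) that are common across portfolios. Both examples appear in three of the six portfolios, which gives an idf of ln 2 ≈ 0.693.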
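The formulas above are easy to verify by hand. The following base R sketch computes tf, idf and tf-idf for a toy table of word counts; the words, portfolio labels and counts are invented for illustration, and the code is a manual restatement of the formulas rather than tidytext's implementation.

```r
# Toy word counts per portfolio (hypothetical values)
counts <- data.frame(
  word = c("officials", "officials", "officials", "mbie", "sage", "sage"),
  portfolio = c("A", "B", "C", "A", "B", "C"),
  n = c(3, 2, 1, 1, 2, 1),
  stringsAsFactors = FALSE
)

# Term frequency: share of each portfolio's words accounted for by each word
words_per_doc <- tapply(counts$n, counts$portfolio, sum)
counts$tf <- counts$n / as.numeric(words_per_doc[counts$portfolio])

# Inverse document frequency: minus the log share of portfolios containing the word
n_docs <- length(unique(counts$portfolio))
docs_per_word <- table(counts$word)
counts$idf <- -log(as.numeric(docs_per_word[counts$word]) / n_docs)

# tf-idf: term frequency scaled by inverse document frequency
counts$tf_idf <- counts$tf * counts$idf
counts
```

In this toy corpus “officials” appears in every portfolio, so its idf (and hence its tf-idf) is zero, while “mbie” appears in only one portfolio and receives the largest idf.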

The chart below presents the highest tf-idf words for each portfolio. These words reveal organisations (e.g., the Parliamentary Counsel Office) and individuals (e.g., Cecilia Malmström) that are missing from the diary-wide word frequencies computed above.

The chart also reveals which interactions correspond to which portfolios. For example, Minister Parker’s frequent interactions with MBIE officials appear to be most associated with the Economic Development portfolio, while his interactions with Minister Sage appear to involve both the Environment and Associate Finance portfolios. (Minister Sage’s diary suggests that such cross-portfolio interactions relate to the Overseas Investment Office, for which Ministers Parker and Sage are jointly responsible.)

Acknowledgements

The pdftools 2.0 release notes helped me interpret pdf_data's output. Julia Silge and David Robinson’s book Text Mining with R provided useful background reading, especially the chapter on tf-idf scores.

Appendix

Source code for `clean_data()`

clean_data <- function (df) {
  df %>%
    # Replace non-ASCII characters with ASCII equivalents
    mutate(text = iconv(text, "", "ASCII", sub = "byte"),
           text = gsub("<c3><a7>", "c", text),
           text = gsub("<c3><a9>", "e", text),
           text = gsub("<c3><b1>", "n", text),
           text = gsub("<c4><81>", "a", text),
           text = gsub("<c5><ab>", "u", text),
           text = gsub("<e2><80><93>", "-", text),
           text = gsub("<e2><80><99>", "'", text),
           text = gsub("<e2><80><9c>|<e2><80><9d>", "\"", text)) %>%
    # Fix line-broken date ranges
    spread(column, text) %>%
    mutate(split_date = is.na(scheduled_time) & grepl("-", paste(date, lag(date))),
           row = cumsum(!split_date)) %>%
    select(-split_date) %>%
    gather(column, text, -row) %>%
    group_by(row, column) %>%
    summarise(text = gsub("NA", "", paste(text, collapse = " "))) %>%
    ungroup() %>%
    mutate(text = trimws(text),
           text = ifelse(text == "", NA, text)) %>%
    # Fix transcription errors
    mutate(text = gsub("Minster", "Minister", text),
           text = ifelse(column == "portfolio" & text == "Minister Little", "Attorney-General", text))
}

Relatedness, complexity and local growth

Tue, 02 Apr 2019 00:00:00 +0000

I recently wrote an article for Asymmetric Information summarising my paper with Dave Maré on the relatedness and complexity of economic activities in New Zealand. The full text for that article is quoted below.

Introduction

Current European regional policy encourages regions to build on their strengths by diversifying into activities that draw upon existing knowledge bases. This “smart specialisation” approach encourages entrepreneurship, innovation and long-term growth by fostering local interactions between workers with complementary knowledge and skills.

Balland et al. (2018) define a framework for analysing smart specialisation using the ideas of relatedness and complexity. Expanding into activities that are related to existing specialisations carries low growth risk because local workers already possess the knowledge and skills needed to conduct those activities. Expanding into complex activities delivers the highest expected economic returns because such activities “form the basis for long-run competitive advantage.” Balland et al.’s framework identifies low-risk, high-return development opportunities as locally under-represented activities with high local relatedness and high complexity.

We examine the contribution of relatedness and complexity to urban employment growth in New Zealand. This allows us to evaluate the efficacy of implementing smart specialisation policies in New Zealand by identifying whether the associated mechanisms appear to influence employment dynamics.

Data and methods

Our analysis uses historical New Zealand census data aligned to current industry, occupation and urban area codes. We select 50 “cities” (urban areas) and 200 “activities” (industry-occupation pairs) with persistently high employment in census years 1981, 1991, 2001 and 2013. Our selected activities span 61 industries and nine occupations.

We recognise activities as being “related” if they require similar inputs. We infer such similarities from employee co-location patterns. These patterns reveal firms’ shared preferences for using spatially heterogeneous resources, which encourage firms engaged in related activities to co-locate in order to benefit from agglomeration economies.

We measure activities’ relatedness using weighted correlations of local employment shares. Our approach extends discrete measures used in previous studies by recognising variation in the extent of local specialisation and by adjusting for differences in employment data quality between geographic areas.

We recognise activities as being “complex” if they rely on specialised combinations of complementary inputs. For example, consulting is more complex than lecturing because consultants need local clients while lecturers do not rely as much on other activities being present locally.

We define activity complexity using the second eigenvector of the row-standardised activity relatedness matrix. Our approach generalises Caldarelli et al.’s (2012) eigenvector approximation of Hidalgo and Hausmann’s (2009) Method of Reflections. We use a similar approach, applied to the transpose of the city-activity employment matrix, to estimate city complexity.
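As a rough illustration of the eigenvector idea, the base R sketch below row-standardises a toy relatedness matrix and extracts its second eigenvector. The matrix values are invented, and the paper's actual procedure (including its weighting and sign conventions) may differ.

```r
# Toy relatedness matrix (hypothetical values): activities 1-2 and 3-4 form two clusters
S <- matrix(c(1.0, 0.8, 0.1, 0.1,
              0.8, 1.0, 0.1, 0.1,
              0.1, 0.1, 1.0, 0.8,
              0.1, 0.1, 0.8, 1.0), nrow = 4, byrow = TRUE)

# Row-standardise so that each row sums to one
P <- S / rowSums(S)

# Eigenvalues are returned in decreasing order (of modulus, for non-symmetric matrices)
e <- eigen(P)
e$values

# Second eigenvector; its overall sign is arbitrary
complexity <- Re(e$vectors[, 2])
complexity
```

The leading eigenvalue of a row-standardised matrix with non-negative entries equals one and its eigenvector is constant, so it carries no information. In this toy example the second eigenvector takes opposite signs on the two clusters.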

Mapping relatedness

We define an “activity space” that captures the network structure of activities based on our relatedness estimates. We describe activity space by a weighted network in which nodes correspond to activities and in which edges have weight equal to the relatedness between pairs of activities. The subnetwork induced by the 500 edges of largest weight is shown below, with nodes coloured by occupation.

At the centre of our map is a tightly connected, nest-shaped cluster of low-skill occupations in the distributive services sector. To the right of this cluster is a group of medium- to low-skill occupations in the construction, retail and healthcare sectors. These activities are ubiquitous and appear together as local relative specialisations in smaller, less diverse cities. In contrast, the lower wing of our network map comprises a cluster of high-skill occupations in the professional and information service sectors, which tend to concentrate in large cities and to have higher levels of complexity.

Do relatedness and complexity predict employment growth?

More complex activities grew faster during our period of study. On average and holding local relatedness constant at its weighted mean value, a one standard deviation increase in activity complexity is associated with a 0.89 percentage point increase in local employment growth per year. This effect rises to 0.98 percentage points when we control for city complexity. More locally related activities experienced slower growth, especially in complex cities.

Balland et al.’s (2018) framework suggests that complex activities with high local relatedness offer the strongest prospects for future growth. If this were true then we would expect a strong positive coefficient on the interaction of local relatedness and activity complexity. Our estimates show only a weak and insignificant interaction.

Relatedness appears to promote growth only in the largest and most complex cities. This result is consistent with the idea that cities are dense networks of interacting activities: the benefits of such interaction are more apparent in larger cities, where workers and firms engaged in related activities interact more frequently.

Conclusion

Complex activities grew faster during our period of study, especially in complex cities. However, this growth was not significantly stronger in cities more dense with related activities. Overall, we do not identify strong effects of relatedness and complexity on growth in local activity employment. It remains an open question whether the effects do not operate or whether New Zealand cities lack the scale for such operation.

Further details are available in Motu Working Paper 19-01.

Accessing the Strava API with R

Sun, 06 Jan 2019 00:00:00 +0000

Strava is an online platform for storing and sharing fitness data. Strava provides an API for accessing such data at the activity (e.g., run or cycle) level. This post explains how I authenticate with, and extract data from, the Strava API using R. I implement my method in the R package stravadata.

Setup and authentication

Strava uses OAuth 2.0 to authorise access to the API data. The first step to becoming authorised is to register for access on Strava’s API settings page. I put “localhost” in the “Authorization Callback Domain” field. Upon completing the registration form, the page provides two important values: an integer client ID and an alpha-numeric client secret. I store these values in credentials.yaml, which I structure as

client_id: xxxxxxxxx
secret: xxxxxxxxx

and import into R using the read_yaml function from the yaml package.
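A minimal sketch of that import step follows. It writes a placeholder file to a temporary path so that it runs anywhere; the client ID and secret shown are made up, and in practice the path would point at the real credentials.yaml.

```r
library(yaml)

# Write a placeholder credentials file (values are made up for illustration)
path <- tempfile(fileext = ".yaml")
writeLines(c("client_id: 12345", "secret: abc123def456"), path)

# Parse the file into a named list
credentials <- read_yaml(path)
credentials$client_id  # client ID, parsed as an integer
credentials$secret     # client secret, parsed as a string
```

Keeping the credentials in a separate file makes it easy to exclude them from version control.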

Next, I create an OAuth application for interacting with the API and an endpoint through which to send authentication requests. I use the oauth_app and oauth_endpoint functions from httr:

library(httr)

app <- oauth_app("strava", credentials$client_id, credentials$secret)
endpoint <- oauth_endpoint(
  request = NULL,
  authorize = "https://www.strava.com/oauth/authorize",
  access = "https://www.strava.com/oauth/token"
)

Finally, I create an OAuth access token to send the authentication request to my Strava account. This token encapsulates the application and endpoint defined above. Running¹

token <- oauth2.0_token(endpoint, app, as_header = FALSE,
                        scope = "activity:read_all")

opens a browser window at a web page for accepting the authentication request. Doing so redirects me to the callback domain (“localhost”) and prints a confirmation message:

Authentication complete. Please close this page and return to R.

Extracting the data

After authenticating with Strava, I use HTTP requests to extract activity data from the API. The API returns multiple pages of data, each containing up to 200 activities. I use a while loop to iterate over pages, using the fromJSON function from jsonlite to parse the extracted data:

library(jsonlite)

df_list <- list()
i <- 1
done <- FALSE
while (!done) {
  req <- GET(
    url = "https://www.strava.com/api/v3/athlete/activities",
    config = token,
    query = list(per_page = 200, page = i)
  )
  df_list[[i]] <- fromJSON(content(req, as = "text"), flatten = TRUE)
  if (length(content(req)) < 200) {
    done <- TRUE
  } else {
    i <- i + 1
  }
}

Finally, I use the rbind_pages function from jsonlite to collate the activity data into a single data frame:

df <- rbind_pages(df_list)
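The stopping rule in the loop above can be tested without touching the API. The sketch below replaces the GET request with a hypothetical fetch_page function that serves 450 fake activities in pages of up to 200, and uses base R's rbind in place of rbind_pages.

```r
# Hypothetical stand-in for the API: 450 fake activities served in pages
fake_activities <- data.frame(id = 1:450)

fetch_page <- function(page, per_page = 200) {
  start <- (page - 1) * per_page + 1
  if (start > nrow(fake_activities)) {
    return(fake_activities[0, , drop = FALSE])  # empty page
  }
  end <- min(page * per_page, nrow(fake_activities))
  fake_activities[start:end, , drop = FALSE]
}

# Request pages until one comes back with fewer than 200 activities
pages <- list()
i <- 1
repeat {
  pages[[i]] <- fetch_page(i)
  if (nrow(pages[[i]]) < 200) break
  i <- i + 1
}

activities <- do.call(rbind, pages)
nrow(activities)
```

If the number of activities were an exact multiple of 200, the loop would make one extra request and receive an empty page before stopping.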

Strava’s OAuth update in October 2019 made scope specification a requirement. ↩︎

Guest appearances on The Joe Rogan Experience

Wed, 26 Sep 2018 00:00:00 +0000

The Joe Rogan Experience (JRE) is a podcast hosted by comedian and mixed martial arts (MMA) commentator Joe Rogan. In this post, I analyse the relationship between JRE guest appearances and popularity using data from Google Trends. I find that guests typically experience a spike in popularity immediately after appearing on the podcast.

The data used in my analysis are available here.

Collecting the data

I scrape the JRE podcast directory for a list of episode dates, numbers and titles. The directory comprises a multi-page table that is dynamically updated using HTTP requests. I use this method to emulate such requests, allowing me to iterate over table pages and extract the raw episode metadata. I clean these data by

removing non-standard episodes (such as MMA Shows and Fight Companions),
fixing any missing, incorrect or duplicate episode numbers, and
removing non-ASCII characters from episode titles.

The resulting file contains clean metadata for JRE episodes #1 through #1172. I use these data to create a list of guests that appear in each episode, making several manual adjustments that correct for inconsistent or missing guest names.¹

The bar chart below plots the number of episodes, unique guests and first appearances by year for 2010 through 2018. On average, the number of JRE episodes and guests increased each year, although the proportion of guests appearing on the show for the first time appears to be falling.

Estimating popularity

I infer guests’ popularity from Google Trends data on web searches in the United States. These data index the proportion of total Google search queries attributable to particular keywords. Google Trends provides data on a 0–100 scale, where 100 denotes the maximum search interest for the corresponding keyword in a given period and locale.²

I collect Google Trends data for each identified JRE guest and for Joe himself. My data provide weekly estimates of individuals’ online popularity for the five years beginning September 2013. I assume that these data are unbiased estimates of guests’ actual popularity.

The chart below plots Joe’s estimated popularity during my sample period. Web search interest for the phrase “Joe Rogan” more than doubled between September 2013 and September 2018. The spike during the first week of September 2018 marks JRE episode #1169 with Elon Musk.

Identifying popularity spikes

I align JRE guest appearance dates with my Google Trends data in order to determine whether such appearances coincide with popularity spikes. I identify spikes as large, sudden deviations in search interest from its mean value. I allow this mean to change over time by defining a moving average (MA) series, which I subtract from the actual interest series in order to construct a demeaned series that captures the idiosyncratic variation in guests’ popularity.³
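The demeaning step can be sketched in a few lines of base R. The weekly interest values below are invented (a linear trend with a single spike), and the centred window of seven weeks is my reading of the MA order of seven mentioned in the footnote.

```r
# Toy weekly search interest (hypothetical): a linear trend with a spike in week 6
interest <- c(10, 11, 12, 13, 14, 40, 16, 17, 18, 19, 20)

# Centred moving average of order seven (NA where the window is incomplete)
ma <- as.numeric(stats::filter(interest, rep(1 / 7, 7), sides = 2))

# Demeaned series: deviations of actual interest from its local mean
demeaned <- interest - ma

# Standardise to zero mean and unit variance over the available weeks
standardised <- (demeaned - mean(demeaned, na.rm = TRUE)) / sd(demeaned, na.rm = TRUE)

round(cbind(interest, ma, demeaned, standardised), 2)
```

The spike also pulls up the moving average in the surrounding weeks, so the demeaned values just before and after a spike sit below zero.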

For example, the chart below plots the actual, moving average and demeaned search interest series for Dave Rubin—political commentator and host of The Rubin Report—who appeared on The Joe Rogan Experience in the three weeks identified by the dashed vertical lines. Dave’s gradual rise in popularity since late 2015 is punctuated by three spikes in search interest that coincide with his JRE appearances.

I construct the demeaned search interest series for each guest who appears on The Joe Rogan Experience during my sample period. I standardise each of these series to have zero mean and unit variance across the entire sample period in order to make the series comparable. The distributions of guests’ standardised demeaned search interest in the weeks surrounding their appearances are shown below.

In the two weeks prior to appearing on The Joe Rogan Experience, guests’ popularities are centred about a standard deviation below their MA trend value, reflecting a rise in that value due to an impending upward shock. Appearances coincide with a shift in probability density towards positive deviations from local means. Traces of this shift disappear after about three weeks, at which time the distribution of standardised demeaned search interest mimics that observed five weeks prior. These dynamics suggest that, on average, JRE guests experience an increase in popularity during the week in which they appear on the podcast.

Detecting spikes in real-time

I obtain more rigorous results using this real-time spike detection algorithm. The algorithm builds a filtering series alongside the actual search interest series, and computes a rolling mean and standard deviation for the filtering series over the previous lag observations. Spikes correspond to values in the actual series that deviate from the filtering mean by some threshold number of standard deviations. A third parameter, influence, controls how sensitive the filtering series is to spikes.

The real-time algorithm defines a signal series that denotes super-threshold deviations above and below the filtering mean by 1 and -1, respectively, and sub-threshold deviations by 0. Positive signals identify spikes in search interest relative to recent trends. The rate at which such signals coincide with JRE guest appearances offers insight into whether such appearances herald popularity spikes.
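The sketch below is my reading of that description as a base R function. The name detect_spikes and the rule for updating the filtering series (a weighted average of the new observation and the previous filtered value) are assumptions; the linked algorithm may differ in its details.

```r
# Hypothetical implementation of the spike detection rule described above
detect_spikes <- function(y, lag, threshold, influence) {
  stopifnot(length(y) > lag)
  signal <- rep(0, length(y))
  filtered <- y
  for (i in (lag + 1):length(y)) {
    # Rolling mean and standard deviation of the filtering series
    window <- filtered[(i - lag):(i - 1)]
    mu <- mean(window)
    sigma <- sd(window)
    if (abs(y[i] - mu) > threshold * sigma) {
      # Super-threshold deviation: record its direction and damp its influence
      signal[i] <- if (y[i] > mu) 1 else -1
      filtered[i] <- influence * y[i] + (1 - influence) * filtered[i - 1]
    }
    # Sub-threshold deviations leave signal[i] at 0 and filtered[i] at y[i]
  }
  list(signal = signal, filtered = filtered)
}

# Toy series (hypothetical) with one obvious spike in week 8
y <- c(10, 11, 10, 9, 10, 11, 10, 30, 10, 11)
result <- detect_spikes(y, lag = 5, threshold = 2, influence = 0.5)
result$signal
```

Setting influence to zero would stop spikes from affecting the filtering series at all, while setting it to one would make the filtering series identical to the actual series.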

For example, the chart below plots the actual, filtering and signal series for Dave Rubin’s estimated popularity during my sample period, along with the dates of his three JRE appearances. I compute the filtering means and standard deviations with lag equal to 12, and set the filtering threshold at two standard deviations from the filtering mean. Positive signals register when the actual series deviates above the grey band.

The real-time algorithm identifies spikes coincident with each of Dave’s appearances on The Joe Rogan Experience. However, it also identifies false positives that reflect other sources of sudden popularity booms.

I compute the empirical probability that the real-time algorithm detects a spike in guests’ popularity conditional upon their appearing on The Joe Rogan Experience in the same or previous week.⁴ The table below reports this probability for a range of lag and threshold values, and with influence equal to 0.5.⁵

Pr(Spike | Appears)	`lag = 3`	`lag = 6`	`lag = 9`	`lag = 12`
`threshold = 1`	0.940	0.923	0.905	0.892
`threshold = 2`	0.896	0.866	0.845	0.824
`threshold = 3`	0.837	0.808	0.771	0.748
`threshold = 4`	0.791	0.753	0.725	0.696

Increasing lag or threshold lowers the detection rate, indicating that the real-time algorithm is more likely to identify guest appearances when it is more adaptive and less picky. The negative relationship between detection rate and lag (with threshold held constant) suggests that, on average, guests’ popularities are more volatile over longer horizons: the further back you look in search history, the more likely you are to remember shocks and so the larger new shocks must be to seem uncommon.
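Each cell in the table is a simple proportion. The base R sketch below shows one way to compute it for a single guest, given a signal series and the weeks in which that guest appears; the detected helper and the toy inputs are hypothetical.

```r
# For each appearance week, check for a positive signal in that week or the next
detected <- function(signal, appearance_weeks) {
  sapply(appearance_weeks, function(a) {
    weeks <- c(a, a + 1)
    weeks <- weeks[weeks <= length(signal)]  # stay within the sample period
    any(signal[weeks] == 1)
  })
}

# Toy inputs (hypothetical): spikes in weeks 3 and 7, appearances in weeks 3, 6 and 9
signal <- c(0, 0, 1, 0, 0, 0, 1, 0, 0, 0)
appearances <- c(3, 6, 9)

# Empirical probability of detecting a spike, conditional on appearing
mean(detected(signal, appearances))
```

The appearance in week 6 counts as detected because the spike arrives one week later, which is the latency allowance described in the footnote.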

Conclusion

In general, appearing on The Joe Rogan Experience seems to coincide with a spike in popularity as measured by web search interest. This result is robust to varying the definition of “spike,” at least along the dimensions of the lag and threshold parameters used by the real-time detection algorithm.

While suggestive, my analysis is not causal because I do not compare my results with the counterfactual scenario in which treatments (i.e., JRE appearances) do not occur. The false positives identified by the real-time algorithm are reminders that my results may be driven by other confounding factors.

It would be useful to compare guests’ popularity dynamics near JRE appearances with those near appearances on other fora. This comparison would help me separate the effect of increased online presence in general from the effect of appearing on The Joe Rogan Experience in particular, and may thereby provide stronger hints at causality.

I exclude Brian Redban’s appearances prior to episode #674, when he returned as a guest for the first time after producing and co-hosting the show until late 2013. ↩︎
Google Trends’ FAQ does not identify how the raw search proportions get mapped to [0, 100]. I assume that the map is linear so that, for example, an increase from 25 to 50 and from 50 to 100 both constitute a doubling in popularity. ↩︎
I use an MA order of seven. Thus, each observation in the moving average series is equal to the mean value over the two surrounding months in the actual series. This choice seems to optimally suppress the impact of spikes on local means. ↩︎
Google Trends provides data in weekly intervals with weeks starting on Saturdays. I include lagged weeks in the detection criterion to allow for latency between JRE episode transmission and audience response. For example, the web search activity attributable to an episode aired on a Friday may not occur until the Saturday that begins the following week. ↩︎
I obtain similar patterns with influence equal to 0.3 and 0.7. ↩︎

Coauthorship networks at Motu

Thu, 21 Jun 2018 00:00:00 +0000

Earlier this year I joined Motu, an economic and public policy research institute based in Wellington, New Zealand. In this post, I analyse the coauthorship network among Motu researchers based on working paper publications. The data used in my analysis are available here.

Collecting and preparing the data

Bibliographic data are notoriously uncooperative. Changes in author or institution names make it difficult to uniquely identify researchers across time, reducing data consistency and completeness. Moreover, most bibliographic databases charge an access fee that discourages casual exploration. Fortunately, Motu’s working paper directory is presented in a consistent format that makes it amenable to web scraping free of charge.

The R script data.R scrapes the directory for a list of working paper IDs and URLs. Each URL points to a landing page for the corresponding paper, which I scrape for a list of authors. I include only those authors with outgoing hyperlinks because

the hyperlinked URL provides a unique and persistent author ID, and
it is much easier to perform a regular expression search for <a href="(.*?)"> than to distinguish different uses of commas case-by-case.

The resulting file authors.csv contains each unique author-paper pair. It excludes the authors of five papers for which either (i) there is no landing page linked from the main directory or (ii) the landing page has no authors with outgoing hyperlinks.

I read in authors.csv and two other tables: areas.csv, which contains the name, ID and ambient colour for each of Motu’s six primary research areas; and papers.csv, which links each paper to its research area. I merge these data into a single tibble data:

library(dplyr)

data <- authors %>%
  left_join(papers) %>%
  left_join(areas)

The authorship network

I next construct an authorship network by pairing papers with their authors using the information contained in data. I achieve this by defining an author-paper incidence matrix

incidence <- table(data$author, data$paper)

and using that matrix to create a bipartite network bip:

library(igraph)

bip <- graph.incidence(incidence)

The authorship network bip contains 74 authors who collectively wrote 232 working papers over the 2003–2018 sample period. Those papers are distributed across Motu’s research areas as shown in the chart below.

The variation in working paper counts reflects the variation in areas’ tenure within Motu’s research portfolio. Environment and Resources, contributing 67 working papers, has been around since the series began; Human Rights, appearing only once in the series, is a relatively new research area for Motu.

The authorship network bip is drawn below using Fruchterman and Reingold’s (1991) force-directed algorithm. Squares denote working papers and are coloured by research area. Each circle denotes an author and is scaled according to the number of working papers (co)written by that author.

A striking feature of bip is the presence of three high-degree vertices, or hubs, each representing an author of at least 48 working papers. These hubs are shaded in the map of bip shown above. Another feature is the variation in area diversity within authors’ individual corpuses. Urban and Regional authors tend to also write papers on Wellbeing and Macroeconomics, while Environment and Resources authors are more specialised.

The coauthorship network

Projecting bip onto the set of authors yields a coauthorship network in which two authors are adjacent if they have written a paper together. I define such a projection via

net <- bipartite.projection(bip)[[1]]
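The projection has a simple matrix interpretation, which the base R sketch below illustrates on an invented set of author-paper pairs. It is an equivalent calculation for intuition, not a description of igraph's internals.

```r
# Toy author-paper pairs (hypothetical)
toy <- data.frame(
  author = c("A", "B", "B", "C", "D"),
  paper  = c("p1", "p1", "p2", "p2", "p3"),
  stringsAsFactors = FALSE
)

# Author-paper incidence matrix, as in the post
incidence <- unclass(table(toy$author, toy$paper))

# Entry (i, j) counts the papers shared by authors i and j;
# the diagonal counts each author's papers
shared <- incidence %*% t(incidence)

# Two authors are adjacent if they share at least one paper
adjacency <- shared > 0
diag(adjacency) <- FALSE

# Edge count and edge density of the projected network
n_edges <- sum(adjacency) / 2
n_edges / choose(nrow(adjacency), 2)
```

Here author B links A and C, who have never written together, while D has no coauthors and is isolated in the projection.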

I use the jaccard function described in my previous post to determine the similarity between two authors from their authorship counts. According to this measure, maximally similar authors always write together while maximally dissimilar authors never write together. Again, I use the Fruchterman-Reingold algorithm for distributing vertices in the plane. The resulting map of net is shown below.

The coauthorship network is sparse, containing only 168 (about 6%) of the 2,701 possible edges between its 74 vertices. However, the largest connected component (LCC) of net contains all but six authors, two of whom write exclusively with each other and four of whom have no coauthors. Such connectivity is facilitated by the three shaded hubs identified above.

Hints of small-worldness

The sparsity of net implies that most pairs of authors aren’t coauthors. Indeed, the probability that two randomly selected authors are coauthors is given by net's edge density: about 0.06. However, it is not unusual for two randomly selected authors to share a common coauthor; within the LCC of net, the probability of such an event is about 0.46. I calculate this probability by examining the distribution of (unweighted) geodesic distances between the vertices in net and determining the proportion of vertex pairs that are distance two apart. The following function performs that calculation for an arbitrary connected graph G.

common_neighbour_rate <- function (G) {
  B <- distances(G, weights = rep(1, gsize(G))) == 2
  num_pairs <- choose(gorder(G), 2)
  rate <- (sum(B) / 2) / num_pairs  # Mean within upper right triangle
  return (rate)
}

The function common_neighbour_rate works by computing the geodesic distances between each pair of vertices in G, defining binary indicator variables (as entries of the matrix B) for whether each distance is equal to two and taking the average of those variables over all possible vertex pairs. Its name comes from recognising that “coauthor” is a context-specific synonym for “neighbouring vertex.”
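The same rate can be cross-checked without computing geodesic distances, because two vertices are at distance two exactly when they are non-adjacent and share a neighbour. The base R sketch below applies that idea to an adjacency matrix; the function name common_neighbour_rate_adj and the toy path graph are assumptions for illustration.

```r
# Alternative calculation from a symmetric 0/1 adjacency matrix A
common_neighbour_rate_adj <- function(A) {
  two_step <- (A %*% A) > 0      # pairs joined by a walk of length two
  B <- two_step & (A == 0)       # ...that are not already adjacent
  diag(B) <- FALSE               # ignore walks that return to their origin
  (sum(B) / 2) / choose(nrow(A), 2)
}

# Toy graph (hypothetical): the path 1 - 2 - 3 - 4
A <- matrix(0, nrow = 4, ncol = 4)
A[cbind(1:3, 2:4)] <- 1
A <- A + t(A)

common_neighbour_rate_adj(A)
```

In the path graph, the pairs (1, 3) and (2, 4) are the only ones at distance two, so the rate is two out of six possible pairs.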

Within the LCC of net, the average distance between any two authors is equal to 2.5 while the maximum such distance—the diameter of the LCC—is equal to five. These numbers suggest a smallness about the world inhabited by Motu working paper authors: if you ask anyone if they’ve written a paper with so-and-so, the answer you’ll get is probably, “no, but I’ve written with someone who has written with someone that has.” It appears that, at least in terms of geodesic distances, Motu researchers are seldom far apart.

Testing for small-worldness

Watts and Strogatz (1998) formalise the idea of small-worldness.¹ They identify small-world networks as those that are

highly clustered … yet have small characteristic path lengths.

The extent to which a network is clustered is determined by its clustering coefficient, while the characteristic path length is simply the mean geodesic distance between pairs of vertices. Intuitively, a network is small-world if it has local communities whose links are mostly internal but with a few external links that facilitate fast inter-community exchange. For example, most flights undertaken by New Zealanders comprise travel within our dense domestic network, but a Cantabrian wanting to holiday in Bangkok or Dubai need only make a pitstop in Sydney. The latter acts as a hub that connects many distant cities in the same way that the three shaded vertices in the map of net above connect many otherwise distant authors.

Humphries and Gurney (2008) describe a method for determining small-worldness using random graphs. Their strategy is to compare the clustering coefficient and mean distance between vertices in a network to the expected value of those attributes if edges are randomly distributed. Concretely, they state that

A network with n nodes and m edges is a small-world network if it has a similar path length but greater clustering of nodes than an equivalent Erdös-Rényi random graph with the same n and m.

The Erdös-Rényi model is a simple method of generating random graphs with a fixed number of vertices and edges, the latter being placed between vertex pairs with uniform probability and without duplication. Such graphs tend to have short mean distances because edges are as likely to traverse the network and bridge communities as they are to consolidate an already tight local community. Likewise, random edge assignment disregards community formation, causing Erdös-Rényi graphs to have small clustering coefficients.

The function below computes the clustering coefficient (known to igraph users as transitivity) and characteristic path length for a sample of Erdös-Rényi random graphs that are equivalent to an arbitrary graph G. The sample means of these attributes provide baselines against which to measure the corresponding values observed from G.

small_world_baselines <- function (G, sample_size = 1000, seed = 0) {
  set.seed(seed)
  transitivity_samples <- rep(0, sample_size)
  mean_distance_samples <- rep(0, sample_size)
  for (i in 1 : sample_size) {
    er <- sample_gnm(gorder(G), gsize(G))
    transitivity_samples[i] <- transitivity(er)
    mean_distance_samples[i] <- mean_distance(er, directed = FALSE)
  }
  return (list(transitivity = mean(transitivity_samples),
               mean_distance = mean(mean_distance_samples)))
}

The coauthorship network net has clustering coefficient 0.24 and mean distance 2.49, with baseline comparators of 0.06 and 2.96. Thus, net is about four times as clustered as is expected for a network with its density and has slightly shorter geodesic distances than would be obtained by allocating edges randomly. These facts positively indicate small-worldness, and reflect widespread collaboration between authors within and between research areas.

Humphries and Gurney define a small-world coefficient by taking the ratio of observed and expected clustering coefficients, and dividing the result by the ratio of observed and expected mean distances. This quotient is larger than one for small-world networks. The coauthorship network net obtains a small-world coefficient of 4.67, thereby passing the Humphries-Gurney small-worldness test.
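The quotient described above can be sketched as a small helper. The function and argument names below are hypothetical (they do not appear in my original code), and the inputs are the rounded values reported earlier for net.

```r
# Hypothetical helper: ratio of the clustering ratio to the distance ratio
small_world_coefficient <- function (transitivity, mean_distance,
                                     baseline_transitivity,
                                     baseline_mean_distance) {
  clustering_ratio <- transitivity / baseline_transitivity
  distance_ratio <- mean_distance / baseline_mean_distance
  return (clustering_ratio / distance_ratio)
}

# Rounded values for net: observed (0.24, 2.49) and baseline (0.06, 2.96)
sigma <- small_world_coefficient(0.24, 2.49, 0.06, 2.96)
```

With these rounded inputs the helper returns about 4.75 rather than 4.67. The gap is presumably due to rounding, since the clustering coefficients are quoted to only two decimal places.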

Subsampling by research area

Finally, I analyse the coauthorship network within Motu’s five largest research areas. I filter the working papers from data that correspond to each area and recompute several statistics mentioned earlier using the subsample data. The first set of statistics is shown in the table below.

Area	Papers	Authors	Edge density	LCC order	LCC diameter
Environment and Resources	67	37	0.08	29	3
Population and Labour	56	29	0.13	26	4
Urban and Regional	50	32	0.09	29	4
Wellbeing and Macroeconomics	35	19	0.13	14	2
Productivity and Innovation	23	18	0.20	17	4

Environment and Resources boasts the largest number of authors as well as working papers. However, it has the least dense coauthorship network, containing only 8% of all possible edges. The Productivity and Innovation coauthorship network is the most dense. The largest connected component of the Wellbeing and Macroeconomics coauthorship network is the smallest among the five areas; however, any two authors within its LCC are either coauthors or share a common coauthor.

I also test each area’s coauthorship network for small-worldness using the Humphries-Gurney procedure. The results are tabulated below.

Area	Clustering coefficient (baseline)	Mean distance (baseline)	Small-world coefficient
Environment and Resources	0.25 (0.08)	1.93 (3.16)	5.13
Population and Labour	0.33 (0.13)	2.15 (2.56)	3.01
Urban and Regional	0.17 (0.09)	2.13 (3.04)	2.71
Wellbeing and Macroeconomics	0.19 (0.11)	1.77 (2.88)	2.76
Productivity and Innovation	0.39 (0.19)	2.24 (2.31)	2.17

All five areas have small-world coefficients greater than one, and therefore satisfy Humphries and Gurney’s criterion. However, the ratio of observed and baseline clustering coefficients is not as large in any area as it is in the full coauthorship network. Moreover, only two areas have mean distances close to those expected in an equivalent Erdös-Rényi random graph. The best candidate for a small world—that is, a world with high clustering and as-random geodesic distances—is the Productivity and Innovation coauthorship network, despite it having the lowest small-world coefficient.

I suspect that network size adds considerable noise to these estimates. Even the full coauthorship network net is barely large enough to exhibit any global structure that can be distinguished from randomness. Applying the Humphries-Gurney test to a larger network, or implementing a more robust procedure such as that proposed by Telesford et al. (2011), may yield cleaner results.

Note: I updated this post on July 28, 2019 after revising the source data. My results changed slightly due to retroactive author (re)assignments.

The linked article is locked behind a paywall. However, Strogatz hosts a free copy on his website. ↩︎

Habitat choices of first-generation Pokémon

Thu, 01 Mar 2018 00:00:00 +0000

In this post, I use R’s igraph package to analyse the cohabitation network among wild Pokémon species. The underlying data come from the GitHub repository behind veekun.

Matching species with their habitats

I infer habitats from random encounter events in the international versions of Pokémon Red, Blue and Yellow.¹ I store these events in a data frame named encounters. Each encounter has three attributes: the location, the species encountered and that species’ primary type. I use these data to generate a species-location incidence matrix:

habits <- table(encounters$species, encounters$location)

The rows and columns of habits count where species habitate. For example, summing the rows of habits yields the number of unique habitats for each species. I store these sums as follows:

pokemon <- tibble(species = rownames(habits), ubiquity = rowSums(habits))

Goldeen, Magikarp and Poliwag are the most ubiquitous species. Each habitates in 24 unique locations across the Kanto region.

The boxplots below show the distribution of ubiquity by species’ primary type. Water-types have the highest median ubiquity, closely followed by Grass- and Normal-types. Species with Dragon, Fairy or Ghost as their primary type each habitate in a single location.

The column sums of habits count the number of unique species that habitate in each location. I store these sums as follows:

locations <- tibble(name = colnames(habits), diversity = colSums(habits))

I compute the mean value of diversity across the locations in which each species habitates via

pokemon$mean_diversity <- colSums(t(habits) * locations$diversity) / pokemon$ubiquity
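The recycling in this calculation is easy to misread, so here is a minimal sketch on a made-up incidence matrix. The three species (A, B, C) and two locations (X, Y) are hypothetical and are not drawn from encounters.

```r
# Hypothetical incidence matrix: rows are species, columns are locations
habits <- rbind(A = c(X = 1, Y = 1),
                B = c(X = 1, Y = 0),
                C = c(X = 1, Y = 0))

ubiquity <- rowSums(habits)   # A = 2, B = 1, C = 1
diversity <- colSums(habits)  # X = 3, Y = 1

# t(habits) has one row per location, so multiplying by diversity scales
# each row by that location's diversity before summing over locations
mean_diversity <- colSums(t(habits) * diversity) / ubiquity
```

Species A lives in both locations, so its mean diversity is (3 + 1) / 2 = 2, while B and C live only in X and each get 3.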

ubiquity and mean_diversity share a correlation coefficient of about -0.22, suggesting that they share a weak negative relationship. Thus, on average, more ubiquitous species tend to live in less diverse locations. However, this relationship is skewed by a large number of species that cohabitate in one or two locations as shown in the chart below.

The chart plots mean_diversity against ubiquity, along with the least-squares line of best fit.² The top-left cluster comprises species that exclusively habitate inside Cerulean Cave or the Kanto Safari Zone. This cluster has a strong positive effect on mean_diversity among species with low ubiquity values, driving the negative relationship between the two attributes.

The cohabitation network

Species reveal their preference toward spending time with each other through their choice of whether to share habitats. The more frequently two species cohabitate, the stronger is their implied social connection. The number of locations in which two species cohabitate is equal to the dot product of the two corresponding rows of habits. I store these counts in a symmetric species-species adjacency matrix:

cohabits <- habits %*% t(habits)

Each entry cohabits[i, j] is equal to the number of locations in which species i and j cohabitate, and each diagonal entry cohabits[i, i] is equal to the ubiquity of species i.

The raw cohabitation counts are an imperfect measure of the strength of the social ties between species. For example, ubiquitous species tend to have higher cohabitation counts with all other species and so appear to be more social. However, having many social connections may indicate that a species “spreads itself thin” and that each of its connections is actually quite weak. Strong connections arise when two species spend lots of their time together and little of their time apart.

The Jaccard index provides a convenient measure of the tendency for two species to spend most of their time in each other’s company. The index counts the number of locations in which two species cohabitate as a proportion of the locations in which at least one of those species habitates. I define a function jaccard for computing Jaccard indices from an arbitrary cohabitation matrix C as follows.

jaccard <- function (C) {
  U <- matrix(rep(diag(C), nrow(C)), ncol = nrow(C))
  H <- U + t(U) - C
  J <- C / H
  return (J)
}

If C = cohabits then each column of U is equal to the vector pokemon$ubiquity, and each entry H[i, j] of H counts the number of locations in which at least one of species i and j habitates. The Jaccard index J[i, j] obtains its maximum value of unity when species i and j habitate in precisely the same locations, and its minimum value of zero when they never cohabitate. The more similar two species’ habitat choices, the higher is their shared Jaccard index.
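To make the index concrete, the sketch below runs jaccard on a made-up incidence matrix. Species A, B and C and locations X, Y and Z are hypothetical. A and B share one of the two locations that either inhabits, while C never cohabitates with anyone.

```r
jaccard <- function (C) {
  U <- matrix(rep(diag(C), nrow(C)), ncol = nrow(C))
  H <- U + t(U) - C
  J <- C / H
  return (J)
}

# Hypothetical incidence matrix: rows are species, columns are locations
habits <- rbind(A = c(X = 1, Y = 1, Z = 0),
                B = c(X = 1, Y = 0, Z = 0),
                C = c(X = 0, Y = 0, Z = 1))

cohabits <- habits %*% t(habits)  # Dot products of rows; diagonal = ubiquity
J <- jaccard(cohabits)

J['A', 'B']  # One shared location out of two inhabited by A or B
J['A', 'C']  # No shared locations
```

Here J['A', 'B'] is 1 / (2 + 1 - 1) = 0.5 and J['A', 'C'] is zero, while every diagonal entry equals one.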

I define the cohabitation network net as the weighted graph with adjacency matrix equal to jaccard(cohabits):

library(igraph)

net <- graph_from_adjacency_matrix(jaccard(cohabits), weighted = TRUE, mode = 'undirected')
net <- simplify(net)  # Remove loops

Identifying the strongest connections

The cohabitation network contains 1,549 (about 31%) of the 4,950 possible edges between its 100 vertices. However, many of these edges have low weight and correspond to weak social connections between species, whereas I’m most interested in identifying which species share strong connections.

I identify an edge-induced subgraph of net that represents the strongest connections as follows.³ First, I find a maximum spanning forest (MSF) of net; that is, an edge-induced subgraph that

1. has the same vertex set as net,
2. has trees as components, and
3. obtains the maximum edge weight sum over all edge-induced subgraphs satisfying criteria 1 and 2.

The MSF joins each species with one of the species with which it most frequently cohabitates. However, depending on the algorithm used, the MSF generally doesn’t join every species with its most frequent cohabitant and therefore doesn’t necessarily contain the strongest connections in net.⁴ Accordingly, I augment the MSF by taking its union with the subgraph induced by the edges in net of highest weight. I choose the number of such edges to be equal to the order of net so as to achieve a mean vertex degree of about four.

I define a function augmented_msf for identifying the augmented MSF of a graph G as follows.

augmented_msf <- function (G) {
  E(G)$id <- seq(gsize(G))
  msf_ids <- E(mst(G, -E(G)$weight))$id
  cutoff <- quantile(E(G)$weight, (gsize(G) - gorder(G)) / gsize(G))[1]
  aug_ids <- which(E(G)$weight >= cutoff)
  aug_msf <- subgraph.edges(G, eids = E(G)[unique(c(msf_ids, aug_ids))])
  return (aug_msf)
}

The third and fourth lines in the definition of augmented_msf identify the edges of G with which to augment its MSF. For example, if G has order 20 and size 100 then the MSF of G is augmented by adding those edges in G with weights equal to or greater than the weight of the edge at the 80th percentile.
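The cutoff step can be sketched without igraph. The example below assumes a hypothetical graph of order 20 whose 100 edges have the distinct weights 1 to 100. The variable names order_G and size_G stand in for gorder(G) and gsize(G).

```r
order_G <- 20           # Stand-in for gorder(G)
weights <- seq(1, 100)  # Hypothetical edge weights, all distinct
size_G <- length(weights)  # Stand-in for gsize(G)

# Same cutoff rule as in augmented_msf: the 80th percentile here
cutoff <- quantile(weights, (size_G - order_G) / size_G)[1]

# Edges at or above the cutoff are used to augment the MSF
aug_ids <- which(weights >= cutoff)
length(aug_ids)
```

With distinct weights, exactly 20 edges (as many as there are vertices) clear the cutoff. Ties at the cutoff would admit more.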

Visualising the network

The augmented MSF of net contains 242 edges and is drawn below. Each vertex is coloured according to the corresponding species’ primary type and scaled according to that species’ ubiquity. I use Fruchterman and Reingold’s (1991) force-directed algorithm for determining vertices’ layout.

The cohabitation network has two components: one large component of 98 different species and many types, and one isolated pair of Ground-types. The latter contains Diglett and Dugtrio, which habitate exclusively in Diglett’s Cave. Water-types are most socially connected to other Water-types, suggesting that there are few amphibious species in the Kanto region that spend most of their time in the water. Poison-types tend to be closely connected to Ground- and Rock-types, which are, presumably, immune to toxicity.

The augmented MSF reveals two large, densely connected clusters of low ubiquity species. These clusters represent Cerulean Cave and the Kanto Safari Zone, and are directly bridged by Chansey, Parasect and Rhyhorn. There is also a small cluster of Fire- and Poison-types that cohabitate inside Pokémon Mansion, and a clique of four Bug-types found in Viridian Forest.

The structure of net reveals information about species’ social influence. A simple measure of such influence is the degree centrality of each species, which counts the number of other species with which it cohabitates. The table below displays the species with the six highest degree centralities in the cohabitation network.

Species	Type	Degree
Goldeen	Water	82
Magikarp	Water	82
Poliwag	Water	82
Krabby	Water	69
Kingler	Water	64
Ditto	Normal	56

The three most degree-central species are also the three most ubiquitous and cohabitate with 82 of the 99 other species in my sample. Eight of the 10 most degree-central species are Water-types.

The betweenness centrality of each species measures the frequency with which that species lies on the shortest path between others in the cohabitation network. Intuitively, more betweenness-central species tend to have more control over the spread of information due to their relative criticality in other species’ communication channels.
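The two centrality measures are easiest to compare on a tiny example. The sketch below uses a hypothetical three-vertex path (a, b, c) rather than net, and it is unweighted, so it only illustrates the definitions and not my exact calculation.

```r
library(igraph)

# Hypothetical path graph: b sits between a and c
path <- graph_from_literal(a - b - c)

deg <- degree(path)       # Number of neighbours of each vertex
btw <- betweenness(path)  # Shortest paths between others through each vertex
```

Vertex b has degree two and lies on the only shortest path between a and c, so its betweenness is one. The endpoints have degree one and betweenness zero.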

The six most betweenness-central species are tabulated below. Goldeen, Magikarp and Poliwag are important conduits of information due to their high ubiquity. Cubone takes fifth place because it is the only species through which Gastly and Haunter—both found exclusively inside Pokémon Tower—can communicate with species in the Safari Zone.

Species	Betweenness
Goldeen	269.15
Magikarp	269.15
Poliwag	269.15
Ditto	202.20
Cubone	190.00
Krabby	169.77

The chart below compares species’ betweenness and degree centralities. With the exception of Cubone, more betweenness-central species tend to have more cohabitants. Water-types are relatively inefficient at accumulating betweenness centrality when they expand their social network, whereas Electric-types appear to gain a relatively large amount of betweenness centrality per extra cohabitant.

Species with densely connected social networks are unlikely to be very betweenness-central because their cohabitants can share information with each other directly. The probability that two of a species’ cohabitants also cohabitate is given by the transitivity of the corresponding vertex in net.

The chart below plots species’ betweenness centralities against their transitivity within the cohabitation network. The two attributes share a strong, negative and convex relationship. Species whose cohabitants also cohabitate are less betweenness-central because the former lack exclusive control of their cohabitants’ channels for sharing information. The exceptions to this trend are Cubone and Pikachu, which have unusually high and low betweenness centralities, respectively. Pikachu habitate in two locations (Viridian Forest and the Kanto Power Plant), each of which contains a small number of species that frequently cohabitate and that generally have much higher degree centralities. As a result, Pikachu have an unusually low betweenness centrality because their cohabitants are able to communicate with each other directly and with other species indirectly through their wider social networks.

The co-containment network

I recycle my method of analysing the cohabitation network among species in order to explore the co-containment network among locations. In the latter network, two locations are adjacent if and only if they contain a common species. I generate the co-containment network from a binary location-location adjacency matrix as follows.

cocontains <- t(habits) %*% habits  # Count species shared by each pair of locations
cocontains <- pmin(cocontains, 1)  # Remove parallel edges
location_net <- graph_from_adjacency_matrix(cocontains, mode = 'undirected')

The graph location_net contains 542 (about 60%) of the 903 possible edges between its 43 vertices.
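The construction can be checked by hand on a made-up incidence matrix. In the sketch below, species A, B and C and locations X, Y and Z are hypothetical, and I count edges directly from the matrix (ignoring its diagonal) instead of building a graph.

```r
# Hypothetical incidence matrix: rows are species, columns are locations
habits <- rbind(A = c(X = 1, Y = 1, Z = 0),
                B = c(X = 1, Y = 0, Z = 0),
                C = c(X = 0, Y = 0, Z = 1))

cocontains <- t(habits) %*% habits  # Shared species counts; diagonal = diversity
cocontains <- pmin(cocontains, 1)   # Collapse counts to binary adjacency

# Count each location pair once, skipping the diagonal
num_edges <- sum(cocontains[upper.tri(cocontains)])
possible_edges <- choose(ncol(habits), 2)
```

Only X and Y share a species (A), so the toy network contains one of three possible edges. The diagonal of cocontains is nonzero, so it must be excluded when counting edges between distinct locations.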

The locations with the six highest mean ubiquities are tabulated below. Viridian City and Pallet Town have the least unique demographies; the few species that habitate in these locations tend to also habitate in many other locations. That Viridian City’s mean ubiquity and degree centrality are similar suggests that its four habitants usually cohabitate.

Location	Mean Ubiquity	Degree	Diversity
Viridian City	21.25	25	4
Pallet Town	18.40	25	5
Celadon City	16.40	25	5
Cerulean City	16.17	25	6
Cinnabar Island	16.00	25	7
Route 1	16.00	24	2

Finally, the table below shows the top six most betweenness-central locations. Route 10 appears to be an important junction for information flow between species. This is likely due to the diversity of its contained species and to the fact that Routes 10 and 11 boast the highest degree centralities in the co-containment network. The Safari Zone, another highly diverse location, is also an important information relay.

Location	Betweenness	Degree	Diversity
Route 10	58.54	39	18
Safari Zone	55.89	33	27
Route 11	23.03	39	15
Cerulean Cave	20.28	31	28
Route 6	18.38	38	16
Sea Route 21	18.38	38	13

Restricting to random encounters excludes starter Pokémon, species obtainable only through evolution and “special” encounters (e.g., the Electrodes inside the Kanto Power Plant and the legendary birds) from the sample. ↩︎
Observations in this and all other charts are coloured by the corresponding species’ primary type, and are plotted with a small amount of noise in order to reveal coincident points that would otherwise be hidden. ↩︎
This technique is based on Hidalgo et al.’s (2007) method of representing the product space of internationally traded goods. ↩︎
For example, consider applying a greedy algorithm such as Prim’s to a cohabitation network that contains (i) a large clique of species that cohabitate in a single location and (ii) several species that are spread across many different locations. The algorithm will first connect each species in the clique and then, in order to avoid creating cycles, branch out to connect the relatively weakly connected species until a spanning forest is formed. The resulting subgraph will be an MSF but will contain edges that have lower weights than some of the omitted edges in the clique. ↩︎

$t_i$	$t_j$	$x_{t_it_j}$
$a$	$a$	$x_{aa}$
$a$	$b$	$x_{ab}$
$b$	$a$	$x_{ba}$
$b$	$b$	$x_{bb}$

Network	$\mathrm{E}[d_i]$	$\widehat{\mathrm{E}[f_i]}$	$\mathrm{E}[f_i]$	$\mathrm{Corr}(d_i,f_i)$
$G_1$	1.43	1.6	1.43	1.00
$G_2$	1.43	1.6	1.57	0.20
$G_3$	1.43	1.6	1.71	-0.91

Coefficient	Estimate	Std. error
$\gamma_1$	0.965	0.290
$\gamma_2$	1.208	0.303
$\gamma_3$	-0.377	0.422
