Suppose I have data \((a_i,b_i)_{i=1}^n\)
on two random variables \(A\)
and \(B\)
.
I store my data as vectors a
and b
, and compute their correlation using the cor
function in R:
cor(a, b)
## [1] 0.4326075
Now suppose I append a mirrored version of my data by defining the vectors
alpha = c(a, b)
beta = c(b, a)
so that alpha
is a concatenation of the \(a_i\)
and \(b_i\)
values, and beta
is a concatenation of the \(b_i\)
and \(a_i\)
values.
I compute the correlation of alpha
and before
as before:
cor(alpha, beta)
## [1] 0.4288428
Notice that cor(a, b)
and cor(alpha, beta)
are not equal.
This surprised me.
How can appending a copy of the same data change the correlation within those data?
The answer is that the concatenated data \((\alpha_i,\beta_i)_{i=1}^{2n}\)
have different marginal distributions than the original data \((a_i,b_i)_{i=1}^n\)
.
Indeed one can show that
$$\DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\E}{E} \DeclareMathOperator{\Var}{Var} \begin{align} \E[\alpha]=\E[\beta]=\frac{\E[a]+\E[b]}{2} \end{align}$$
and
$$\begin{align} \E[\alpha^2]=\E[\beta^2]=\frac{\E[a^2]+\E[b^2]}{2}, \end{align}$$
where
$$\E[\alpha]\equiv\frac{1}{2n}\sum_{i=1}^n\alpha_i$$
is the empirical mean of the \(\alpha_i\)
values, and where \(\E[\beta]\)
, \(\E[a]\)
, and \(\E[b]\)
are defined similarly.
It turns out that \(\E[\alpha\beta]=\E[ab]\)
, but since the marginal distributions are different the empirical correlations are different.
In fact
$$\Cor(\alpha,\beta)=\frac{\Cov(a,b)-0.25\left(\E[a]+\E[b]\right)^2}{0.5\Var(a)+0.5\Var(b)+0.25\left(\E[a]-\E[b]\right)^2},$$
where \(\Cor\)
, \(\Cov\)
, and \(\Var\)
are the empirical correlation, covariance, and variance operators.
This expression implies that cor(alpha, beta)
and cor(a, b)
will be equal if the \(a_i\)
and \(b_i\)
values have the same means and variances.
We can achieve this by scaling a
and b
before computing their correlation:
cor(scale(a), scale(b))
## [1] 0.4326075
The scale
function de-means its argument and scales it to have unit variance.
These operations don’t change the correlation of a
and b
.
But they do change the correlation of alpha
and beta
:
alpha = c(scale(a), scale(b))
beta = c(scale(b), scale(a))
cor(alpha, beta)
## [1] 0.4326075
Now the two correlations agree!
I came across this phenomenon while writing my previous post, in which I discuss the degree assortativity among nodes in Zachary’s (1977) karate club network.
One way to measure this assortativity is to use the degree_assortativity
function in igraph:
library(igraph)
G = graph.famous('Zachary')
assortativity_degree(G)
## [1] -0.4756131
This function returns the correlation of the degrees of adjacent nodes in G
.
Another way to compute this correlation is to
- construct a matrix
el
in which rows correspond to edges and columns list incident nodes; - define the vectors
d1
andd2
of degrees among the nodes listed inel
; - compute the correlation of
d1
andd2
usingcor
.
Here’s what I get when I take those three steps:
el = as_edgelist(G)
d = degree(G)
d1 = d[el[, 1]] # Ego degrees
d2 = d[el[, 2]] # Alter degrees
cor(d1, d2)
## [1] -0.4769563
Notice that cor(d1, d2)
disagrees with the value of assortativity_degree(G)
computed above.
This is because the vectors d1
and d2
have different means and variances:
c(mean(d1), mean(d2))
## [1] 7.487179 8.051282
c(var(d1), var(d2))
## [1] 25.94139 32.23110
These differences come from el
listing each edge only once: it includes a row c(i, j)
for the edge between nodes \(i\)
and \(j\not=i\)
, but not a row c(j, i)
.
Whereas assortativity_degree
accounts for edges being undirected by adding the row c(j, i)
before computing the correlation.
This is analogous to the “append the mirrored data” step I took to create \((\alpha_i,\beta_i)_{i=1}^{2n}\)
above.
Appending the mirror of el
to itself before computing cor(d1, d2)
returns the same value as assortativity_degree(G)
:
el = rbind(
el,
matrix(c(el[, 2], el[, 1]), ncol = 2) # el's mirror
)
d1 = d[el[, 1]]
d2 = d[el[, 2]]
c(assortativity_degree(G), cor(d1, d2))
## [1] -0.4756131 -0.4756131