Suppose I have data $(a_i, b_i)_{i=1}^n$ on two random variables $A$ and $B$. I store my data as vectors a and b, and compute their correlation using the cor function in R:

cor(a, b)
## [1] 0.4326075

Now suppose I append a mirrored version of my data by defining the vectors

alpha = c(a, b)
beta = c(b, a)

so that alpha is a concatenation of the $a_i$ and $b_i$ values, and beta is a concatenation of the $b_i$ and $a_i$ values. I compute the correlation of alpha and beta as before:

cor(alpha, beta)
## [1] 0.4288428

Notice that cor(a, b) and cor(alpha, beta) are not equal. This surprised me: how can appending a mirrored copy of the data change their correlation?

The answer is that the concatenated data $(\alpha_i, \beta_i)_{i=1}^{2n}$ have different marginal distributions than the original data $(a_i, b_i)_{i=1}^n$. Indeed one can show that $E[\alpha] = E[\beta] = \frac{E[a] + E[b]}{2}$ and $E[\alpha^2] = E[\beta^2] = \frac{E[a^2] + E[b^2]}{2}$, where $E[\alpha] \equiv \frac{1}{2n} \sum_{i=1}^{2n} \alpha_i$ is the empirical mean of the $\alpha_i$ values, and where $E[\beta]$, $E[a]$, and $E[b]$ are defined similarly. It turns out that $E[\alpha\beta] = E[ab]$, but since the marginal distributions are different, the empirical correlations are different. In fact $$\mathrm{Cor}(\alpha, \beta) = \frac{\mathrm{Cov}(a, b) - 0.25\,(E[a] - E[b])^2}{0.5\,\mathrm{Var}(a) + 0.5\,\mathrm{Var}(b) + 0.25\,(E[a] - E[b])^2},$$ where $\mathrm{Cor}$, $\mathrm{Cov}$, and $\mathrm{Var}$ are the empirical correlation, covariance, and variance operators. This expression implies that cor(alpha, beta) and cor(a, b) will be equal if the $a_i$ and $b_i$ values have the same means and variances. We can achieve this by scaling a and b before computing their correlation:

cor(scale(a), scale(b))
## [1] 0.4326075

The scale function de-means its argument and scales it to have unit variance. These operations don’t change the correlation of a and b. But they do change the correlation of alpha and beta:

alpha = c(scale(a), scale(b))
beta = c(scale(b), scale(a))

cor(alpha, beta)
## [1] 0.4326075

Now the two correlations agree!
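
As a sanity check, the expression for $\mathrm{Cor}(\alpha, \beta)$ above can be evaluated directly. Here's a sketch that assumes a and b are the original (unscaled) vectors; the $E$, $\mathrm{Var}$, and $\mathrm{Cov}$ in the formula are empirical (divide-by-$n$) moments, so I compute them with mean rather than with R's var and cov functions, which divide by $n - 1$:

cov_ab = mean(a * b) - mean(a) * mean(b)  # Empirical covariance of a and b
var_a = mean(a ^ 2) - mean(a) ^ 2         # Empirical variances
var_b = mean(b ^ 2) - mean(b) ^ 2
gap_sq = (mean(a) - mean(b)) ^ 2          # Squared difference in means

# Right-hand side of the expression for Cor(alpha, beta)
rhs = (cov_ab - 0.25 * gap_sq) / (0.5 * var_a + 0.5 * var_b + 0.25 * gap_sq)

# rhs and the correlation of the unscaled concatenations should both
# equal the 0.4288428 computed above
c(rhs, cor(c(a, b), c(b, a)))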

I came across this phenomenon while writing my previous post, in which I discuss the degree assortativity among nodes in Zachary's (1977) karate club network. One way to measure this assortativity is to use the assortativity_degree function in igraph:

library(igraph)

G = graph.famous('Zachary')

assortativity_degree(G)
## [1] -0.4756131

This function returns the correlation of the degrees of adjacent nodes in G. Another way to compute this correlation is to

  1. construct a matrix el in which rows correspond to edges and columns list incident nodes;
  2. define the vectors d1 and d2 of degrees among the nodes listed in el;
  3. compute the correlation of d1 and d2 using cor.

Here’s what I get when I take those three steps:

el = as_edgelist(G)

d = degree(G)
d1 = d[el[, 1]]  # Ego degrees
d2 = d[el[, 2]]  # Alter degrees

cor(d1, d2)
## [1] -0.4769563

Notice that cor(d1, d2) disagrees with the value of assortativity_degree(G) computed above. This is because the vectors d1 and d2 have different means and variances:

c(mean(d1), mean(d2))
## [1] 7.487179 8.051282
c(var(d1), var(d2))
## [1] 25.94139 32.23110

These differences come from el listing each edge only once: it includes a row c(i, j) for the edge between nodes i and j, but not a row c(j, i). By contrast, assortativity_degree accounts for edges being undirected by adding the row c(j, i) before computing the correlation. This is analogous to the "append the mirrored data" step I took to create $(\alpha_i, \beta_i)_{i=1}^{2n}$ above. Appending the mirror of el to itself before computing cor(d1, d2) returns the same value as assortativity_degree(G):

el = rbind(
  el,
  matrix(c(el[, 2], el[, 1]), ncol = 2)  # el's mirror
)

d1 = d[el[, 1]]
d2 = d[el[, 2]]

c(assortativity_degree(G), cor(d1, d2))
## [1] -0.4756131 -0.4756131
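
One could also let igraph do the mirroring. Here's a sketch, assuming as.directed with mode = 'mutual' is available: it replaces each undirected edge of G with a pair of opposite arcs, so the resulting edge list already contains both c(i, j) and c(j, i):

# Edge list with one row for each direction of each undirected edge
el2 = as_edgelist(as.directed(G, mode = 'mutual'))

# Should match assortativity_degree(G)
cor(d[el2[, 1]], d[el2[, 2]])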