Suppose I have data \((a_i,b_i)_{i=1}^n\) on two random variables \(A\) and \(B\).
I store my data as vectors a and b, and compute their correlation using the cor function in R:
cor(a, b)
## [1] 0.4326075
Now suppose I append a mirrored version of my data by defining the vectors
alpha = c(a, b)
beta = c(b, a)
so that alpha is a concatenation of the \(a_i\) and \(b_i\) values, and beta is a concatenation of the \(b_i\) and \(a_i\) values.
I compute the correlation of alpha and beta as before:
cor(alpha, beta)
## [1] 0.4288428
Notice that cor(a, b) and cor(alpha, beta) are not equal.
This surprised me.
How can appending a copy of the same data change the correlation within those data?
The answer is that the concatenated data \((\alpha_i,\beta_i)_{i=1}^{2n}\) have different marginal distributions than the original data \((a_i,b_i)_{i=1}^n\).
Indeed one can show that
$$\DeclareMathOperator{\Cor}{Cor} \DeclareMathOperator{\Cov}{Cov} \DeclareMathOperator{\E}{E} \DeclareMathOperator{\Var}{Var} \begin{align} \E[\alpha]=\E[\beta]=\frac{\E[a]+\E[b]}{2} \end{align}$$
and
$$\begin{align} \E[\alpha^2]=\E[\beta^2]=\frac{\E[a^2]+\E[b^2]}{2}, \end{align}$$
where
$$\E[\alpha]\equiv\frac{1}{2n}\sum_{i=1}^{2n}\alpha_i$$
is the empirical mean of the \(\alpha_i\) values, and where \(\E[\beta]\), \(\E[a]\), and \(\E[b]\) are defined similarly.
It turns out that \(\E[\alpha\beta]=\E[ab]\), but since the marginal distributions are different the empirical correlations are different.
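My a and b aren't shown here, but we can check these identities numerically using simulated stand-ins x and y (any two equal-length vectors would do):
set.seed(1)
x = rnorm(100)            # stand-in for a
y = 0.5 * x + rnorm(100)  # stand-in for b, correlated with x
xy = c(x, y)              # analogue of alpha
yx = c(y, x)              # analogue of beta
all.equal(mean(xy), (mean(x) + mean(y)) / 2)        # TRUE
all.equal(mean(xy^2), (mean(x^2) + mean(y^2)) / 2)  # TRUE
all.equal(mean(xy * yx), mean(x * y))               # TRUE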
In fact
$$\Cor(\alpha,\beta)=\frac{\Cov(a,b)-0.25\left(\E[a]-\E[b]\right)^2}{0.5\Var(a)+0.5\Var(b)+0.25\left(\E[a]-\E[b]\right)^2},$$
where \(\Cor\), \(\Cov\), and \(\Var\) are the empirical correlation, covariance, and variance operators.
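This identity can be verified numerically too, continuing with the simulated x and y from above. One caveat: the \(\Cov\) and \(\Var\) in the formula divide by \(n\), whereas R's cov and var divide by \(n-1\), so I define population-moment versions by hand:
Cov = function(u, v) mean(u * v) - mean(u) * mean(v)  # divides by n, not n - 1
Var = function(u) Cov(u, u)
num = Cov(x, y) - 0.25 * (mean(x) - mean(y))^2
den = 0.5 * Var(x) + 0.5 * Var(y) + 0.25 * (mean(x) - mean(y))^2
all.equal(cor(xy, yx), num / den)  # TRUE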
This expression implies that cor(alpha, beta) and cor(a, b) will be equal if the \(a_i\) and \(b_i\) values have the same means and variances.
We can achieve this by scaling a and b before computing their correlation:
cor(scale(a), scale(b))
## [1] 0.4326075
The scale function de-means its argument and scales it to have unit variance.
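Concretely, scale(x) returns (x - mean(x)) / sd(x) as a one-column matrix. Continuing with the simulated x from above:
all.equal(as.vector(scale(x)), (x - mean(x)) / sd(x))  # TRUE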
These operations don’t change the correlation of a and b.
But they do change the correlation of alpha and beta:
alpha = c(scale(a), scale(b))
beta = c(scale(b), scale(a))
cor(alpha, beta)
## [1] 0.4326075
Now the two correlations agree!
I came across this phenomenon while writing my previous post, in which I discuss the degree assortativity among nodes in Zachary’s (1977) karate club network.
One way to measure this assortativity is to use the assortativity_degree function in igraph:
library(igraph)
G = make_graph('Zachary')
assortativity_degree(G)
## [1] -0.4756131
This function returns the correlation of the degrees of adjacent nodes in G.
Another way to compute this correlation is to
- construct a matrix el in which rows correspond to edges and columns list incident nodes;
- define the vectors d1 and d2 of degrees among the nodes listed in el;
- compute the correlation of d1 and d2 using cor.
Here’s what I get when I take those three steps:
el = as_edgelist(G)
d = degree(G)
d1 = d[el[, 1]] # Ego degrees
d2 = d[el[, 2]] # Alter degrees
cor(d1, d2)
## [1] -0.4769563
Notice that cor(d1, d2) disagrees with the value of assortativity_degree(G) computed above.
This is because the vectors d1 and d2 have different means and variances:
c(mean(d1), mean(d2))
## [1] 7.487179 8.051282
c(var(d1), var(d2))
## [1] 25.94139 32.23110
These differences come from el listing each edge only once: it includes a row c(i, j) for the edge between nodes \(i\) and \(j\not=i\), but not a row c(j, i). In contrast, assortativity_degree accounts for edges being undirected by, in effect, adding the row c(j, i) before computing the correlation.
This is analogous to the “append the mirrored data” step I took to create \((\alpha_i,\beta_i)_{i=1}^{2n}\) above.
Appending the mirror of el to itself before computing cor(d1, d2) returns the same value as assortativity_degree(G):
el = rbind(
el,
matrix(c(el[, 2], el[, 1]), ncol = 2) # el's mirror
)
d1 = d[el[, 1]]
d2 = d[el[, 2]]
c(assortativity_degree(G), cor(d1, d2))
## [1] -0.4756131 -0.4756131
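As a final sanity check, the mirrored d1 and d2 are permutations of each other, so they now have identical means and variances, which is exactly the condition under which the two computations agree:
all.equal(mean(d1), mean(d2))  # TRUE
all.equal(var(d1), var(d2))    # TRUE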