In my last post, I compared measures of similarity among college degree fields. My goal in this post is to partition the set of fields such that each field has greater within-part similarities than between-part similarities. One approach is to hierarchically cluster fields based on their similarities, producing a dendrogram that can be cut at different heights to obtain different partitions. Generating the dendrogram restricts my choice set but, ultimately, I still have to choose which partition is “best.”
The intellectually honest way forward is to define an objective function on the set of partitions and choose the partition that attains the function’s maximum. One such function is network modularity, which captures the extent to which groups of nodes are densely connected internally but sparsely connected to each other. Ranking partitions by modularity removes the need for supervision: rather than making a subjective, potentially biased judgment on which partition is “best,” I simply choose the partition that maximises modularity.
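To make the objective concrete, here is a minimal sketch of weighted modularity in its standard Newman-Girvan form: for each community, the share of total edge weight that falls inside it, minus the share expected if edge ends were matched at random. The function name `modularity` and the `edges` and `community_of` data structures are illustrative assumptions, not taken from the analysis itself.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Weighted modularity of a partition of an undirected network.

    edges: dict mapping (u, v) pairs to positive weights, each edge listed once.
    community_of: dict mapping each node to its community label.
    """
    total = sum(edges.values())      # total edge weight in the network
    within = defaultdict(float)      # edge weight inside each community
    strength = defaultdict(float)    # summed node strengths of each community
    for (u, v), w in edges.items():
        strength[community_of[u]] += w
        strength[community_of[v]] += w
        if community_of[u] == community_of[v]:
            within[community_of[u]] += w
    return sum(
        within[c] / total - (strength[c] / (2 * total)) ** 2
        for c in strength
    )


# Toy example: two triangles joined by a single bridge (c, d).
toy_edges = {
    ("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
    ("c", "d"): 1,
    ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
}
two_parts = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_part = {node: 0 for node in two_parts}
```

Splitting the toy network at the bridge yields a positive modularity, whereas lumping every node into one part yields zero, so the objective rewards the split.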
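Unfortunately, maximising modularity is hard. In most cases, finding the globally optimal partition is infeasible and a heuristic algorithm must be used to find an approximate solution. Clauset et al. (2004) suggest a greedy algorithm: start with each node in its own community, then repeatedly merge the pair of communities whose merger increases modularity the most (or decreases it the least), and keep the partition with the highest modularity encountered along the way.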
The term “community” refers to a set of nodes and stems from the use of network science to probe the community structure of social interactions.
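The greedy agglomeration can be sketched in a few lines. This is a naive illustration of the idea rather than Clauset et al.'s implementation, which relies on more efficient data structures; the function name `greedy_modularity` and the edge-dictionary input are assumptions made for the example, and self-loops are assumed absent.

```python
def greedy_modularity(edges):
    """Naive greedy modularity agglomeration on a weighted undirected network.

    edges: dict mapping (u, v) pairs to positive weights, each edge listed
    once, with no self-loops. Returns (best_partition, best_modularity).
    """
    m = sum(edges.values())
    members = {}    # community label -> set of nodes
    strength = {}   # community label -> summed node strengths
    between = {}    # frozenset({c, d}) -> weight linking communities c and d
    for (u, v), w in edges.items():
        for node in (u, v):
            members.setdefault(node, {node})
            strength[node] = strength.get(node, 0.0) + w
        key = frozenset((u, v))
        between[key] = between.get(key, 0.0) + w

    def gain(pair):
        # Change in modularity from merging the two communities in `pair`.
        c, d = tuple(pair)
        return between[pair] / m - strength[c] * strength[d] / (2 * m * m)

    # Every node starts in its own community, so no weight lies within parts.
    q = -sum((s / (2 * m)) ** 2 for s in strength.values())
    best_q, best = q, [set(nodes) for nodes in members.values()]
    while between:
        pair = max(between, key=gain)   # best available merger
        q += gain(pair)
        c, d = tuple(pair)
        members[c] |= members.pop(d)    # absorb community d into c
        strength[c] += strength.pop(d)
        del between[pair]
        for other in list(between):     # re-point d's links to c
            if d in other:
                (x,) = other - {d}
                key = frozenset((c, x))
                between[key] = between.get(key, 0.0) + between.pop(other)
        if q > best_q:
            best_q, best = q, [set(nodes) for nodes in members.values()]
    return best, best_q


# Toy example: two triangles joined by a single bridge (c, d).
toy_edges = {
    ("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
    ("c", "d"): 1,
    ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
}
toy_partition, toy_q = greedy_modularity(toy_edges)
```

The loop keeps merging even after modularity starts to fall, which traces out a full dendrogram; the partition returned is the one at which modularity peaked.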
I apply Clauset et al.’s algorithm to the networks defined using the co-occurrence, Dice, Jaccard, Ochiai and overlap measures discussed in my previous post, as well as the unweighted network in which fields are adjacent if at least one graduate studied them both. The table below presents the number and size of communities detected in each network, and the corresponding maximised modularity values.
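For reference, the edge weights in these networks can be computed from three counts per pair of fields. The sketch below assumes the standard set-based definitions of each measure, which may differ in detail from those in the previous post; the function name `edge_weights` and its arguments are hypothetical.

```python
from math import sqrt


def edge_weights(n_a, n_b, n_ab):
    """Similarity between two fields under each measure.

    n_a, n_b: numbers of graduates who studied each field (both positive).
    n_ab: number of graduates who studied both fields.
    """
    return {
        "co-occurrence": n_ab,
        "dice": 2 * n_ab / (n_a + n_b),
        "jaccard": n_ab / (n_a + n_b - n_ab),
        "ochiai": n_ab / sqrt(n_a * n_b),
        "overlap": n_ab / min(n_a, n_b),
        # Unweighted network: adjacent if at least one graduate studied both.
        "unweighted": 1 if n_ab > 0 else 0,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = edge_weights(n_a=100, n_b=50, n_ab=20)
```

All four normalised measures lie between zero and one, and they differ only in how they scale the co-occurrence count by the sizes of the two fields.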
| Network | Communities | Fields | Community sizes (millions of graduates) | Modularity |
Clauset et al.’s algorithm detects eight communities in the Dice, Jaccard, Ochiai and overlap similarity networks, with each community containing between nine and 50 fields. The Jaccard measure delivers the greatest maximum modularity. Ignoring edge weights makes within- and between-part connections harder to separate, leading to few communities being detected.
I identify the “representatives” of each community as the fields with the largest ratios of mean within-community similarity to mean between-community similarity. I transform these ratios by taking their natural logarithm in order to rein in the extreme values caused by near-zero divisors. The following bar chart presents the representatives of each community detected in the Jaccard similarity network.
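The log-ratio calculation can be sketched as follows. The function names, the pairwise `similarity` dictionary and the choice to treat missing pairs as having zero similarity are all assumptions made for illustration; the sketch also assumes every field has at least one community-mate and non-zero mean similarities, since otherwise the ratio is undefined.

```python
from math import log


def log_similarity_ratio(field, similarity, community_of):
    """Log of mean within- over mean between-community similarity for a field.

    similarity: dict mapping frozenset({u, v}) to a similarity value;
    pairs missing from the dict are treated as having zero similarity.
    community_of: dict mapping each field to its community label.
    """
    within, between = [], []
    for other, label in community_of.items():
        if other == field:
            continue
        s = similarity.get(frozenset((field, other)), 0.0)
        (within if label == community_of[field] else between).append(s)
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return log(mean_within / mean_between)


def representatives(similarity, community_of, k=3):
    """The k fields with the largest log ratios in each community."""
    scored = {}
    for field, label in community_of.items():
        score = log_similarity_ratio(field, similarity, community_of)
        scored.setdefault(label, []).append((score, field))
    return {
        label: [field for _, field in sorted(pairs, reverse=True)[:k]]
        for label, pairs in scored.items()
    }


# Hypothetical similarities among five fields in two communities.
toy_similarity = {
    frozenset(pair): value
    for pair, value in {
        ("a", "b"): 0.8, ("a", "c"): 0.4, ("b", "c"): 0.2, ("d", "e"): 0.5,
        ("a", "d"): 0.1, ("a", "e"): 0.1, ("b", "d"): 0.2, ("b", "e"): 0.4,
        ("c", "d"): 0.3, ("c", "e"): 0.3,
    }.items()
}
toy_communities = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
top_fields = representatives(toy_similarity, toy_communities, k=1)
```

In the toy data, field `a` is far more similar to its community-mates than to outsiders, so it ranks first in community 1.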
Communities 2, 3, 4, 5, 7 and 8 appear to capture fields related to business, engineering, media, education, agriculture and biology, respectively. Communities 1 and 6 are less clearly classifiable.
The table below presents the demographic compositions of the eight communities detected in the Jaccard similarity network. Community 3 contains nearly 30% of degree fields but only about 20% of graduates, and is the most male-dominated among the eight communities detected. Community 5 is the most female-dominated and has the highest mean age. Educational attainment is lowest in communities 2 and 4, and highest in community 8.
| Community | Fields | Total graduates (millions) | Mean graduate age | % of graduates female | % of graduates with post-graduate degree |