IndianVillages is a new R package containing data on social networks in rural India. I derived these data from Banerjee et al.’s (2013) surveys of households across 75 Karnatakan villages. This post describes the derived data and the networks they define. I also show that the networks are assortatively mixed with respect to caste.

Data description

IndianVillages provides two tables. The first, households, links each household to its village and caste:

library(dplyr)
library(IndianVillages)

head(households)
## # A tibble: 6 × 3
##    hhid village caste
##   <dbl>   <dbl> <chr>
## 1  1001       1 <NA> 
## 2  1002       1 <NA> 
## 3  1003       1 <NA> 
## 4  1004       1 <NA> 
## 5  1005       1 <NA> 
## 6  1006       1 <NA>

The hhid and village columns store household and village IDs. The caste column stores caste memberships:

count(households, caste, sort = TRUE)
## # A tibble: 6 × 2
##   caste               n
##   <chr>           <int>
## 1 OBC              5517
## 2 <NA>             4455
## 3 Scheduled Caste  2584
## 4 General          1371
## 5 Scheduled Tribe   618
## 6 Minority          359

Some caste values are missing because the surveys were changed during their collection. About 53% of the households with known castes are in the Other Backward Class (“OBC”). This exceeds the (disputed) share of OBCs in India’s general population during the survey period.
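The 53% figure follows from the counts above. Here is a minimal check, with the counts copied by hand from the count() output rather than read from the package (the names caste_counts and known are introduced only for this illustration):

```r
# Counts of households with known castes, copied from the output above
caste_counts = c(
  OBC = 5517,
  `Scheduled Caste` = 2584,
  General = 1371,
  `Scheduled Tribe` = 618,
  Minority = 359
)

known = sum(caste_counts)          # 10,449 households with known castes
round(100 * caste_counts / known)  # percentage shares; OBC is about 53
```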

The second table, household_relationships, contains information on inter-household relationships:

head(household_relationships)
## # A tibble: 6 × 4
##   hhid.x hhid.y village type                        
##    <dbl>  <dbl>   <dbl> <fct>                       
## 1   1001   1002       1 Help with a decision        
## 2   1001   1002       1 Borrow kerosene or rice from
## 3   1001   1002       1 Lend kerosene or rice to    
## 4   1001   1002       1 Are related to              
## 5   1001   1002       1 Invite to one's home        
## 6   1001   1002       1 Visit in another's home

The hhid.x and hhid.y columns store ego and alter household IDs. The type column stores relationship types:

count(household_relationships, type, sort = TRUE)
## # A tibble: 12 × 2
##    type                             n
##    <fct>                        <int>
##  1 Visit in another's home      33629
##  2 Invite to one's home         32652
##  3 Engage socially with         30939
##  4 Borrow money from            25514
##  5 Lend kerosene or rice to     23993
##  6 Borrow kerosene or rice from 23743
##  7 Lend money to                23558
##  8 Obtain medical advice from   22310
##  9 Help with a decision         17228
## 10 Are related to               16037
## 11 Give advice to               15613
## 12 Go to temple with             2700

These types correspond to questions asked in Banerjee et al.’s surveys.

Inter-household networks

We can use households and household_relationships to define social networks among the households in each village. First, use the graph_from_data_frame function from igraph to create the network among all households:

library(igraph)

# One undirected edge per pair of households with at least one relationship
net = graph_from_data_frame(
  distinct(household_relationships, hhid.x, hhid.y),
  directed = FALSE,
  vertices = households
)
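To see what the distinct() call does, consider a toy version of the same construction. The tables toy_households and toy_relationships are made up for illustration and are not part of IndianVillages. Households 1 and 2 share two relationship types, but the resulting network has only one edge between them:

```r
library(dplyr)
library(igraph)

# Hypothetical toy data: three households in one village
toy_households = tibble(
  hhid = c(1, 2, 3),
  village = 1,
  caste = c('A', 'A', 'B')
)

# Households 1 and 2 appear twice, once per relationship type
toy_relationships = tibble(
  hhid.x = c(1, 1, 2),
  hhid.y = c(2, 2, 3),
  type = c('Lend money to', 'Are related to', 'Give advice to')
)

toy_net = graph_from_data_frame(
  distinct(toy_relationships, hhid.x, hhid.y),
  directed = FALSE,
  vertices = toy_households
)

gorder(toy_net)  # number of households: 3
gsize(toy_net)   # number of connected pairs: 2
```

Dropping the type column before calling distinct() is what collapses multiple relationship types into a single edge.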

net contains 66,862 edges: one for each pair of households with at least one social relationship. There are no between-village relationships in the data, so we can partition net into village-specific networks without deleting any edges:

library(purrr)

villages = sort(unique(households$village))

village_nets = map(villages, ~subgraph(net, V(net)$village == .))

sum(map_dbl(village_nets, gsize))  # Same as gsize(net)
## [1] 66862
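The claim that no edges cross village boundaries can also be checked directly by comparing the villages at the two ends of each edge. Here is a sketch on a made-up five-household network. The graph toy and its village labels are hypothetical, and I use induced_subgraph, igraph's function for vertex-induced subgraphs:

```r
library(igraph)
library(purrr)

# Hypothetical toy network: households a-c in village 1, d-e in village 2
toy = graph_from_literal(a-b, b-c, d-e)
V(toy)$village = c(1, 1, 1, 2, 2)

# Compare the villages at the two ends of every edge
el = ends(toy, E(toy), names = FALSE)
crosses = V(toy)$village[el[, 1]] != V(toy)$village[el[, 2]]
sum(crosses)  # number of between-village edges

# With no such edges, the village subgraphs keep every edge
toy_nets = map(c(1, 2), ~induced_subgraph(toy, which(V(toy)$village == .)))
sum(map_dbl(toy_nets, gsize)) == gsize(toy)
```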

The networks in village_nets are too large to describe visually. Instead, let’s compute some of their properties:

village_nets_properties = map_df(village_nets, ~{
  comp = components(.)
  giant = subgraph(., comp$membership == which.max(comp$csize))
  tibble(
    Households = gorder(.),
    `Mean degree` = mean(degree(.)),
    `% of households in giant` = 100 * gorder(giant) / gorder(.),
    `Mean distance in giant` = mean_distance(giant)
  )
})

I summarize these properties in the table below. The number of households in each village ranges from 77 to 356. The mean degree of the households in each village ranges from 6.11 to 13.44. Most households are in the giant component for their village, and are connected to others in that component via paths of length two or three.

Property                    Mean  Std. dev.    Min.  Median    Max.
Households                198.72      59.29   77.00  190.00  356.00
Mean degree                 8.90       1.61    6.11    8.72   13.44
% of households in giant   95.10       2.71   84.62   95.54   99.42
Mean distance in giant      2.75       0.21    2.30    2.72    3.32
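A table like this can be built by applying the same five statistics to each column of village_nets_properties. The sketch below shows one way to do that, using a made-up three-village data frame (props) in place of the real one. The helper summarise_property is hypothetical:

```r
# Hypothetical stand-in for village_nets_properties (three made-up villages)
props = data.frame(
  Households = c(100, 200, 300),
  `Mean degree` = c(7, 9, 11),
  check.names = FALSE
)

# Summary statistics for one column
summarise_property = function(x) {
  c(Mean = mean(x), `Std. dev.` = sd(x), `Min.` = min(x),
    Median = median(x), `Max.` = max(x))
}

# One row per property, one column per statistic
props_summary = t(sapply(props, summarise_property))
round(props_summary, 2)
```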

Inter-caste mixing

We can use net to study the extent of assortative mixing with respect to caste membership. First, delete the 4,455 households with missing caste values:

subnet = subgraph(net, !is.na(V(net)$caste))

subnet contains 10,449 households with a mean degree of 9.08. This is similar to the mean degree in net. The two networks also have similar mean distances between connected households: 2.85 in subnet, versus 2.81 in net.

Next, compute subnet's mixing matrix:

library(bldr)  # https://github.com/bldavies/bldr

mix_mat = get_mixing_matrix(subnet, 'caste')

I define get_mixing_matrix in bldr. It returns a matrix in which rows and columns correspond to castes, and entries equal the share of edges joining households in each caste pair. Multiplying these entries by the sum of degrees (which, by the degree sum formula, equals twice the number of edges) yields a table of inter-caste edge counts:

mix_mat * (2 * gsize(subnet))
##             
##              General Minority   OBC Sch. Caste Sch. Tribe
##   General       8680       79  3118        932        521
##   Minority        79     1860   381        156         84
##   OBC           3118      381 40058       4325       2241
##   Sch. Caste     932      156  4325      16074        910
##   Sch. Tribe     521       84  2241        910       2722

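For example, subnet contains 3,118 edges between households in general castes and households in OBC castes. The diagonal entries count each within-caste edge twice, once from each endpoint, so the 8,680 in the General-General cell corresponds to 4,340 edges.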
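To make the construction concrete, here is a minimal sketch of how a mixing matrix with these properties could be computed. It is not the bldr implementation. The helper mixing_matrix_sketch and the toy graph are hypothetical:

```r
library(igraph)

# Hypothetical helper; a sketch, not the bldr implementation
mixing_matrix_sketch = function(g, attr) {
  x = vertex_attr(g, attr)
  el = ends(g, E(g), names = FALSE)
  lv = sort(unique(x))
  from = factor(x[el[, 1]], levels = lv)
  to = factor(x[el[, 2]], levels = lv)
  counts = table(from, to)
  counts = counts + t(counts)  # count each edge once per endpoint
  counts / sum(counts)         # shares sum to one
}

# Toy path a-b-c-d with two made-up castes
toy = graph_from_literal(a-b, b-c, c-d)
V(toy)$caste = c('X', 'X', 'Y', 'Y')

m = mixing_matrix_sketch(toy, 'caste')
m * (2 * gsize(toy))  # edge counts; diagonal entries are doubled
```

The toy network has one X-X edge, one X-Y edge, and one Y-Y edge, so the rescaled matrix has 2 in each diagonal cell and 1 in each off-diagonal cell.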

We can measure the extent of assortative mixing by comparing mix_mat to the matrix we’d expect if edges were independent of caste. This matrix equals the outer product of the row and column sums of mix_mat:

mix_mat_indep = rowSums(mix_mat) %*% t(colSums(mix_mat))
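In symbols, using notation introduced here only for illustration: let e_ij denote the entry of mix_mat for castes i and j, and let a_i and b_j denote its row and column sums. Then the independence matrix and its trace are

```latex
a_i = \sum_j e_{ij}, \qquad
b_j = \sum_i e_{ij}, \qquad
e^{\mathrm{indep}}_{ij} = a_i \, b_j, \qquad
\operatorname{tr}\left(e^{\mathrm{indep}}\right) = \sum_i a_i \, b_i .
```

Because the network is undirected, mix_mat is symmetric, so a_i = b_i and the trace reduces to the sum of squared row sums.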

Comparing the traces of mix_mat and mix_mat_indep allows us to measure mixing overall:

tr = function(m) sum(diag(m))

c(tr(mix_mat), tr(mix_mat_indep))
## [1] 0.7313254 0.3598672

So subnet contains about twice as many within-caste edges as we’d expect if edges were independent of caste.
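These traces can be reproduced from the edge-count table alone, without access to subnet. In the sketch below I copy the counts by hand from the table above. The names counts, mix_mat_check, and indep_check are introduced only for this check:

```r
# Edge counts copied from the table above (rows and columns in the same order)
castes = c('General', 'Minority', 'OBC', 'Sch. Caste', 'Sch. Tribe')
counts = matrix(
  c(8680,   79,  3118,   932,  521,
      79, 1860,   381,   156,   84,
    3118,  381, 40058,  4325, 2241,
     932,  156,  4325, 16074,  910,
     521,   84,  2241,   910, 2722),
  nrow = 5, byrow = TRUE, dimnames = list(castes, castes)
)

# Rebuild the mixing matrix and its independence counterpart
mix_mat_check = counts / sum(counts)
indep_check = rowSums(mix_mat_check) %*% t(colSums(mix_mat_check))

tr = function(m) sum(diag(m))
c(tr(mix_mat_check), tr(indep_check))  # about 0.731 and 0.360
tr(mix_mat_check) / tr(indep_check)    # about 2.03
```

The counts sum to 94,888, which is twice the number of edges in subnet and matches a mean degree of 9.08 across 10,449 households.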

We can also compare mix_mat and mix_mat_indep element-wise to assess which caste pairs are over- or under-represented:

round(mix_mat / mix_mat_indep, 2)
##             
##              General Minority   OBC Sch. Caste Sch. Tribe
##   General       4.64     0.22  0.44       0.30       0.57
##   Minority      0.22    26.93  0.28       0.26       0.48
##   OBC           0.44     0.28  1.51       0.37       0.65
##   Sch. Caste    0.30     0.26  0.37       3.04       0.60
##   Sch. Tribe    0.57     0.48  0.65       0.60       6.15

So, for example, there are about 51% more OBC-OBC edges than we’d expect if edges were independent of caste, but fewer than half as many general-OBC edges.