IndianVillages is a new R package containing data on social networks in rural India. I derived these data from Banerjee et al.’s (2013) surveys of households across 75 Karnatakan villages. This post describes the derived data and the networks they define. I also show that the networks are assortatively mixed with respect to caste.
Data description
IndianVillages provides two tables.
The first, `households`, links each household to its village and caste:
library(dplyr)
library(IndianVillages)
head(households)
## # A tibble: 6 × 3
## hhid village caste
## <dbl> <dbl> <chr>
## 1 1001 1 <NA>
## 2 1002 1 <NA>
## 3 1003 1 <NA>
## 4 1004 1 <NA>
## 5 1005 1 <NA>
## 6 1006 1 <NA>
The `hhid` and `village` columns store household and village IDs.
The `caste` column stores caste memberships:
count(households, caste, sort = TRUE)
## # A tibble: 6 × 2
## caste n
## <chr> <int>
## 1 OBC 5517
## 2 <NA> 4455
## 3 Scheduled Caste 2584
## 4 General 1371
## 5 Scheduled Tribe 618
## 6 Minority 359
Some `caste` values are missing because the surveys were changed during their collection.
About 53% of the households with known castes are in the Other Backward Class (“OBC”).
This exceeds the (disputed) share of OBCs in India’s general population during the survey period.
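The 53% figure follows directly from the counts above. A quick check, with the counts copied by hand from the table (the names `caste_counts` and `known_shares` are mine, not part of the package):

```r
# Household counts by caste, copied from the count() output above.
# The 4,455 households with missing caste are left out on purpose.
caste_counts = c(
  OBC = 5517,
  `Scheduled Caste` = 2584,
  General = 1371,
  `Scheduled Tribe` = 618,
  Minority = 359
)

# Share of each caste among households with a known caste, in percent.
known_shares = 100 * caste_counts / sum(caste_counts)
round(known_shares, 1)
```

The denominator is the 10,449 households with a known caste, so the shares sum to 100.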
The second table, `household_relationships`, contains information on inter-household relationships:
head(household_relationships)
## # A tibble: 6 × 4
## hhid.x hhid.y village type
## <dbl> <dbl> <dbl> <fct>
## 1 1001 1002 1 Help with a decision
## 2 1001 1002 1 Borrow kerosene or rice from
## 3 1001 1002 1 Lend kerosene or rice to
## 4 1001 1002 1 Are related to
## 5 1001 1002 1 Invite to one's home
## 6 1001 1002 1 Visit in another's home
The `hhid.x` and `hhid.y` columns store ego and alter household IDs.
The `type` column stores relationship types:
count(household_relationships, type, sort = TRUE)
## # A tibble: 12 × 2
## type n
## <fct> <int>
## 1 Visit in another's home 33629
## 2 Invite to one's home 32652
## 3 Engage socially with 30939
## 4 Borrow money from 25514
## 5 Lend kerosene or rice to 23993
## 6 Borrow kerosene or rice from 23743
## 7 Lend money to 23558
## 8 Obtain medical advice from 22310
## 9 Help with a decision 17228
## 10 Are related to 16037
## 11 Give advice to 15613
## 12 Go to temple with 2700
These types correspond to questions asked in Banerjee et al.’s surveys.
Inter-household networks
We can use `households` and `household_relationships` to define social networks among the households in each village.
First, use the `graph_from_data_frame` function from igraph to create the network among all households:
library(igraph)
net = graph_from_data_frame(
  distinct(household_relationships, hhid.x, hhid.y),
  directed = FALSE,
  vertices = households
)
`net` contains 66,862 edges: one for each pair of households with at least one social relationship.
There are no between-village relationships in the data, so we can partition `net` into village-specific networks without deleting any edges:
library(purrr)
villages = sort(unique(households$village))
village_nets = map(villages, ~subgraph(net, V(net)$village == .))
sum(map_dbl(village_nets, gsize)) # Same as gsize(net)
## [1] 66862
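The claim that no relationship crosses a village boundary can also be checked directly, by looking up each endpoint's village and comparing. A minimal sketch on made-up data (the toy data frames and the `crosses` vector are illustrative; the column names match the package tables, so the same lookup should carry over to `households` and `household_relationships`):

```r
# Toy stand-ins for the package tables (values invented for illustration).
toy_households = data.frame(
  hhid = c(1001, 1002, 2001, 2002),
  village = c(1, 1, 2, 2)
)
toy_relationships = data.frame(
  hhid.x = c(1001, 2001, 1002),
  hhid.y = c(1002, 2002, 2001)  # the last row links villages 1 and 2
)

# Look up the village of each endpoint by matching on household ID.
village_x = toy_households$village[match(toy_relationships$hhid.x, toy_households$hhid)]
village_y = toy_households$village[match(toy_relationships$hhid.y, toy_households$hhid)]

# TRUE marks a relationship whose endpoints sit in different villages.
crosses = village_x != village_y
sum(crosses)
```

The toy data deliberately include one cross-village row. A count of zero on the real tables would be consistent with the edge totals matching above.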
The networks in `village_nets` are too large to describe visually.
Instead, let’s compute some of their properties:
village_nets_properties = map_df(village_nets, ~{
  comp = components(.)
  giant = subgraph(., comp$membership == which.max(comp$csize))
  tibble(
    Households = gorder(.),
    `Mean degree` = mean(degree(.)),
    `% of households in giant` = 100 * gorder(giant) / gorder(.),
    `Mean distance in giant` = mean_distance(giant)
  )
})
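To make the giant-component step concrete, here is the same sequence of calls on a six-vertex toy graph. The graph is invented for illustration, and I use `induced_subgraph` with explicit vertex indices, which I assume behaves like the `subgraph` call above:

```r
library(igraph)

# Toy graph: a path 1-2-3-4 plus a separate pair 5-6.
toy = make_graph(c(1, 2, 2, 3, 3, 4, 5, 6), directed = FALSE)

comp = components(toy)
# Keep only the vertices in the largest connected component.
giant = induced_subgraph(toy, which(comp$membership == which.max(comp$csize)))

toy_properties = c(
  households = gorder(toy),
  mean_degree = mean(degree(toy)),
  pct_in_giant = 100 * gorder(giant) / gorder(toy),
  mean_distance_in_giant = mean_distance(giant)
)
toy_properties
```

The path holds four of the six vertices, so it is the giant component. The mean distance is taken over pairs inside that component only, so the separate pair does not affect it.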
I summarize these properties in the table below. The number of households in each village ranges from 77 to 356. The mean degree of the households in each village ranges from 6.11 to 13.44. Most households are in the giant component for their village, and are connected to others in that component via paths of length two or three.
Property | Mean | Std. dev. | Min. | Median | Max. |
---|---|---|---|---|---|
Households | 198.72 | 59.29 | 77.00 | 190.00 | 356.00 |
Mean degree | 8.90 | 1.61 | 6.11 | 8.72 | 13.44 |
% of households in giant | 95.10 | 2.71 | 84.62 | 95.54 | 99.42 |
Mean distance in giant | 2.75 | 0.21 | 2.30 | 2.72 | 3.32 |
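A table like the one above can be produced by applying the same five summary functions to each column of `village_nets_properties`. A sketch on a small invented data frame (the values and the `summarise_column` helper are hypothetical, not the code behind the table):

```r
# Invented per-village properties standing in for village_nets_properties.
toy_village_properties = data.frame(
  Households = c(120, 200, 310),
  `Mean degree` = c(7.5, 9.0, 10.5),
  check.names = FALSE
)

# Hypothetical helper: five summary statistics for one numeric column.
summarise_column = function(x) {
  c(Mean = mean(x), `Std. dev.` = sd(x), `Min.` = min(x),
    Median = median(x), `Max.` = max(x))
}

# One row per property, one column per statistic.
summary_table = t(sapply(toy_village_properties, summarise_column))
round(summary_table, 2)
```

Each row of `summary_table` describes how one property varies across villages, which is how the table above should be read.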
Inter-caste mixing
We can use `net` to study the extent of assortative mixing with respect to caste membership.
First, delete the 4,455 households with missing `caste` values:
subnet = subgraph(net, !is.na(V(net)$caste))
`subnet` contains 10,449 households with a mean degree of 9.08.
This is similar to the mean degree in `net`.
The two networks also have similar mean distances between connected households: 2.85 in `subnet`, versus 2.81 in `net`.
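The mean degree in `net` is not printed above, but it can be recovered from numbers that are. Mean degree equals twice the edge count divided by the vertex count. A quick check, assuming every row of `households` becomes a vertex of `net` (the variable names are mine):

```r
# Vertex count: one per household, summing the caste counts reported earlier
# (including the 4,455 households with missing caste).
n_households = 5517 + 4455 + 2584 + 1371 + 618 + 359

# Edge count reported for net.
n_edges = 66862

# Each edge adds one to the degree of both of its endpoints,
# so the sum of degrees is twice the number of edges.
mean_degree_net = 2 * n_edges / n_households
round(mean_degree_net, 2)
```

This gives roughly 8.97, close to the 9.08 reported for the network without missing castes.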
Next, compute `subnet`’s mixing matrix:
library(bldr) # https://github.com/bldavies/bldr
mix_mat = get_mixing_matrix(subnet, 'caste')
I define `get_mixing_matrix` here.
It returns a matrix in which rows and columns correspond to castes, and entries equal the share of edges joining households in each caste pair.
Multiplying these entries by the sum of degrees (which, by the degree sum formula, equals twice the number of edges) yields a table of inter-caste edge counts, with each within-caste edge counted twice on the diagonal:
mix_mat * (2 * gsize(subnet))
##
## General Minority OBC Sch. Caste Sch. Tribe
## General 8680 79 3118 932 521
## Minority 79 1860 381 156 84
## OBC 3118 381 40058 4325 2241
## Sch. Caste 932 156 4325 16074 910
## Sch. Tribe 521 84 2241 910 2722
For example, `subnet` contains 3,118 edges between households in general castes and households in OBC castes.
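To illustrate the definition, here is one way such a matrix could be built from an edge list in base R. This is a sketch under my own assumptions, not the `get_mixing_matrix` source: the function name `mixing_matrix_sketch`, its arguments, and the toy data are all hypothetical. It counts each edge once in each direction, so the entries sum to one and within-group edges appear twice on the diagonal.

```r
# Hypothetical sketch of a mixing matrix for an undirected network.
# from, to: vertex indices for each edge; group: one label per vertex.
mixing_matrix_sketch = function(from, to, group) {
  group_levels = sort(unique(group))
  # Count every edge in both directions so the result is symmetric.
  ends_a = factor(group[c(from, to)], levels = group_levels)
  ends_b = factor(group[c(to, from)], levels = group_levels)
  counts = table(ends_a, ends_b)
  counts / sum(counts)  # sum(counts) equals twice the number of edges
}

# Toy network: vertices 1-3 are in group "A", vertices 4-5 in group "B".
toy_group = c("A", "A", "A", "B", "B")
toy_from = c(1, 2, 3, 4)
toy_to = c(2, 3, 4, 5)

toy_mix = mixing_matrix_sketch(toy_from, toy_to, toy_group)
toy_mix * (2 * length(toy_from))  # rescale shares back to counts
```

In the toy network, the single A-B edge shows up as a 1 in each off-diagonal cell, while the two A-A edges show up as a 4 on the diagonal.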
We can measure the extent of assortative mixing by comparing `mix_mat` to the matrix we’d expect if edges were independent of caste.
This matrix equals the outer product of the row and column sums of `mix_mat`:
mix_mat_indep = rowSums(mix_mat) %*% t(colSums(mix_mat))
Comparing the traces of `mix_mat` and `mix_mat_indep` allows us to measure mixing overall:
tr = function(m) sum(diag(m))
c(tr(mix_mat), tr(mix_mat_indep))
## [1] 0.7313254 0.3598672
So `subnet` contains about twice as many within-caste edges as we’d expect if edges were independent of caste.
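The two traces can also be combined into a single number. One common normalization, which this post does not compute and which I add here only as an illustration, is Newman's assortativity coefficient: the excess of the observed trace over the independent trace, divided by the largest possible excess. Using the printed values (variable names are mine):

```r
# Traces printed above: observed, and expected under independence.
trace_observed = 0.7313254
trace_indep = 0.3598672

# Ratio behind the "about twice as many" statement.
trace_ratio = trace_observed / trace_indep
round(trace_ratio, 2)

# Assortativity coefficient: 0 means no assortative mixing, 1 means
# every edge joins two households of the same caste.
assortativity = (trace_observed - trace_indep) / (1 - trace_indep)
round(assortativity, 2)
```

The ratio is about 2.03 and the coefficient about 0.58, both pointing to strong but not complete sorting by caste.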
We can also compare `mix_mat` and `mix_mat_indep` element-wise to assess which inter-caste relationships are most over-represented:
round(mix_mat / mix_mat_indep, 2)
##
## General Minority OBC Sch. Caste Sch. Tribe
## General 4.64 0.22 0.44 0.30 0.57
## Minority 0.22 26.93 0.28 0.26 0.48
## OBC 0.44 0.28 1.51 0.37 0.65
## Sch. Caste 0.30 0.26 0.37 3.04 0.60
## Sch. Tribe 0.57 0.48 0.65 0.60 6.15
So, for example, there are about 51% more OBC-OBC edges than we’d expect if edges were independent of caste, but less than half as many general-OBC edges.
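These ratios can be reproduced from the edge-count table alone, without the underlying network. A sketch that re-enters the counts by hand (the object names `counts`, `mix`, `indep`, and `ratios` are mine) and uses `outer` in place of the matrix product:

```r
castes = c("General", "Minority", "OBC", "Sch. Caste", "Sch. Tribe")

# Edge counts copied from the table earlier in this section.
# Diagonal entries count each within-caste edge twice.
counts = matrix(
  c(8680,   79,  3118,   932,  521,
      79, 1860,   381,   156,   84,
    3118,  381, 40058,  4325, 2241,
     932,  156,  4325, 16074,  910,
     521,   84,  2241,   910, 2722),
  nrow = 5, byrow = TRUE, dimnames = list(castes, castes)
)

mix = counts / sum(counts)                 # convert counts back to shares
indep = outer(rowSums(mix), colSums(mix))  # expected shares under independence

ratios = mix / indep
round(ratios, 2)
```

The traces of `mix` and `indep` should also agree with the two values printed earlier, which is a useful check that the counts were copied correctly.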