Identifying cliques and clique comembers in R using the SNA package

The R package sna provides a number of tools for analyzing social network data. This post reviews the function clique.census from the sna package and shows how it can be used to better understand the group structure among a list of network members.

Let’s start by creating a toy network. Say Henry, Sarah, Rick, Joe and Annie are all colleagues in a criminology department. Now imagine they’ve been asked to identify the most important people they collaborate with. The results of this fictitious effort are shown in the following edgelist.

edgelist <- matrix(c("Henry", "Sarah", "Henry", "Joe", "Sarah", "Joe", "Henry", "Sarah", "Henry", "Rick", "Sarah", "Rick", "Sarah", "Henry", "Sarah", "Rick", "Henry", "Rick", "Sarah", "Henry", "Sarah", "Joe", "Henry", "Joe", "Sarah", "Rick", "Sarah", "Annie", "Rick", "Annie"), ncol = 2)

Recall, a clique is a group where everyone in it “likes” everyone else. To identify cliques among our network of criminology researchers, we first transform the edgelist into a network object and then apply the sna function clique.census.

library("network")
library("sna")

net <- network(edgelist, matrix.type = "edgelist")  # Convert the edgelist to a network object
net_cc <- clique.census(net, mode = "graph", tabulate.by.vertex = FALSE, enumerate = TRUE, clique.comembership = "bysize")  # Identify cliques
net_cc$clique.count

 

[Output: net_cc$clique.count — clique counts by size, showing two cliques of size 3]

After applying the function clique.census, we see there were two cliques among our respondents, each involving three researchers.

To identify the comembers of these cliques, we inspect the contents of the variable net_cc$clique.comemb[3, , ].

To paraphrase the sna documentation, the variable net_cc$clique.comemb is a three-dimensional array of size (max clique size) x n x n. In this example, the largest cliques we observed involved three network members each. As such, the 3-clique comembership information is stored in the variable net_cc$clique.comemb[3, , ]. (Note: if we had observed one or more cliques with more than three members, say a 4-clique, we could examine their comembership using the variable net_cc$clique.comemb[4, , ].)

Notably, the format by which net_cc$clique.comemb[3, , ] organizes clique comembership takes some getting used to. In fact, the main point of this post is to explain this organizational scheme in a more everyday kind of way.

Given our criminologist cliques, here’s how we find out who was in them. Recall, the largest clique we observed contained three individuals. Further recall that our network only contained five respondents. As such, the matrix net_cc$clique.comemb[3, , ] is of size 5 x 5. This matrix mimics the network structure itself. That is, network members are listed along the rows and, in the exact same order, listed again across the columns. The values within the matrix then identify researchers who were in cliques together.

The matrix net_cc$clique.comemb[3, , ], with researchers listed alphabetically along both rows and columns:

      Annie Henry Joe Rick Sarah
Annie     0     0   0    0     0
Henry     0     2   1    1     2
Joe       0     1   1    0     1
Rick      0     1   0    1     1
Sarah     0     2   1    1     2

Let’s go through a couple columns together to better understand what this matrix is telling us exactly.

The values in the column Joe show who Joe was in a clique with. Zero values indicate who Joe was not in a clique with, while values greater than zero indicate who he was in a clique with. More than just showing who Joe was in a clique with, these values identify the different cliques he was a part of. Reading the column from top to bottom, Joe and Annie were in zero cliques together. Next we see that Joe was in a clique with Henry. Notably, Joe was in one clique with himself (i.e., there was only one clique that involved Joe). The rest of the column values show that Joe was in zero cliques with Rick and one clique with Sarah. In total, this column tells us there was one 3-clique that involved Joe (see Joe’s value for Joe) and that this clique involved the researchers Henry and Sarah.

Let’s look at a trickier column. The values in the column for Henry ranged from 0 to 2. As with Joe, the values show who Henry was in a clique with and how many times they were in the same clique. Going down the column, we see a zero value for Henry and Annie. That is, Henry and Annie were not part of the same clique. Notably, Henry is in two cliques with himself. That is, there were two cliques, each of which involved Henry. The rest of the values show that Henry was in one clique with Joe, another clique with Rick and two cliques with Sarah.

Combined, these column values tell us there were two 3-cliques: one clique involving Henry, Sarah and Joe and another involving Henry, Sarah and Rick.
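To make this reading-off process concrete, here is a small base-R sketch that rebuilds the comembership matrix from the values discussed above and extracts each researcher's clique partners. The matrix values are transcribed from the worked example, and the helper function cliquePartners is my own naming, not part of sna.

```r
# Reconstruction of the 3-clique comembership matrix discussed above
# (values transcribed from the worked example; order is alphabetical)
members <- c("Annie", "Henry", "Joe", "Rick", "Sarah")
comemb <- matrix(c(0, 0, 0, 0, 0,
                   0, 2, 1, 1, 2,
                   0, 1, 1, 0, 1,
                   0, 1, 0, 1, 1,
                   0, 2, 1, 1, 2),
                 nrow = 5, byrow = TRUE, dimnames = list(members, members))

# Helper (hypothetical): who shares at least one 3-clique with a given researcher?
cliquePartners <- function(cm, who) {
  others <- setdiff(rownames(cm), who)
  names(which(cm[others, who] > 0))
}

cliquePartners(comemb, "Joe")    # "Henry" "Sarah"
cliquePartners(comemb, "Henry")  # "Joe" "Rick" "Sarah"
```

Reading a column by hand and calling the helper should always agree, since both simply look for nonzero off-diagonal entries.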


Omit outdated records after adding amended records

In juvenile court, judges often have the power to amend charges applied to a case. A charge of “INTENT TO DISTRIBUTE” may, for instance, be reduced to something less severe, say “POSSESSION OF PARAPHERNALIA.” When this occurs, a new charge is added to the case record and, in some cases, the old charge is retained. This short post presents R code to omit outdated charges after amended charges have been added.

Data for this example consist of charges applied to a single court record. These data are provided below:

charge=c("Count 1", "Count 2", "Amended", "Count 3", "Amended")
section=c("21-5807(a)(3)", "21-5807(a)(3)", "21-5801(b)(4)", "21-5807(a)(3)", "21-5801(b)(4)")
date=c("09/13/15", "09/20/15", "04/04/16", "10/03/15", "04/04/16")
title=c("BURGLARY OF MOTOR VEHICLE", "BURGLARY OF MOTOR VEHICLE", "THEFT", "BURGLARY OF MOTOR VEHICLE", "THEFT")
acs=c("", "", "", "", "")
drug=c("", "", "", "", "")
pl=c("", "", "", "", "")
finding=c("DISMISS BY PROS", "", "DISMISS BY PROS", "", "DISMISS BY PROS")
tp=c("F", "F", "M", "F", "M")
lvl=c("9", "9", "A", "9", "A")
pn=c("N", "N", "N", "N", "N")
sentence_date=c("", "", "04/04/2016", "", "04/04/2016")

df <- data.frame(charge, section, date, title, acs, drug, pl, finding, tp, lvl, pn, sentence_date, stringsAsFactors=FALSE)

The goal here is to remove every charge that is followed by an amended charge. The following dataset illustrates the desired result:

charge=c("Count 1", "Amended", "Amended")
section=c("21-5807(a)(3)", "21-5801(b)(4)", "21-5801(b)(4)")
date=c("09/13/15", "04/04/16", "04/04/16")
title=c("BURGLARY OF MOTOR VEHICLE", "THEFT", "THEFT")
acs=c("", "", "")
drug=c("", "", "")
pl=c("", "", "")
finding=c("DISMISS BY PROS", "DISMISS BY PROS", "DISMISS BY PROS")
tp=c("F", "M", "M")
lvl=c("9", "A", "A")
pn=c("N", "N", "N")
sentence_date=c("", "04/04/2016", "04/04/2016")

df <- data.frame(charge, section, date, title, acs, drug, pl, finding, tp, lvl, pn, sentence_date, stringsAsFactors=FALSE)

One way to achieve this new dataset is to identify all charges that are followed by an amended charge and extract all charges except these original charges.

First, the positions of the updated charges are identified. The following expression identifies the updated charges as TRUE and all other charges as FALSE.

c(df$charge=="Amended")

Second, the positions of the updated charges are used to identify the positions of the original charges that were amended. This is accomplished by recognizing a pattern in the charges themselves. Updated charges in this dataset are always preceded by their original charge. Since the updated charges were identified as TRUE (and the non-updated charges identified as FALSE), shifting these values by one position identifies the original charges.

c(c(df$charge=="Amended")[-1], c(df$charge=="Amended")[1])

Finally, the positions of the original charges (i.e., the charges followed by the amended ones) are used to extract the desired dataset. The desired dataset consists of every charge not followed by an updated charge.

df[!c(c(df$charge=="Amended")[-1], c(df$charge=="Amended")[1]),]
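The shift-and-subset logic above can be wrapped in a small reusable function. This is only a sketch; the function name is my own, and it uses FALSE for the final position instead of wrapping the first value around, since the last charge can never be followed by another.

```r
dropAmendedOriginals <- function(d, label = "Amended") {
  isAmended <- d$charge == label
  # Shift the flags up one position: TRUE now marks charges followed by an
  # amendment. The last charge is never followed by anything, so it gets FALSE.
  followedByAmended <- c(isAmended[-1], FALSE)
  d[!followedByAmended, ]
}

# A minimal stand-in for the court record above
d <- data.frame(charge = c("Count 1", "Count 2", "Amended", "Count 3", "Amended"),
                title = c("BURGLARY OF MOTOR VEHICLE", "BURGLARY OF MOTOR VEHICLE",
                          "THEFT", "BURGLARY OF MOTOR VEHICLE", "THEFT"),
                stringsAsFactors = FALSE)
dropAmendedOriginals(d)$charge  # "Count 1" "Amended" "Amended"
```

Padding with FALSE avoids a subtle edge case: with the wraparound version, a record whose first row happened to be an amended charge would incorrectly drop the last row.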

Plot Network Data in R with iGraph

I recently had a conversation on Twitter about a plot I made a while back. Recall, the plot showed my Twitter network: my friends and my friends’ friends.


And here’s the R code:

#### Load R libraries
library("igraph")

#### Load edgelist
r <- read.csv(file="edgelist_friends.csv-03-25.csv",header=TRUE,stringsAsFactors=FALSE)[,-1]

#### Convert to graph object
gr <- graph.data.frame(r,directed=TRUE)

#### Describe graph
summary(gr)
ecount(gr) # Edge count
vcount(gr) # Node count
diameter(gr) # Network diameter
farthest.nodes(gr) # Nodes furthest apart
V(gr)$indegree = degree(gr,mode="in") # Calculate indegree

#### Plot graph
E(gr)$color = "gray"
E(gr)$width = .5
E(gr)$arrow.width = .25
V(gr)$label.color = "black"
V(gr)$color = "dodgerblue"
V(gr)$size = 4

set.seed(40134541)
l <- layout.fruchterman.reingold(gr)

pdf("network_friends_plot.pdf")
plot(gr,layout=l,rescale=TRUE,axes=FALSE,ylim=c(-1,1),asp=0,vertex.label=NA)
dev.off()
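The indegree computed above is stored but never used in the plot. One option, sketched here on a random stand-in graph (the original Twitter edgelist isn't available), is to scale vertex size by indegree so highly-followed accounts stand out:

```r
library("igraph")

set.seed(40134541)
gr <- erdos.renyi.game(20, 0.15, directed = TRUE)  # stand-in for the Twitter graph
V(gr)$indegree <- degree(gr, mode = "in")

# Scale node size by indegree (the 3 + 2*sqrt() scaling is an arbitrary choice)
V(gr)$size <- 3 + 2 * sqrt(V(gr)$indegree)

pdf("network_indegree_plot.pdf")
plot(gr, layout = layout.fruchterman.reingold(gr), vertex.label = NA)
dev.off()
```

The square root keeps a handful of very popular nodes from dominating the figure.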

Jitter scatterplot value positions with value labels in R using ggplot2

The following R code creates a scatterplot using ggplot2. Points on this plot are represented by identification numbers. The jitter option reduces overlap between these plotted labels.

#### Attach R libraries
library("ggplot2")


#### Generate random data set 
theData <- data.frame(id=1:20, xVar=sample(1:4, 20, replace=TRUE), yVar=sample(1:4, 20, replace=TRUE))


#### Plot scatterplot
set.seed(seed=658672)
p <- ggplot(theData)
p + theme_bw() + geom_text(aes(x=xVar,y=yVar,label=id),size=3,position=position_jitter(w=.2, h=.2))
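As an aside, newer versions of ggplot2 (3.0.0 and later) let position_jitter take a seed argument, which makes the jitter itself reproducible without a global set.seed() call. A sketch of the same plot in that style:

```r
library("ggplot2")

theData <- data.frame(id = 1:20,
                      xVar = sample(1:4, 20, replace = TRUE),
                      yVar = sample(1:4, 20, replace = TRUE))

# The seed argument to position_jitter (ggplot2 >= 3.0.0) fixes the jitter
# offsets, so redrawing the plot places the labels identically each time
p <- ggplot(theData) + theme_bw() +
  geom_text(aes(x = xVar, y = yVar, label = id), size = 3,
            position = position_jitter(width = .2, height = .2, seed = 658672))
```

Scoping the seed to the layer avoids side effects on any other random draws in the session.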

Convert Qualitative Codes into a Binary Response Matrix in R

Content analysis is a qualitative method for identifying themes among a collection of documents. The themes themselves are either derived from the content reviewed or specified a priori according to some established theoretical perspective or set of research questions. Documents are read, content is considered and themes (represented as letter “codes”) are applied. It’s not uncommon for documents to exhibit multiple themes. In this way, results from a content analysis are not unlike responses to the “select all that apply” type questions found in survey research. Once a set of documents is coded, it’s often of interest to know the proportion of times the codes were observed.

The following R code transforms codes on a set of documents, stored as a list of lists, into a binary matrix.

#### Load R packages
library("XLConnect")
library("stringr")
library("vcd")


#### Working directory
getwd()
setwd("C:/Users/chernoff/Desktop")


#### Read data
theData <- readWorksheet(loadWorkbook("dummyData.xlsx"),sheet="Sheet1",header=TRUE)


#### Parse codes
theData2 <- str_extract_all(theData$codes,"[a-zA-Z]")

codeList <- unique(unlist(theData2))

theData2 <- lapply(X=theData2,function(X) as.data.frame(matrix(as.numeric(codeList%in%X),ncol=length(codeList),dimnames=list(c(),codeList))))
theData3 <- do.call(rbind,theData2)

theData4 <- cbind(theData,theData3)

If we print the data to the screen, we see themes are represented as binary variables, where 1 indicates a theme was observed and 0 indicates it was not.

[Output: the coded documents with themes as binary columns]

Once the data are organized as a binary matrix, we can calculate column totals colSums(theData4[,codeList]) to see which themes were more popular and which ones were least popular.
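For readers without the spreadsheet, here is a self-contained base-R sketch of the same transformation, using made-up codes in place of the Excel data and strsplit in place of stringr:

```r
# Made-up stand-ins for the coded documents in dummyData.xlsx
codes <- c("a,b", "b", "a,c", "c", "a,b,c")

# Split each document's code string into its individual letter codes
parsed <- strsplit(codes, ",")
codeList <- sort(unique(unlist(parsed)))

# One row per document, one binary column per theme
binary <- t(sapply(parsed, function(x) as.numeric(codeList %in% x)))
colnames(binary) <- codeList

binary
colSums(binary)  # a=3, b=3, c=3
```

The %in% test per document is the same trick the stringr version uses; only the parsing step differs.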

[Output: column totals for each theme]

And lastly, if we want to get fancy, we can represent the data using a mosaic plot.

totals <- table(theData4$a,theData4$b,theData4$c,dnn=c("A","B","C"))

png("mosaicPlot.png")
mosaicplot(totals, sort=3:1, col=hcl(c(120,10)), main="Mosaic Plot")
dev.off()

[Figure: mosaic plot of theme combinations]

A mosaic plot shows the relative proportion of each theme compared to one or more of the other themes. The main two rows show the levels of theme B. The main two columns represent theme C’s levels. And the two columns within the two main columns represent the levels of A. By default, the label for theme A is not shown. The cell in the upper left-hand corner, i.e. cell (1,1), shows there were some but not many documents without any themes. Cells (1,3) and (1,4) show there were an equal number of documents with theme C as there were that involved themes A and C combined. The remaining cell in this first row (1,2) shows there were more documents pertaining solely to theme A than all other document types not containing theme B. Interpretations of the remaining rectangles follow similarly.

How to extract a network subgraph using R

In a previous post I wrote about highlighting a subgraph of a larger network graph. In response to this post, I was asked how to extract a subgraph from a larger graph while retaining all essential characteristics among the extracted nodes.

Vinay wrote:

Dear Will,
The code is well written and only highlights the members of a subgraph. I need to fetch them out from the main graph as a separate subgraph (including nodes and edges). Any suggestions please.

Thanks.

Extract subgraph
For a given list of subgraph members, we can extract their essential characteristics (i.e., tie structure and attributes) from a larger graph using the igraph function induced.subgraph(). For instance,

library(igraph)                   # Load R packages

set.seed(654654)                  # Set seed value, for reproducibility
g <- graph.ring(10)               # Generate random graph object
E(g)$label <- runif(10,0,1)       # Add an edge attribute

# Plot graph
png('graph.png')
par(mar=c(0,0,0,0))
plot.igraph(g)
dev.off()

g2 <- induced.subgraph(g, 1:7)    # Extract subgraph

# Plot subgraph
png('subgraph.png')
par(mar=c(0,0,0,0))
plot.igraph(g2)
dev.off()
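induced.subgraph() also accepts vertex sequences, so members can be selected by attribute rather than by index. A brief sketch (the grp attribute below is made up for illustration):

```r
library("igraph")

set.seed(654654)
g <- graph.ring(10)
V(g)$grp <- rep(c("core", "periphery"), each = 5)  # made-up vertex attribute

# Select members by attribute instead of by index
g2 <- induced.subgraph(g, V(g)[grp == "core"])
vcount(g2)  # 5
```

Edge and vertex attributes among the selected members carry over to the extracted graph, just as in the index-based example.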

Graph

Subgraph

Authorize a Twitter Data request in R

In order to get data from Twitter, you have to let them know that you are in fact authorized to do so. The first step is to create an application with Twitter; the interested reader should explore the section “Where do I create an application?” under the FAQ for instructions on how to do so. To authenticate a data request, we simply send the appropriate credentials to Twitter along with our request for data. The site dev.twitter.com provides useful step-by-step guides on how to authenticate data requests (see specifically the documents on creating a signature, percent encoding parameters and authorizing a request), which we’ll explore here in the context of R.

Install and load the R packages RCurl, bitops, digest, ROAuth and RJSONIO.

## Install R packages
install.packages('RCurl')
install.packages('bitops')
install.packages('digest')
install.packages('ROAuth')
install.packages('RJSONIO')

## Load R packages
library('bitops')
library('digest')
library('RCurl')
library('ROAuth')
library('RJSONIO')

Access your Twitter application’s oauth settings (available under the OAuth tools tab on your application page) and save them to a data frame object.

oauth <- data.frame(consumerKey='YoUrCoNsUmErKeY',consumerSecret='YoUrCoNsUmErSeCrEt',accessToken='YoUrAcCeSsToKeN',accessTokenSecret='YoUrAcCeSsToKeNsEcReT')

Each time you request data from Twitter you will need to provide them with a unique, randomly generated string of alphanumeric characters.

string <- paste(sample(c(letters[1:26],0:9),size=32,replace=T),collapse='') # Generate a random string of alphanumeric characters
string2 <- base64(string,encode=TRUE,mode='character') # Convert string to base64
nonce <- gsub('[^a-zA-Z0-9]','',string2,perl=TRUE) # Remove non-alphanumeric characters

Generate and save the current GMT system time in seconds. Each Twitter data request will want you to include the time in seconds at which the request was (approximately) made.

timestamp <- as.character(floor(as.numeric(as.POSIXct(Sys.time(),tz='GMT'))))

Order the key/value pairs by the first letter of each key. In the current example, you’ll notice that the labels to the left of the equal signs appear in ascending order. Once ordered, we percent encode the string to create a parameter string.

# Percent encode parameters 1
par1 <- '&resources=statuses'
par2 <- gsub(',','%2C',par1,perl=TRUE) # Percent encode par

# Percent encode parameters 2
# Order the key/value pairs by the first letter of each key
ps <- paste('oauth_consumer_key=',oauth$consumerKey,'&oauth_nonce=',nonce[1],'&oauth_signature_method=HMAC-SHA1&oauth_timestamp=',timestamp,'&oauth_token=',oauth$accessToken,'&oauth_version=1.0',par2,sep='')
ps2 <- gsub('%','%25',ps,perl=TRUE) 
ps3 <- gsub('&','%26',ps2,perl=TRUE)
ps4 <- gsub('=','%3D',ps3,perl=TRUE)
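The gsub chains above hand-roll percent encoding one character at a time. As an aside, base R’s URLencode() with reserved = TRUE covers the general case in one call; a sketch (the helper name is my own, not part of the post’s original code):

```r
# URLencode with reserved = TRUE percent-encodes RFC 3986 reserved
# characters (=, &, :, /, etc.) using uppercase hex, as OAuth expects
percentEncode <- function(x) URLencode(x, reserved = TRUE)

percentEncode("a=b&c=d")                  # "a%3Db%26c%3Dd"
percentEncode("https://api.twitter.com")  # "https%3A%2F%2Fapi.twitter.com"
```

Double-encoding, as in the ps2-ps4 steps, can still be done by calling the helper on an already-encoded string.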

The parameter string is then extended to include the HTTP method and the Twitter base URL so as to create a signature base string.

# Percent encode parameters 3
url1 <- 'https://api.twitter.com/1.1/application/rate_limit_status.json'
url2 <- gsub(':','%3A',url1,perl=TRUE) 
url3 <- gsub('/','%2F',url2,perl=TRUE) 

# Create signature base string
signBaseString <- paste('GET','&',url3,'&',ps4,sep='') 

We then create a signing key.

signKey <- paste(oauth$consumerSecret,'&',oauth$accessTokenSecret,sep='')

The signature base string and the signing key are used to create an oauth signature.

osign <- hmac(key=signKey,object=signBaseString,algo='sha1',serialize=FALSE,raw=TRUE)
osign641 <- base64(osign,encode=TRUE,mode='character')
osign642 <- gsub('/','%2F',osign641,perl=TRUE)
osign643 <- gsub('=','%3D',osign642,perl=TRUE)
osign644 <- gsub('[+]','%2B',osign643,perl=TRUE)

kv <-data.frame(nonce=nonce[1],timestamp=timestamp,osign=osign644[1])

These results can then be passed to the function getURL() to download the desired Twitter data, such as status rate limits.

fromJSON(getURL(paste('https://api.twitter.com/1.1/application/rate_limit_status.json?resources=statuses&oauth_consumer_key=',oauth$consumerKey,'&oauth_nonce=',kv$nonce,'&oauth_signature=',kv$osign,'&oauth_signature_method=HMAC-SHA1&oauth_timestamp=',kv$timestamp,'&oauth_token=',oauth$accessToken,'&oauth_version=1.0',sep='')))

Results
[Screenshot: the rate limit status returned as JSON]

To reduce repetition, I’ve wrapped the above code into an R function called keyValues.

keyValues <- function(httpmethod,baseurl,par1){	
# Generate a random string of letters and numbers
string <- paste(sample(c(letters[1:26],0:9),size=32,replace=T),collapse='') # Generate random string of alphanumeric characters
string2 <- base64(string,encode=TRUE,mode='character') # Convert string to base64
nonce <- gsub('[^a-zA-Z0-9]','',string2,perl=TRUE) # Remove non-alphanumeric characters

# Get the current GMT system time in seconds 
timestamp <- as.character(floor(as.numeric(as.POSIXct(Sys.time(),tz='GMT'))))

# Percent encode parameters 1
par2 <- gsub(',','%2C',par1,perl=TRUE) # Percent encode par

# Percent encode parameters 2
# Order the key/value pairs by the first letter of each key
ps <- paste('oauth_consumer_key=',oauth$consumerKey,'&oauth_nonce=',nonce[1],'&oauth_signature_method=HMAC-SHA1&oauth_timestamp=',timestamp,'&oauth_token=',oauth$accessToken,'&oauth_version=1.0',par2,sep='')
ps2 <- gsub('%','%25',ps,perl=TRUE) 
ps3 <- gsub('&','%26',ps2,perl=TRUE)
ps4 <- gsub('=','%3D',ps3,perl=TRUE)

# Percent encode parameters 3
#url1 <- 'https://api.twitter.com/1.1/application/rate_limit_status.json'
url1 <- baseurl
url2 <- gsub(':','%3A',url1,perl=TRUE) 
url3 <- gsub('/','%2F',url2,perl=TRUE) 

# Create signature base string
signBaseString <- paste(httpmethod,'&',url3,'&',ps4,sep='') 

# Create signing key
signKey <- paste(oauth$consumerSecret,'&',oauth$accessTokenSecret,sep='')

# oauth_signature
osign <- hmac(key=signKey,object=signBaseString,algo='sha1',serialize=FALSE,raw=TRUE)
osign641 <- base64(osign,encode=TRUE,mode='character')
osign642 <- gsub('/','%2F',osign641,perl=TRUE)
osign643 <- gsub('=','%3D',osign642,perl=TRUE)
osign644 <- gsub('[+]','%2B',osign643,perl=TRUE)

return(data.frame(hm=httpmethod,bu=baseurl,p=par1,nonce=nonce[1],timestamp=timestamp,osign=osign644[1]))
}

With this function, Twitter data requests require less code on the user’s end. Make sure to run the keyValues() function and the fromJSON() function within a few seconds of each other, or else Twitter won’t respect your data request.

## Check rate limits
# Parameter options: help, users, search, statuses
kv <- keyValues(httpmethod='GET',baseurl='https://api.twitter.com/1.1/application/rate_limit_status.json',par1='&resources=statuses')

fromJSON(getURL(paste(kv$bu,'?','oauth_consumer_key=',oauth$consumerKey,'&oauth_nonce=',kv$nonce,'&oauth_signature=',kv$osign,'&oauth_signature_method=HMAC-SHA1&oauth_timestamp=',kv$timestamp,'&oauth_token=',oauth$accessToken,'&oauth_version=1.0',kv$p,sep='')))