Identifying cliques and clique comembers in R using the SNA package

The R package SNA provides a number of tools for analyzing social network data. This post reviews the function clique.census from the SNA package, and shows how it can be used to better understand the group structure among a list of network members.

Let’s start by creating a toy network. Say Henry, Sarah, Rick, Joe and Annie are all colleagues in a criminology department. Now imagine they’ve been asked to identify the most important people they collaborate with. The results of this fictitious effort are shown in the following edgelist.

edgelist <- matrix(c("Henry", "Sarah", "Henry", "Joe", "Sarah", "Joe", "Henry", "Sarah", "Henry", "Rick", "Sarah", "Rick", "Sarah", "Henry", "Sarah", "Rick", "Henry", "Rick", "Sarah", "Henry", "Sarah", "Joe", "Henry", "Joe", "Sarah", "Rick", "Sarah", "Annie", "Rick", "Annie"), ncol = 2)

Recall, a clique is a group where everyone in it “likes” everyone else. To identify cliques among our network of criminology researchers, we first transform it into a network object and then apply the SNA function clique.census.

library("network")

net tabulate.by.vertex=FALSE, enumerate=TRUE, clique.comembership="bysize") # Identify cliques
net_cc$clique.count

 

sna clique comembership 1

After applying the function clique.census, we see there were two cliques among our respondents, each involving three researchers.

To identify the comembers of these cliques, we inspect the contents of the variable net_cc$clique.comemb[3, , ].

To paraphrase the SNA documentation, the variable net_cc$clique.comemb is a three dimensional matrix of size max clique size x n x n. In this example we observed cliques involving only three network members each. As such, the 3-clique comembership information is stored in the variable net_cc$clique.comemb[3, , ]. (Note: if we observed one or more cliques with more than three members, say, a 4-clique, we could examine their comembership using the variable net_cc$clique.comemb[4, , ]).

Notably, the format by which net_cc$clique.comemb[3, , ] organizes clique comembership takes some getting used to. In fact, the main point of this post is to explain this organizational scheme in a more everyday kind of way.

Given our criminologist cliques, here’s how we find out who was in them. Recall, the largest clique we observed contained three individuals. Further recall that our network only contained five respondents. As such, the matrix net_cc$clique.comemb[3, , ] is of size 5 x 5. This matrix mimics the network structure itself. That is, network members are listed along the rows and, in the exact same order, listed again across the columns. The values within the matrix then identify researchers who were in cliques together.

sna clique comembership 2

Let’s go through a couple columns together to better understand what this matrix is telling us exactly.

The values in the column Joe show who Joe was in a clique with. Zero values indicate who Joe was not in a clique with, while values greater than zero indicate who he was in a clique with. More than just showing who Joe was in a clique with, these values identify the different cliques he was a part of. Reading the column from top to bottom, Joe and Annie were in zero cliques together. Next we see that Joe was in a clique with Henry. Notably, Joe was in one clique with himself (i.e., there was only one clique that involved Joe). The rest of the column values show that Joe was in zero cliques with Rick and one clique with Sarah. In total, this column tell us there was one 3-clique that involved Joe (see Joe’s value for Joe) and that this clique involved the researchers Henry and Sarah.

Let’s look at a trickier column. The values in the column for Henry ranged from 0 to 2. As with Joe, the values show who Henry was in a clique with and how many times they were in the same clique. Going down the column, we see a zero value for Henry and Annie. That is, Henry and Annie were not part of the same clique. Notably, Henry is in two cliques with himself. That is, there were two cliques, each of which involved Henry. The rest of the values show that Henry was in one clique with Joe, another clique with Rick and two cliques with Sarah.

Combined, these column values tell us there were two 3-cliques: one clique involving Henry, Sarah and Joe and another involving Henry, Sarah and Rick.

Advertisements

Plot Network Data in R with iGraph

I recently had a conversation on Twitter about a plot I made a while back. Recall, the plot showed my Twitter network, my friends and my friend’s friends.

Here’s the Twitter thread:

And here’s the R code:

#### Load R libraries
library("iGraph")

#### Load edgelist
r <- read.csv(file="edgelist_friends.csv-03-25.csv",header=TRUE,stringsAsFactors=FALSE)[,-1]

#### Convert to graph object
gr <- graph.data.frame(r,directed=TRUE)

#### gr
# Describe graph
summary(gr)
ecount(gr) # Edge count
vcount(gr) # Node count
diameter(gr) # Network diameter
farthest.nodes(gr) # Nodes furthest apart
V(gr)$indegree = degree(gr,mode="in") # Calculate indegree

#### Plot graph
E(gr)$color = "gray"
E(gr)$width = .5
E(gr)$arrow.width = .25
V(gr)$label.color = "black"
V(gr)$color = "dodgerblue"
V(gr)$size = 4

set.seed(40134541)
l <- layout.fruchterman.reingold(gr)

pdf("network_friends_plot.pdf")
plot(gr,layout=l,rescale=TRUE,axes=FALSE,ylim=c(-1,1),asp=0,vertex.label=NA)
dev.off()

Create a dictionary of authors and author attributes and values for a journal article using the Scopus API and Python

As an exercise to brush up my Python skills, I decided to tinker around with the Scopus API. Scopus is an online database maintained by Elsevier that records and provides access to information about peer reviewed publications. Not only does Scopus let users search for journal articles based on key words and various other criteria, but the web services also allows users to explore these articles as networks of articles, authors, institutions, and so forth. If you’re interested in risk factors that lead to scholarly publications, publication citations, or impact factors, this is a place to start.

The following code yields a dictionary of author information by requesting content through the abstract retrieval API. This request is made using the Python package requests and parsed using the package BeautifulSoup. Enjoy!

#### Import python packages
import requests
from bs4 import BeautifulSoup


#### Set API key
my_api_key = 'YoUr_ApI_kEy'


#### Abstract retrieval API
# API documentation at http://api.elsevier.com/documentation/AbstractRetrievalAPI.wadl
# Get article info using unique article ID
eid = '2-s2.0-84899659621'
url = 'http://api.elsevier.com/content/abstract/eid/' + eid

header = {'Accept' : 'application/xml',
          'X-ELS-APIKey' : my_api_key}

resp = requests.get(url, headers=header)

print 'API Response code:', resp.status_code # resp.status_code != 200 i.e. API response error

# Write response to file
#with open(eid, 'w') as f:
#    f.write(resp.text.encode('utf-8'))

soup = BeautifulSoup(resp.content.decode('utf-8','ignore'), 'lxml')

soup_author_groups = soup.find_all('author-group')

print 'Number author groups:', len(soup_author_groups)

author_dict = {}

# Traverse author groups
for i in soup_author_groups:

    # Traverse authors within author groups
    for j in i.find_all('author'):

        author_dict.update({j.attrs['auid']:j.attrs}) # Return dictionary of attributes
      
        j.contents.pop(-1) # Pop dicitonary of attributes
 
        # Traverse author contents within author
        for k in j.contents:

            author_dict[j.attrs['auid']].update({k.name : k.contents[0]})
            
print author_list

Plot and Highlight All Clique Triads in VISONE

snaCliques

Description

This post describes how to identify group structures among a network of respondents in VISONE. For a network of selections we identify any cliques involving three or more members. A clique is defined as a group containing three or more members where everyone has chosen everyone else.

Identify all triads

A tried is a network structure containing exactly three members. There are many types of triads. A group of three members where everyone chooses everyone else is a triad (i.e., a clique). A group of three members where two people choose each other and nobody choose the third member is another type of triad. There are 16 unique ways three people can select each other.

Identify all triads. Click the ‘analysis’ tab. Next to ‘task’, select ‘grouping’ from the drop down list of available options. Select ‘cohesiveness’ from the drop down list next to ‘class’. Select the option ‘triad census’ next to ‘measure’. Click ‘analyze’.

Highlight all cliques

Highlight all cliques. Click an empty part of the graph. Press the keys ‘Ctrl’ and ‘a’. Open the attribute manager. Click the ‘link’ button. Click the ‘filter’ button. Select ‘default value’ from the first drop down list. Select ‘triadType300’ from the second drop down list. Select ‘has individual value’ from the third drop down list. Click the radial button ‘replace’. Click ‘select’. Click ‘close’.

From the main VISONE drop down bar, select ‘links’. Click ‘properties’. Click the given color next to ‘color:’. Select ‘rgb’ tab. Set the ‘red’, ‘green’, and ‘blue’ values to 0. Set the ‘alpha’ value to 255. Set ‘opacity’ to 50%. Click the ‘close’ button. Set the ‘width:’ value to 5.0. From the ‘edge properties’ dialogue box, click the ‘apply’ button. Click ‘close.

Reduce visibility of all non-clique selections. Select all nodes and links. Click an empty part of the graph. Press the keys ‘Ctrl’ and ‘a’. Open the attribute manager. Click the ‘link’ button. Click the ‘filter’ button. Select ‘default value’ from the first drop down list. Select ‘triadType300’ from the second drop down list. Select ‘has individual value’ from the third drop down list. Click the radial button ‘remove’. Click ‘select’. Click ‘close’.

From the main VISONE drop down bar, select ‘links’. Click ‘properties’. Click the given color next to ‘color:’. Set ‘opacity’ to 20%. Set the ‘width:’ value to 2.0. From the ‘edge properties’ dialogue box, click the ‘apply’ button. Click ‘close.

How to extract a network subgraph using R

In a previous post I wrote about highlighting a subgraph of a larger network graph. In response to this post, I was asked how extract a subgraph from a larger graph while retaining all essential characteristics among the extracted nodes.

Vinay wrote:

Dear Will,
The code is well written and only highlights the members of a subgraph. I need to fetch them out from the main graph as a separate subgraph (including nodes and edges). Any suggestions please.

Thanks.

Extract subgraph
For a given list of subgraph members, we can extract their essential characteristics (i.e., tie structure and attributes) from a larger graph using the iGraph function induced.subgraph(). For instance,

library(igraph)                   # Load R packages

set.seed(654654)                  # Set seed value, for reproducibility
g <- graph.ring(10)               # Generate random graph object
E(g)$label <- runif(10,0,1)       # Add an edge attribute

# Plot graph
png('graph.png')
par(mar=c(0,0,0,0))
plot.igraph(g)
dev.off()

g2 <- induced.subgraph(g, 1:7)    # Extract subgraph

# Plot subgraph
png('subgraph.png')
par(mar=c(0,0,0,0))
plot.igraph(g2)
dev.off()

Graph
graph

Subgraph
subgraph

Download Twitter Data using JSON in R

Here we consider the task of downloading Twitter data using the R software package RJSONIO.

Screen Shot 2013-05-25 at 6.02.01 PM

Before we can download Twitter data, we’ll need to prove to Twitter that we are in fact authorized to do so. I refer the interested reader to the post Twitter OAuth FAQ for instructions on how to setup an application with dev.twitter.com. Once we’ve setup an application with Twitter we can write some R code to communicate with Twitter about our application and get the data we want. Code from the post Authorize a Twitter Data request in R, specifically the keyValues() function, will be used in this post to handle our authentication needs when requesting data from Twitter.

## Install R packages
install.packages('bitops')
install.packages('digest')
install.packages('RCurl')
install.packages('ROAuth')
install.packages('RJSONIO')


## Load R packages
library('bitops')
library('digest')
library('RCurl')
library('ROAuth')
library('RJSONIO')
library('plyr')


## Set decimal precision
options(digits=22)


## OAuth application values
oauth <- data.frame(consumerKey='YoUrCoNsUmErKeY',consumerSecret='YoUrCoNsUmErSeCrEt',accessToken='YoUrAcCeSsToKeN',accessTokenSecret='YoUrAcCeSsToKeNsEcReT')

keyValues <- function(httpmethod,baseurl,par1a,par1b){  	
# Generate a random string of letters and numbers
string <- paste(sample(c(letters[1:26],0:9),size=32,replace=T),collapse='') # Generate random string of alphanumeric characters
string2 <- base64(string,encode=TRUE,mode='character') # Convert string to base64
nonce <- gsub('[^a-zA-Z0-9]','',string2,perl=TRUE) # Remove non-alphanumeric characters
 
# Get the current GMT system time in seconds 
timestamp <- as.character(floor(as.numeric(as.POSIXct(Sys.time(),tz='GMT'))))
 
# Percent encode parameters 1
#par1 <- '&resources=statuses'
par2a <- gsub(',','%2C',par1a,perl=TRUE) # Percent encode par
par2b <- gsub(',','%2C',par1b,perl=TRUE) # Percent encode par
 
# Percent ecode parameters 2
# Order the key/value pairs by the first letter of each key
ps <- paste(par2a,'oauth_consumer_key=',oauth$consumerKey,'&oauth_nonce=',nonce[1],'&oauth_signature_method=HMAC-SHA1&oauth_timestamp=',timestamp,'&oauth_token=',oauth$accessToken,'&oauth_version=1.0',par2b,sep='')
ps2 <- gsub('%','%25',ps,perl=TRUE) 
ps3 <- gsub('&','%26',ps2,perl=TRUE)
ps4 <- gsub('=','%3D',ps3,perl=TRUE)
 
# Percent encode parameters 3
url1 <- baseurl
url2 <- gsub(':','%3A',url1,perl=TRUE) 
url3 <- gsub('/','%2F',url2,perl=TRUE) 
 
# Create signature base string
signBaseString <- paste(httpmethod,'&',url3,'&',ps4,sep='') 
 
# Create signing key
signKey <- paste(oauth$consumerSecret,'&',oauth$accessTokenSecret,sep='')
 
# oauth_signature
osign <- hmac(key=signKey,object=signBaseString,algo='sha1',serialize=FALSE,raw=TRUE)
osign641 <- base64(osign,encode=TRUE,mode='character')
osign642 <- gsub('/','%2F',osign641,perl=TRUE)
osign643 <- gsub('=','%3D',osign642,perl=TRUE)
osign644 <- gsub('[+]','%2B',osign643,perl=TRUE)
 
return(data.frame(hm=httpmethod,bu=baseurl,p=paste(par1a,par1b,sep=''),nonce=nonce[1],timestamp=timestamp,osign=osign644[1]))
}

Next, we need to figure out what kind of Twitter data we want to download. The Twitter REST API v1.1 Resources site provides a useful outline of what kind of data we can get from Twitter. Just read what is written under the Description sections. As an example, let’s download some user tweets. To do this, we find and consult the specific Resource on the REST API v1.1 page that corresponds with the action we want, here GET statuses/user_timeline. The resource page lists and describes the download options available to the task of getting tweets from a specific user, the thing we want to do, so it’s worth it to the reader to check it out.

Here we download the 100 most recent tweets (and re-tweets) made by the user ‘Reuters’.

## Download user tweets
# Limited to latest 200 tweets
# Specify user name
user <- 'Reuters'
 
kv <- keyValues(httpmethod='GET',baseurl='https://api.twitter.com/1.1/statuses/user_timeline.json',par1a='count=100&include_rts=1&',par1b=paste('&screen_name=',user,sep=''))
 
theData1 <- fromJSON(getURL(paste(kv$bu,'?','oauth_consumer_key=',oauth$consumerKey,'&oauth_nonce=',kv$nonce,'&oauth_signature=',kv$osign,'&oauth_signature_method=HMAC-SHA1&oauth_timestamp=',kv$timestamp,'&oauth_token=',oauth$accessToken,'&oauth_version=1.0','&',kv$p,sep='')))

At this point in the post you should have the 100 most recent tweets made by the user ‘Reuters’ as well as values on several variables recorded by Twitter on each tweet. These are stored in a list data structure. You are now free to do list things to these data to explore what it is you have.

For instance, let’s see the tweets.

theData2 <- unlist(theData1)
names(theData2)
tweets <- theData2[names(theData2)=='text']