Omit outdated records after adding amended records

In juvenile court, judges often have the power to amend the charges applied to a case. A charge of “INTENT TO DISTRIBUTE” may, for instance, be reduced to something less severe, say “POSSESSION OF PARAPHERNALIA.” When this occurs, a new charge is added to the case record and, in some cases, the old charge is retained. This short post presents R code to omit outdated charges after amended charges have been added.

Data for this example consist of charges applied to a single court record. These data are provided below:

charge=c("Count 1", "Count 2", "Amended", "Count 3", "Amended")
section=c("21-5807(a)(3)", "21-5807(a)(3)", "21-5801(b)(4)", "21-5807(a)(3)", "21-5801(b)(4)")
date=c("09/13/15", "09/20/15", "04/04/16", "10/03/15", "04/04/16")
title=c("BURGLARY OF MOTOR VEHICLE", "BURGLARY OF MOTOR VEHICLE", "THEFT", "BURGLARY OF MOTOR VEHICLE", "THEFT")
acs=c("", "", "", "", "")
drug=c("", "", "", "", "")
pl=c("", "", "", "", "")
finding=c("DISMISS BY PROS", "", "DISMISS BY PROS", "", "DISMISS BY PROS")
tp=c("F", "F", "M", "F", "M")
lvl=c("9", "9", "A", "9", "A")
pn=c("N", "N", "N", "N", "N")
sentence_date=c("", "", "04/04/2016", "", "04/04/2016")

df <- data.frame(charge, section, date, title, acs, drug, pl, finding, tp, lvl, pn, sentence_date, stringsAsFactors=FALSE)

The goal here is to remove every charge that is followed by an amended charge. The following dataset illustrates the desired result:

charge=c("Count 1", "Amended", "Amended")
section=c("21-5807(a)(3)", "21-5801(b)(4)", "21-5801(b)(4)")
date=c("09/13/15", "04/04/16", "04/04/16")
title=c("BURGLARY OF MOTOR VEHICLE", "THEFT", "THEFT")
acs=c("", "", "")
drug=c("", "", "")
pl=c("", "", "")
finding=c("DISMISS BY PROS", "DISMISS BY PROS", "DISMISS BY PROS")
tp=c("F", "M", "M")
lvl=c("9", "A", "A")
pn=c("N", "N", "N")
sentence_date=c("", "04/04/2016", "04/04/2016")

df <- data.frame(charge, section, date, title, acs, drug, pl, finding, tp, lvl, pn, sentence_date, stringsAsFactors=FALSE)

One way to produce this new dataset is to identify every charge that is immediately followed by an amended charge and then extract all other charges.

First, the positions of the updated charges are identified. The following expression identifies the updated charges as TRUE and all other charges as FALSE.

c(df$charge=="Amended")

Second, the positions of the updated charges are used to locate the original charges that were amended. This relies on a pattern in the data: an updated charge always immediately follows its original charge. Since the updated charges were marked TRUE (and all other charges FALSE), shifting these values one position earlier flags each charge whose next row is an amended charge, i.e., the originals. Dropping the first element and appending it to the end keeps the vector the same length; because a record never begins with an amended charge, the appended value is FALSE.

c(c(df$charge=="Amended")[-1], c(df$charge=="Amended")[1])
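Evaluated against the example record above, the original expression and its shifted version return the following logical vectors (expected output shown as comments):

c(df$charge=="Amended")
# [1] FALSE FALSE  TRUE FALSE  TRUE

c(c(df$charge=="Amended")[-1], c(df$charge=="Amended")[1])
# [1] FALSE  TRUE FALSE  TRUE FALSE

The TRUE values in the shifted vector mark Count 2 and Count 3, the two charges that were later amended.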

Finally, the positions of the original charges (i.e., the charges followed by the amended ones) are used to extract the desired dataset. The desired dataset consists of every charge not followed by an updated charge.

df[!c(c(df$charge=="Amended")[-1], c(df$charge=="Amended")[1]),]
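For readers who prefer the tidyverse, the same filtering can be written with dplyr’s lead() function. This is a minimal sketch of an alternative to the base R approach above; it assumes the dplyr package is installed and is not the method used in the rest of this post.

#### Alternative using dplyr (sketch)
library("dplyr")

# Keep every row that is not immediately followed by an "Amended" charge
df_new <- df %>% filter(!lead(charge=="Amended", default=FALSE))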

Enroll in SOCIO460 Youth and Crime!

Looking for an engaging course in criminology this summer? Interested in the theories and facts of youth crime and crime control? Want to experience the youth justice field firsthand? Then have I got a course for you!

SOCIO460 Youth and Crime offers theories, facts and firsthand experiences with the youth justice field. Through this course you will:

  • Learn facts and explore trends in delinquency, status offenses and victimization.
  • Explore theories on crime and crime control, everything from rational choice to intersectional feminist thought.
  • Discuss pressing social issues with juvenile court and care professionals.
  • Sift through juvenile court case records and examine the workings of a Kansas juvenile court.

Special guests include: Judge Delia M. York, Evan Mitchel (LMSW), ISO Kristy Blagg and Kristin B. Kelly (MA).


Print friendly flyer: socio460 flyer

Plot Network Data in R with iGraph

I recently had a conversation on Twitter about a plot I made a while back. The plot showed my Twitter network: my friends and my friends’ friends.

Here’s the R code:

#### Load R libraries
library("iGraph")

#### Load edgelist
r <- read.csv(file="edgelist_friends.csv-03-25.csv",header=TRUE,stringsAsFactors=FALSE)[,-1]

#### Convert to graph object
gr <- graph.data.frame(r,directed=TRUE)

#### Describe graph
summary(gr)
ecount(gr) # Edge count
vcount(gr) # Node count
diameter(gr) # Network diameter
farthest.nodes(gr) # Nodes furthest apart
V(gr)$indegree = degree(gr,mode="in") # Calculate indegree

#### Plot graph
E(gr)$color = "gray"
E(gr)$width = .5
E(gr)$arrow.width = .25
V(gr)$label.color = "black"
V(gr)$color = "dodgerblue"
V(gr)$size = 4

set.seed(40134541)
l <- layout.fruchterman.reingold(gr)

pdf("network_friends_plot.pdf")
plot(gr,layout=l,rescale=TRUE,axes=FALSE,ylim=c(-1,1),asp=0,vertex.label=NA)
dev.off()
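The indegree calculated above isn’t actually used in the plot itself. As a minimal sketch (reusing the gr object and layout l from above, and picking an arbitrary scaling), node size could be mapped to indegree so that accounts with more incoming ties appear larger:

#### Optional: scale node size by indegree (sketch)
V(gr)$size = 2 + 2*log1p(V(gr)$indegree) # larger nodes for accounts with more incoming ties

pdf("network_friends_plot_indegree.pdf")
plot(gr,layout=l,rescale=TRUE,axes=FALSE,ylim=c(-1,1),asp=0,vertex.label=NA)
dev.off()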

Create a dictionary of authors and author attributes and values for a journal article using the Scopus API and Python

As an exercise to brush up my Python skills, I decided to tinker around with the Scopus API. Scopus is an online database maintained by Elsevier that records and provides access to information about peer-reviewed publications. Not only does Scopus let users search for journal articles based on keywords and various other criteria, but the web service also allows users to explore these articles as networks of articles, authors, institutions, and so forth. If you’re interested in risk factors that lead to scholarly publications, publication citations, or impact factors, this is a place to start.

The following code yields a dictionary of author information by requesting content through the abstract retrieval API. The request is made using the Python package requests and the response is parsed using the package BeautifulSoup. Enjoy!

#### Import python packages
import requests
from bs4 import BeautifulSoup


#### Set API key
my_api_key = 'YoUr_ApI_kEy'


#### Abstract retrieval API
# API documentation at http://api.elsevier.com/documentation/AbstractRetrievalAPI.wadl
# Get article info using unique article ID
eid = '2-s2.0-84899659621'
url = 'http://api.elsevier.com/content/abstract/eid/' + eid

header = {'Accept' : 'application/xml',
          'X-ELS-APIKey' : my_api_key}

resp = requests.get(url, headers=header)

print('API Response code:', resp.status_code) # a status code other than 200 indicates an API response error

# Write response to file
#with open(eid, 'w') as f:
#    f.write(resp.text.encode('utf-8'))

soup = BeautifulSoup(resp.content.decode('utf-8','ignore'), 'lxml')

soup_author_groups = soup.find_all('author-group')

print('Number of author groups:', len(soup_author_groups))

author_dict = {}

# Traverse author groups
for i in soup_author_groups:

    # Traverse authors within author groups
    for j in i.find_all('author'):

        author_dict.update({j.attrs['auid']: j.attrs}) # Add the author tag's attributes to the dictionary

        j.contents.pop(-1) # Remove the last child element of the author tag
 
        # Traverse author contents within author
        for k in j.contents:

            author_dict[j.attrs['auid']].update({k.name : k.contents[0]})
            
print(author_dict)

Jitter scatterplot value positions with value labels in R using ggplot2

The following R code creates a scatterplot using ggplot2. Points on the plot are represented by identification numbers, and the jitter option reduces overlap between these plotted values.

#### Attach R libraries
library("ggplot2")


#### Generate random data set 
theData <- data.frame(id=1:20, xVar=sample(1:4, 20, replace=TRUE), yVar=sample(1:4, 20, replace=TRUE))


#### Plot scatterplot
set.seed(seed=658672)
p <- ggplot(theData)
p + theme_bw() + geom_text(aes(x=xVar,y=yVar,label=id),size=3,position=position_jitter(w=.2, h=.2))
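One caveat: because the jitter is random, the exact label positions change from run to run unless a seed is set. Newer versions of ggplot2 also let you pass a seed directly to position_jitter(), which keeps the jitter reproducible without a separate set.seed() call. A minimal sketch, assuming a ggplot2 version that supports the seed argument:

#### Reproducible jitter via the seed argument (requires a recent ggplot2)
p + theme_bw() + geom_text(aes(x=xVar,y=yVar,label=id),size=3,position=position_jitter(w=.2, h=.2, seed=658672))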

“One Weird Trick” to Recover Suppressed Counts from CDC’s WONDER

The Centers for Disease Control and Prevention’s Wide-ranging Online Data for Epidemiological Research (WONDER) data retrieval system provides access to many types of public health information. Mortality and fertility counts for multiple years across standard geographical subdivisions, broken down by race, gender, 5-year age groups, etc., are just some of the data available through this system. Though a vast array of data are available through WONDER, counts that fall between 0 and 9 are replaced with the word “Suppressed,” generating missing values and hindering research agendas. This post describes a new method to recover some of these suppressed counts.

Method

It’s really quite simple to recover many counts suppressed by CDC’s WONDER; it just takes, as the spam advertisements claim, this “one weird trick.” And, as expected, the trick involves algebra any 5th grader could do:

what you want = a lot of what they have – (a lot of what they have – what you want)

Infant Mortality Example

As an example, let’s query infant mortality rates for all US counties in 2013 using the default method. Go to CDC’s WONDER home page and click the link Multiple cause of death (Detailed Mortality). Click the Data Request link in the Current Multiple Cause of Death Data section. Scroll down and click the “I Agree” button to agree to the terms and conditions for accessing these data. Most of the default settings are fine for this data request, but let’s select the following additional options:

  • From the Organize table layout section, select “County” from the And By menu
  • From the Select demographics section, click the radio button next to Single-Year Ages
  • From the Select demographics section, select “< 1 year” from the Pick between list
  • From the Select year and month section, select “+ 2013” from the Year/Month list
  • From the Other options section, check the box next to Show Zero Values
  • From the Other options section, check the box next to Show Suppressed Values
  • From the Other options section, select “4” from the Precision menu

Click the Send button and CDC’s WONDER will return infant mortality rates for US counties in 2013, but not for all of them: most of the counts, as you’ll notice, are suppressed. Taking counties as our unit of analysis gives us a response rate of approximately 14.86%. Figure 1 shows a map of these data.

Figure 1: Infant Mortality Rates among US Counties in 2013 (n=3142)


In our effort to get more counts, we repeat the same steps taken to construct Figure 1, but this time we select all available years: “+ 1999”, “+ 2000”, …, “+ 2013”. This action gives us the “a lot of what they have” part of the equation. Figure 2 shows infant mortality rates for all US counties over the years 1999 to 2013. The response rate among these counties is about 81.51%.

Figure 2: Infant Mortality Rates among US Counties years 1999 to 2013 (n=3142)


Recall, the “one weird trick” involves subtracting the “(a lot of what they have – what you want)” part from the “a lot of what they have” part, which we got in the construction of Figure 2. To get the “(a lot of what they have – what you want)” piece of the equation, we accept the default settings in WONDER with the following exceptions:

  • From the Organize table layout section, select “County” from the And By menu
  • From the Select demographics section, click the radio button next to Single-Year Ages
  • From the Select demographics section, select “< 1 year” from the Pick between list
  • From the Select year and month section, select “+ 1999”, “+ 2000”, …, “+ 2012” from the Year/Month list
  • From the Other options section, check the box next to Show Zero Values
  • From the Other options section, check the box next to Show Suppressed Values
  • From the Other options section, select “4” from the Precision menu

Figure 3 shows infant mortality rates for all US counties over the years 1999 to 2012. The response rate among these counties is about 80.43%.

Figure 3: Infant Mortality Rates among US Counties years 1999 to 2012 (n=3142)


To recover suppressed infant mortality counts among US counties in 2013, all we need to do is subtract the counts used to construct Figure 3 from those used to construct Figure 2. This method can greatly improve upon the naive approach and yield a higher response rate (80.43% as compared to 14.86%). Due to data use restrictions, no actual differences were taken in the development and presentation of this method; it is assumed the recovered 2013 response rate would match the response rate of the counties across the years 1999 to 2012, though the actual rate has not been calculated. Because of these restrictions, we do not plot or present a map of recovered infant mortality counts among US counties in 2013.
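To make the subtraction concrete, here is a minimal R sketch of the differencing step. The file names (wonder_1999_2013.txt, wonder_1999_2012.txt) and column names (County.Code, Deaths) are hypothetical placeholders; actual WONDER exports will differ depending on how the request is configured.

#### Sketch: recover 2013 counts by differencing two WONDER exports (hypothetical names)
wide <- read.delim("wonder_1999_2013.txt", stringsAsFactors=FALSE)   # counts for 1999-2013, "a lot of what they have"
narrow <- read.delim("wonder_1999_2012.txt", stringsAsFactors=FALSE) # counts for 1999-2012

m <- merge(wide, narrow, by="County.Code", suffixes=c(".9913",".9912"))

# Suppressed cells arrive as text ("Suppressed"); coercing to numeric turns them into NA
m$deaths9913 <- as.numeric(m$Deaths.9913)
m$deaths9912 <- as.numeric(m$Deaths.9912)

m$deaths2013 <- m$deaths9913 - m$deaths9912 # "what you want"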

Confidentiality and Data Use Restrictions

The CDC, in case you’re wondering, suppresses all counts between 0 and 9 to ensure confidentiality and protect personal privacy (for more on this, see Assurance of Confidentiality). Recall from the terms and conditions that it is against the law to use these data in certain ways. Things not done in this post include:

  • “present or publish death counts of 9 or fewer or death rates based on counts nine or fewer (in figures, graphs, maps, tables, etc.)”
  • “attempt to learn the identity of any person or establishment included in these data”
  • Disclose or make “other use of the identity of any person or establishment discovered inadvertently”

The method described and data used in this post are provided to support “health statistical reporting and analysis only.”

Convert Qualitative Codes into a Binary Response Matrix in R

Content analysis is a qualitative method for identifying themes among a collection of documents. The themes themselves are either derived from the content reviewed or specified a priori according to some established theoretical perspective or set of research questions. Documents are read, content is considered, and themes (represented as letter “codes”) are applied. It’s not uncommon for documents to exhibit multiple themes. In this way, results from a content analysis are not unlike responses to the “select all that apply” questions found in survey research. Once a set of documents is coded, it’s often of interest to know the proportion of times the codes were observed.

The following R code transforms codes on a set of documents, stored as a list of lists, into a binary matrix.

#### Load R packages
library("XLConnect")
library("stringr")
library("vcd")


#### Working directory
getwd()
setwd("C:/Users/chernoff/Desktop")


#### Read data
theData <- readWorksheet(loadWorkbook("dummyData.xlsx"),sheet="Sheet1",header=TRUE)


#### Parse codes
theData2 <- str_extract_all(theData$codes,"[a-zA-Z]")

codeList <- unique(unlist(theData2))

theData2 <- lapply(theData2, function(x)
  as.data.frame(matrix(as.numeric(codeList %in% x),
                       ncol=length(codeList),
                       dimnames=list(NULL, codeList))))
theData3 <- do.call(rbind,theData2)

theData4 <- cbind(theData,theData3)

If we print the data to the screen, we see themes are represented as binary variables, where 1 indicates a theme was observed and 0 indicates it was not.


Once the data are organized as a binary matrix, we can calculate column totals with colSums(theData4[,codeList]) to see which themes were most popular and which were least popular.
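Because the interest is often in proportions rather than raw totals, the counts can also be divided by the number of documents. With a 0/1 matrix, colMeans() gives this directly:

#### Proportion of documents exhibiting each theme
colMeans(theData4[,codeList]) # equivalent to colSums(theData4[,codeList])/nrow(theData4)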


And lastly, if we want to get fancy, we can represent the data using a mosaic plot.

totals <- table(theData4$a,theData4$b,theData4$c,dnn=c("A","B","C"))

png("mosaicPlot.png")
mosaicplot(totals, sort=3:1, color=hcl(c(120,10)), main="Mosaic Plot")
dev.off()


A mosaic plot shows the relative proportion of each theme compared to one or more of the other themes. The two main rows show the levels of theme B, the two main columns represent the levels of theme C, and the two columns nested within each main column represent the levels of theme A. By default, the label for theme A is not shown. The cell in the upper left-hand corner, i.e. cell (1,1), shows there were some, but not many, documents without any themes. Cells (1,3) and (1,4) show there were as many documents with theme C alone as there were with themes A and C combined. The remaining cell in the first row, (1,2), shows there were more documents pertaining solely to theme A than any of the other document types not containing theme B. Interpretations of the remaining rectangles follow similarly.
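Finally, the vcd package attached at the top of the script isn’t actually used above. Its mosaic() function offers an alternative to base R’s mosaicplot(), including optional residual shading. A minimal sketch, reusing the totals table built above:

#### Alternative: mosaic plot via vcd (sketch)
library("vcd")

png("mosaicPlotVcd.png")
mosaic(totals, shade=TRUE, main="Mosaic Plot")
dev.off()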