Convert Qualitative Codes into a Binary Response Matrix in R

Content analysis is a qualitative method for identifying themes among a collection of documents. The themes themselves are either derived from the content reviewed or specified a priori according to some established theoretical perspective or set of research questions. Documents are read, content is considered, and themes (represented as letter “codes”) are applied. It’s not uncommon for documents to exhibit multiple themes. In this way, results from a content analysis are not unlike responses to the “select all that apply” type questions found in survey research. Once a set of documents is coded, it’s often of interest to know the proportion of times each code was observed.

The following R code transforms codes on a set of documents, stored as a list of lists, into a binary matrix.

#### Load R packages
library(XLConnect)  # loadWorkbook(), readWorksheet()
library(stringr)    # str_extract_all()

#### Working directory

#### Read data
theData <- readWorksheet(loadWorkbook("dummyData.xlsx"),sheet="Sheet1",header=TRUE)

#### Parse codes
theData2 <- str_extract_all(theData$codes,"[a-zA-Z]")

codeList <- unique(unlist(theData2))

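# Turn each document's codes into a one-row 0/1 matrix with one column per code
theData2 <- lapply(X=theData2,function(X) matrix(as.numeric(codeList %in% X),ncol=length(codeList),dimnames=list(c(),codeList)))
# Stack the one-row matrices into a single documents-by-codes matrix
theData3 <- do.call(rbind,theData2)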

theData4 <- cbind(theData,theData3)
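
Because dummyData.xlsx is not reproduced here, the steps above can be tried on a small made-up data set. The following is a minimal sketch, not the original workflow: the `docs` data frame and its contents are hypothetical, and base R's regmatches() and gregexpr() stand in for str_extract_all() so that no extra packages are needed.

```r
# Hypothetical coded documents; the real data come from dummyData.xlsx
docs <- data.frame(id = 1:4,
                   codes = c("a, b", "c", "", "a, b, c"),
                   stringsAsFactors = FALSE)

# Base-R stand-in for str_extract_all(): one character vector of codes per document
parsed <- regmatches(docs$codes, gregexpr("[a-zA-Z]", docs$codes))

# Sorted so the column order does not depend on which document comes first
codeList <- sort(unique(unlist(parsed)))

# One row per document, one 0/1 column per code
binary <- do.call(rbind,
                  lapply(parsed, function(x) as.integer(codeList %in% x)))
colnames(binary) <- codeList

# Attach the indicator columns to the original data
result <- cbind(docs, binary)
print(result)
```

A document with no codes (the third one here) simply gets a row of zeros, which is what lets it show up later as a "no themes" cell.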

If we print the data to the screen, we see themes are represented as binary variables, where 1 indicates a theme was observed and 0 indicates it was not.


Once the data are organized as a binary matrix, we can calculate column totals with colSums(theData4[,codeList]) to see which themes were most popular and which were least popular.
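
Since the stated goal is often the proportion of documents showing each code, column means are a natural companion to column totals. The sketch below assumes a small hypothetical matrix named `binary`; it is not the author's data.

```r
# Hypothetical binary matrix: rows are documents, columns are themes
binary <- matrix(c(1, 1, 0,
                   1, 0, 1,
                   0, 0, 0,
                   1, 1, 1),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(NULL, c("a", "b", "c")))

counts <- colSums(binary)        # number of documents showing each theme
proportions <- colMeans(binary)  # share of documents showing each theme

# Most frequent theme first
print(sort(proportions, decreasing = TRUE))
```

Because every entry is 0 or 1, the mean of a column is exactly the fraction of documents in which that theme was observed.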


And lastly, if we want to get fancy, we can represent the data using a mosaic plot.

totals <- table(theData4$a,theData4$b,theData4$c,dnn=c("A","B","C"))

mosaicplot(totals,sort=3:1,col=hcl(c(120,10)),main="Mosaic Plot")


A mosaic plot shows the relative proportion of each theme compared to one or more of the other themes. The two main rows show the levels of theme B. The two main columns represent theme C’s levels. And the two columns within each main column represent the levels of A. By default, the label for theme A is not shown. The cell in the upper left-hand corner, i.e., cell (1,1), shows there were some but not many documents without any themes. Cells (1,3) and (1,4) show there were as many documents with theme C alone as there were with themes A and C combined. The remaining cell in this first row (1,2) shows there were more documents pertaining solely to theme A than all other document types not containing theme B. Interpretations of the remaining rectangles follow similarly.
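
The counts behind each rectangle can also be read directly from the cross-tabulation, which may help when checking an interpretation of the plot. This is a sketch on hypothetical data (the `themes` data frame is made up for illustration), using ftable() from base R's stats package to flatten the three-way table.

```r
# Hypothetical 0/1 theme indicators for eight documents
themes <- data.frame(a = c(0, 1, 1, 0, 1, 0, 1, 0),
                     b = c(0, 0, 1, 1, 0, 0, 1, 0),
                     c = c(0, 0, 0, 1, 1, 1, 1, 0))

# Same three-way table that feeds mosaicplot()
totals <- table(themes$a, themes$b, themes$c, dnn = c("A", "B", "C"))

# Flatten the table so every combination of A, B and C has its own cell
print(ftable(totals))

# Number of documents with no themes at all (A = 0, B = 0, C = 0)
noThemes <- totals["0", "0", "0"]
```

Each cell of the flattened table corresponds to one rectangle in the mosaic plot, with the rectangle's area proportional to the count.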