“One Weird Trick” to Recover Suppressed Counts from CDC’s WONDER

The Centers for Disease Control and Prevention’s Wide-ranging Online Data for Epidemiologic Research (WONDER) data retrieval system provides access to many types of public health information. Mortality and fertility counts for multiple years across standard geographic subdivisions, broken down by race, gender, 5-year age groups, and more, are just some of the data available through this system. Though a vast array of data are available through WONDER, counts that fall between 1 and 9 are replaced with the word “Suppressed,” generating missing values and hindering research agendas. This post describes a new method to recover some of these suppressed counts.


Recovering many of the counts suppressed by CDC’s WONDER is really quite simple; it just takes, as the spam advertisements claim, this “one weird trick.” And, as expected, the trick involves algebra any 5th grader could do:

what you want = a lot of what they have – (a lot of what they have – what you want)

Infant Mortality Example

As an example, let’s query infant mortality rates for all US counties in 2013 using the default method. Go to CDC’s WONDER home page and click the link Multiple cause of death (Detailed Mortality). Click the Data Request link in the Current Multiple Cause of Death Data section. Scroll down and click the “I Agree” button to accept the terms and conditions for accessing these data. Most of the default settings are fine for this data request, but let’s select the following additional options:

  • From the Organize table layout section, select “County” from the And By menu
  • From the Select demographics section, click the radio button next to Single-Year Ages
  • From the Select demographics section, select “< 1 year” from the Pick between list
  • From the Select year and month section, select “+ 2013” from the Year/Month list
  • From the Other options section, check the box next to Show Zero Values
  • From the Other options section, check the box next to Show Suppressed Values
  • From the Other options section, select “4” from the Precision menu

Click the Send button and CDC’s WONDER will return infant mortality rates for US counties in 2013, though not for all of them. As you’ll notice, most of the counts returned are suppressed. Taking counties as our unit of analysis gives a response rate of approximately 14.86%. Figure 1 shows a map of these data.

Figure 1: Infant Mortality Rates among US Counties in 2013 (n=3142)


In our effort to get more counts, we repeat the same steps taken to construct Figure 1, but this time we select all available years: “+ 1999”, “+ 2000”, …, “+ 2013”. This action gives us the “a lot of what they have” part of the equation. Figure 2 shows infant mortality rates for all US counties over the years 1999 to 2013. The response rate among these counties is about 81.51%.

Figure 2: Infant Mortality Rates among US Counties years 1999 to 2013 (n=3142)


Recall, the “one weird trick” involves subtracting the “(a lot of what they have – what you want)” part from the “a lot of what they have” part, which we obtained in constructing Figure 2. To get the “(a lot of what they have – what you want)” piece of the equation, we accept the default settings in WONDER with the following exceptions:

  • From the Organize table layout section, select “County” from the And By menu
  • From the Select demographics section, click the radio button next to Single-Year Ages
  • From the Select demographics section, select “< 1 year” from the Pick between list
  • From the Select year and month section, select “+ 1999”, “+ 2000”, …, “+ 2012” from the Year/Month list
  • From the Other options section, check the box next to Show Zero Values
  • From the Other options section, check the box next to Show Suppressed Values
  • From the Other options section, select “4” from the Precision menu

Figure 3 shows infant mortality rates for all US counties over the years 1999 to 2012. The response rate among these counties is about 80.43%.

Figure 3: Infant Mortality Rates among US Counties years 1999 to 2012 (n=3142)


To recover suppressed infant mortality counts among US counties in 2013, all we need to do is subtract the counts used to construct Figure 3 from those used to construct Figure 2. This method can greatly improve on the naive approach, yielding a higher response rate (80.43% compared to 14.86%). Due to data use restrictions, no actual differences were taken in the development and presentation of this method; the recovered 2013 response rate is assumed to match the response rate across the years 1999 to 2012, but it has not been calculated. For the same reason, no map of recovered 2013 infant mortality counts is plotted or presented here.
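Under those same restrictions, the subtraction step can only be sketched. Assuming two hypothetical data frames exported from WONDER, one covering 1999 to 2013 and one covering 1999 to 2012, each with a county FIPS code and a cumulative death count (the column names and values here are made up for illustration), the recovery might look like:

```r
# Hypothetical WONDER exports: county FIPS codes plus cumulative death counts.
# 'wide' covers 1999-2013; 'narrow' covers 1999-2012. All values are invented.
wide   <- data.frame(fips = c("01001", "01003", "01005"),
                     deaths = c(25, 112, 17))
narrow <- data.frame(fips = c("01001", "01003", "01005"),
                     deaths = c(21, 104, 12))

# what you want = a lot of what they have - (a lot of what they have - what you want)
recovered <- merge(wide, narrow, by = "fips", suffixes = c(".9913", ".9912"))
recovered$deaths.2013 <- recovered$deaths.9913 - recovered$deaths.9912

recovered$deaths.2013   # 4 8 5
```

Counties suppressed in the single-year query but reported in both multi-year queries get their 2013 counts back this way; counties suppressed in either multi-year query stay missing.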

Confidentiality and Data Use Restrictions

The CDC, in case you’re wondering, suppresses all counts between 1 and 9 to ensure confidentiality and protect personal privacy (for more on this, see Assurance of Confidentiality). Recall from the terms and conditions that it is against the law to use these data in certain ways. Things not done in this post include:

  • “present or publish death counts of 9 or fewer or death rates based on counts nine or fewer (in figures, graphs, maps, tables, etc.)”
  • “attempt to learn the identity of any person or establishment included in these data”
  • Disclose or make “other use of the identity of any person or establishment discovered inadvertently”

The method described and data used in this post are provided to support “health statistical reporting and analysis only.”


Convert Qualitative Codes into a Binary Response Matrix in R

Content analysis is a qualitative method for identifying themes among a collection of documents. The themes themselves are either derived from the content reviewed or specified a priori according to some established theoretical perspective or set of research questions. Documents are read, content is considered, and themes (represented as letter “codes”) are applied. It’s not uncommon for documents to exhibit multiple themes. In this way, results from a content analysis are not unlike responses to the “select all that apply” questions found in survey research. Once a set of documents is coded, it’s often of interest to know the proportion of times each code was observed.

The following R code transforms codes on a set of documents, stored as a list of lists, into a binary matrix.

#### Load R packages
library(XLConnect)   # read Excel workbooks
library(stringr)     # string handling

#### Working directory
# setwd("path/to/your/project")

#### Read data
theData <- readWorksheet(loadWorkbook("dummyData.xlsx"), sheet = "Sheet1", header = TRUE)

#### Parse codes: pull out each single-letter code per document
theData2 <- str_extract_all(theData$codes, "[a-zA-Z]")

codeList <- unique(unlist(theData2))

#### Build one binary row per document, then stack the rows
theData2 <- lapply(X = theData2, function(X)
  as.data.frame(matrix(as.numeric(codeList %in% X),
                       ncol = length(codeList),
                       dimnames = list(c(), codeList))))
theData3 <- do.call(rbind, theData2)

#### Append the binary matrix to the original data
theData4 <- cbind(theData, theData3)

If we print the data to the screen, we see themes are represented as binary variables, where 1 indicates a theme was observed and 0 indicates it was not.


Once the data are organized as a binary matrix, we can calculate column totals, colSums(theData4[,codeList]), to see which themes were most popular and which were least popular.
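Because each theme is coded 0/1, colMeans() on the same columns gives the proportion of documents exhibiting each theme. A self-contained sketch with a made-up matrix (the theme names a, b, and c mirror those used below):

```r
# Toy binary response matrix: rows are documents, columns are themes a, b, c.
toy <- data.frame(a = c(1, 0, 1, 1),
                  b = c(0, 0, 1, 0),
                  c = c(1, 1, 0, 0))

colSums(toy)    # times each theme was observed:     a=3 b=1 c=2
colMeans(toy)   # proportion of documents per theme: a=0.75 b=0.25 c=0.50
```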


And lastly, if we want to get fancy, we can represent the data using a mosaic plot.

totals <- table(theData4$a, theData4$b, theData4$c, dnn = c("A", "B", "C"))

mosaicplot(totals, sort = 3:1, color = hcl(c(120, 10)), main = "Mosaic Plot")


A mosaic plot shows the relative proportion of each theme compared to one or more of the other themes. The main two rows show the levels of theme B. The main two columns represent theme C’s levels. And the two columns within the two main columns represent the levels of A. By default, the label for theme A is not shown. The cell in the upper left-hand corner, i.e. cell (1,1), shows there were some but not many documents without any themes. Cells (1,3) and (1,4) show there were an equal number of documents with theme C as there were that involved themes A and C combined. The remaining cell in this first row (1,2) shows there were more documents pertaining solely to theme A than all other document types not containing theme B. Interpretations of the remaining rectangles follow similarly.

Batch Convert HTML files to PDF with wkhtmltopdf in Mac OSX

The open source (LGPL) command line tool wkhtmltopdf can quickly and robustly transform HTML files to PDF. To transform a set of files, navigate to the desired folder and enter the following into the Terminal app:

for f in *F1F*.html; do wkhtmltopdf -g -s Letter --no-background "$f" "${f/_*_/_}.pdf";done

The syntax takes every HTML file in the folder whose name contains “F1F” and generates a PDF from it. The wkhtmltopdf options modify the files generated: -g sets grayscale, -s Letter sets the paper size, and --no-background omits background content. The parameter expansion "${f/_*_/_}" names each output file by replacing everything from the first underscore through the last underscore with a single underscore. For more options, check out the wkhtmltopdf auto-generated manual.

Plot and Highlight All Clique Triads in VISONE



This post describes how to identify group structures among a network of respondents in VISONE. For a network of selections, we identify any cliques involving three or more members, where a clique is defined as a group in which everyone has chosen everyone else.

Identify all triads

A triad is a network structure containing exactly three members. There are many types of triads. A group of three members where everyone chooses everyone else is one type of triad (i.e., a clique). A group of three members where two people choose each other and nobody chooses the third member is another type. In all, there are 16 unique ways three people can select one another.

Identify all triads. Click the ‘analysis’ tab. Next to ‘task’, select ‘grouping’ from the drop down list of available options. Select ‘cohesiveness’ from the drop down list next to ‘class’. Select the option ‘triad census’ next to ‘measure’. Click ‘analyze’.
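As an aside, the same 16-type triad census VISONE reports can be reproduced in R with the igraph package. A minimal sketch on a random directed graph (the graph here is only for illustration, not part of the VISONE workflow):

```r
library(igraph)

set.seed(1)
g <- erdos.renyi.game(10, 0.3, directed = TRUE)  # random directed graph

tc <- triad.census(g)  # counts of all 16 triad types; the last entry is the 300 clique
length(tc)             # 16
sum(tc)                # choose(10, 3) = 120; every unordered triple is counted once
```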

Highlight all cliques

Highlight all cliques. Click an empty part of the graph. Press the keys ‘Ctrl’ and ‘a’. Open the attribute manager. Click the ‘link’ button. Click the ‘filter’ button. Select ‘default value’ from the first drop down list. Select ‘triadType300’ from the second drop down list. Select ‘has individual value’ from the third drop down list. Click the radio button ‘replace’. Click ‘select’. Click ‘close’.

From the main VISONE drop down bar, select ‘links’. Click ‘properties’. Click the given color next to ‘color:’. Select the ‘rgb’ tab. Set the ‘red’, ‘green’, and ‘blue’ values to 0. Set the ‘alpha’ value to 255. Set ‘opacity’ to 50%. Click the ‘close’ button. Set the ‘width:’ value to 5.0. From the ‘edge properties’ dialogue box, click the ‘apply’ button. Click ‘close’.

Reduce the visibility of all non-clique selections. Select all nodes and links: click an empty part of the graph, then press the keys ‘Ctrl’ and ‘a’. Open the attribute manager. Click the ‘link’ button. Click the ‘filter’ button. Select ‘default value’ from the first drop down list. Select ‘triadType300’ from the second drop down list. Select ‘has individual value’ from the third drop down list. Click the radio button ‘remove’. Click ‘select’. Click ‘close’.

From the main VISONE drop down bar, select ‘links’. Click ‘properties’. Click the given color next to ‘color:’. Set ‘opacity’ to 20%. Set the ‘width:’ value to 2.0. From the ‘edge properties’ dialogue box, click the ‘apply’ button. Click ‘close’.

How to extract a network subgraph using R

In a previous post I wrote about highlighting a subgraph of a larger network graph. In response to this post, I was asked how to extract a subgraph from a larger graph while retaining all essential characteristics among the extracted nodes.

Vinay wrote:

Dear Will,
The code is well written and only highlights the members of a subgraph. I need to fetch them out from the main graph as a separate subgraph (including nodes and edges). Any suggestions please.


Extract subgraph
For a given list of subgraph members, we can extract their essential characteristics (i.e., tie structure and attributes) from a larger graph using the igraph function induced.subgraph(). For instance,

library(igraph)                   # Load R packages

set.seed(654654)                  # Set seed value, for reproducibility
g <- graph.ring(10)               # Generate a ring graph object
E(g)$label <- runif(10, 0, 1)     # Add an edge attribute

plot(g)                           # Plot graph

g2 <- induced.subgraph(g, 1:7)    # Extract subgraph of nodes 1 through 7

plot(g2)                          # Plot subgraph
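As a quick check that the extraction really retains the essential characteristics, we can rebuild the example and compare node counts, edge counts, and edge attributes:

```r
library(igraph)

set.seed(654654)
g <- graph.ring(10)               # 10-node ring
E(g)$label <- runif(10, 0, 1)     # edge attribute on the full graph
g2 <- induced.subgraph(g, 1:7)    # keep nodes 1 through 7

vcount(g2)                        # 7 nodes survive
ecount(g2)                        # 6 ring edges connect those 7 nodes
all(E(g2)$label %in% E(g)$label)  # edge attributes carried over: TRUE
```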



Managing Content with Nodes and Links: Why I won’t use NVivo 10 to Prepare for My Preliminary Exam

The preliminary exam in almost any graduate program requires organizing a tremendous amount of reading material. One of the preliminary exams I’m taking, the one in social science research methods, for instance, requires familiarity with over 40 unique sources spanning 18 distinct topics. That’s a reading list over 4 pages long! Furthermore, the exam requires that I write three ten-page papers over the course of three consecutive days, answering three unseen methodology questions that may draw from any and all of the materials on the reading list. It’s a lot to read and a lot to recall. To organize the information in these reading materials, and to speed up my recall of topics, quotes, and my own notes on them, I tried using a qualitative data analysis (QDA) software system, a preparation strategy I really wish had worked better.

Generally speaking, QDA software systems allow researchers to organize qualitative data. These suites allow researchers to select content across a number of documents and classify all their selections under different themes. In slightly more technical terms, a researcher uses QDA software to select content across many documents, associates that content with different nodes (i.e., themes), and applies annotations to both. Content and themes created in this manner can then be relabeled and nested (or unnested) based on the sense-making of the person doing the research. Content selected this way is recalled by simply double-clicking the node associated with it. The result of all this work is a spidery network of content, which, as an organization method, offers some attractive qualities.

Organizing content, themes, and annotations by nodes and links is potentially a convenient, timesaving data organization strategy.

  • No longer must researchers copy and paste important quotes from their documents into separate files. Instead they can work directly on their documents and tag, with a flick of the mouse, whatever they think is important.
  • No longer must researchers work with long note outlines. Content is important, of course, but, when trying to make sense of a large collection of identified themes, content items can at times get in the way. Nodes and content are generally shown in QDA packages using separate windows, which simplifies the outline and allows for easier theme management. In this way a researcher can spend more time thinking about how their themes relate to one another and only look at quotes and annotations when they actually need them.
  • No longer must researchers keep track of page numbers. Each content item is tied to the original page of the document in which it was found. Page numbers, in this way, need only be written out when a researcher is actually ready to write about the referenced content. Front-loading page numbers is a lot of work, and needless work when the content items identified never make it into the working document.

QDA software packages are a promising way in which researchers can spend more time reading and thinking about their content than explicitly managing it.

To investigate QDA, I looked into QSR International’s software suite NVivo 10. Plenty of great tutorials on how to use NVivo 10 exist, put out by both QSR International and members of the NVivo community. For this reason, I’m going to spend more time on what I didn’t like about NVivo 10 than on how to do specific things with it. I also offer suggestions as to how NVivo 10 could be made more effective.

Node Matrix Column Width Adjustments Change Other Column Widths



  • Resizing the column ‘Created On’ resized all other columns as well, most notably the ‘References’ column. These other columns now need to be corrected themselves, which requires the user to do extra work. This needs to be fixed.

Node Column Names Misalign after Adjusting Column Widths in a Narrow UI Frame



  • Resizing the column ‘Name’ misaligned the column names of the node matrix. Because of this, tracking column names now falls to the user, which is extra work they might prefer not to do. This needs to be fixed.

Can’t Easily Retrieve Content From a PDF File

Select and Associate Content with a Node

Open Node Frame

  • Content selected by ‘Region’ isn’t shown when opening an ‘Open Node…’ frame. Instead, the region coordinates of the selected content are shown. The point of retrieving content is to actually get the content, not a list of instructions as to where the content is located. Content retrieval executed this way passes the burden of retrieval to the user.

View Selected Content


  • Selected content can be viewed in the ‘Open Node…’ frame under the ‘PDF’ tab. Even though the unselected part of the document is masked, the content is only found by scrolling around the document itself. This has the benefit of connecting the content with the page it’s on, but getting the content still requires the user to do the work. Is it possible, upon opening an ‘Open Node…’ frame, to generate and display an image of the region selected instead of the coordinates at which it is located? Or, alternatively, is it possible to make available the tools usually available when working with image files when working with PDF files?

Can’t ‘Insert Row’ Content with a PDF File

  • PDF files are not treated the same way as image files, though working with PDF files might be easier if they were.

Working with an Image File


  • When working with an image file, selected regions can be inserted into a table using the ‘Insert Row’ option. Doing this allows a user to more or less overlap a comment with a selected image region, a comment which can then be connected to a node and recalled as text when needed. In this way, a scanned document can be coded, which, in a roundabout way, can include a PDF file, provided the PDF file has first been exported to a collection of image files. This conversion process, however, is a lot of work when dealing with multiple PDF files, each of which may contain multiple pages. Is it possible to include a bulk file conversion function with the software suite?

Annotations Not Displayed In-Line With Retrieved Content


  • Content annotations are retrieved and shown when opening an ‘Open Node…’ frame. However, because annotations are separated from the content items to which they refer, transferring content to a working document takes added work. Instead of copying and pasting all retrieved content from an ‘Open Node…’ frame into a working document at once, the user must intersperse their annotations among the transferred content items manually, piece by piece. It falls to the user to copy an annotation, search the working document for the annotation’s associated content item, and paste it into place. Why not give users the option, when viewing a node in an ‘Open Node…’ frame, to have their annotations inserted in-line with the content items themselves?

Can’t Quickly Unnode Content

  • Removing a selected content item from a node requires the user to first find the original content item they selected. This isn’t so bad, since the software mostly keeps track of this through nodes, but it still requires the user to find and select the item they originally selected, which can mean extra work. Why not let users deselect content from nodes through the ‘Open Node…’ frame?

Have to Readjust the UI for Every Document Opened

Adjusted Workspace

New Workspace

  • Every time a user opens a document, they must adjust the UI to work with it. With a lot of files, this means repeating the following sequence of steps: click ‘Click to edit’, move and readjust the frame, readjust the region-content table (when working with images), zoom in on the file viewed, and reselect ‘Nodes’ from the navigation pane. This is too much work. Why not make newly opened documents default to the last configuration specified by the user? Or, alternatively, why not implement a tab system of sorts, where files open into the user-adjusted workspace?

To be fair, NVivo 10 does everything I want a QDA application to do. It lets me select content, associate content with different nodes, nest and unnest nodes, modify node labels, annotate content and nodes, and, most importantly, recall everything with simply the click of a mouse. However, NVivo 10 falls short on content selection options and UI design, which in turn create extra work for the user. One scanned PDF document, for instance, can, when converted to a collection of images, require a user to manage about thirty separate files, and those files demand a lot of unnecessary software fidgeting. This fidgeting adds up fast and quickly outweighs the productivity gains of using the software. At this point in its development, I don’t recommend NVivo 10 to graduate students looking for an effective means of managing their preliminary exam materials.