RMarkdown Tips and Tricks

An Introduction to RMarkdown

Welcome to the first RMarkdown tutorial! We will be discussing some basic tips and tricks to interface with an RMarkdown document with concepts ranging from beginner to intermediate. In this tutorial I assume you have some base level understanding of R, RStudio, and an awareness of Markdown language. For a comprehensive list of functionality in RMarkdown, please refer to the RStudio cheat sheet.

Creating your RMarkdown file can be done by selecting “File” at the top of the RStudio window, then “New File” > “RMarkdown…”, which will prompt you with an initial interface like this one:

RMarkdown Selection Screen

Here you can pre-define the author and title of the document, which can always be changed later, as well as the output format you want the RMarkdown to produce (also changeable at any time). If you’re like me, you spend a lot of time creating reports for colleagues, so we are going to choose “Document” in the leftmost panel. For the output, there are different pluses and minuses to using HTML, PDF, and Word. I am personally most fond of HTML because it allows for interactive web displays and far more customization through the myriad packages R provides. However, some workplaces will demand the latter two formats, and of these I find Word, being a Microsoft product, the most difficult to work with, so we’ll tackle that one.

The first portion of a typical RMarkdown document consists of a “YAML” header (the name originally stood for “Yet Another Markup Language” and is now the recursive “YAML Ain’t Markup Language”) where you specify parameters and features.

---
title: "RMarkdown Tutorial"
author: "Richard Hanna"
date: "5/25/2019"
output: 
  word_document:
    reference_docx: Source_Doc.docx
params: 
  ICD9:
    label: "Please Select the ICD9 Code You Wish To Analyze"
    value: "4019"
    input: select
    choices: ["4019", "4280", "42731", "41401", "5849", 
              "25000", "2724", "51881", "5990", "53081"]
---

Don’t worry too much about the params and output sections; we will cover these soon. What’s important to note here is that we have initialized our markdown document with a title, author, and date. A nice tip is that you can automate the date portion by setting it to the following inline code: "`r format(Sys.time(), '%d %B, %Y')`". For those unaware, the backtick (`) allows you to write inline code in R as well as other languages, which we’ll soon see. On a standard QWERTY keyboard it sits at the top left, to the left of the “1” key and above “Tab”.
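To preview what that inline expression produces before knitting, you can evaluate the same format string in the console. This is only a small sketch: the fixed date is an assumption chosen for illustration, and the month name will follow your system locale.

```r
# Same format string as the YAML tip: day, full month name, four-digit year
date_format <- "%d %B, %Y"

# What the header would show if the document were knitted right now
today_label <- format(Sys.time(), date_format)

# A fixed date (hypothetical, for illustration) makes the pattern easy to see
fixed_label <- format(as.Date("2019-05-25"), date_format)

print(today_label)
print(fixed_label)
```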

The next thing to get acquainted with is the “Knit” button. You can find it towards the top of the scripting window, marked with a big ball of yarn. Simply clicking “Knit” will weave together your markdown document, and you can watch as the output is produced. If you click the drop-down arrow next to “Knit” you can see some additional options including alternative output formats and, most importantly, “Knit with Parameters…”. For those who prefer to work in the command line, you can also create your output document using the render() function from the rmarkdown package.
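As a rough sketch of the command-line route, the wrapper below calls rmarkdown::render() on a file. The helper name knit_to_word and the file name in the example call are placeholders of my own, not files from this tutorial, and rendering assumes the rmarkdown package and pandoc (which ships with RStudio) are available.

```r
# Hypothetical helper: knit an .Rmd file to Word without touching the Knit button
knit_to_word <- function(input, output_dir = dirname(input)) {
  # Fail early with a readable message if the file is not where we expect
  if (!file.exists(input)) {
    stop("Cannot find input file: ", input)
  }
  # render() returns the path of the document it created
  rmarkdown::render(
    input         = input,
    output_format = "word_document",
    output_dir    = output_dir
  )
}

# Example call (placeholder file name):
# knit_to_word("RMarkdown_Tutorial.Rmd")
```

Naming the format as a string keeps whatever options the YAML header declares for that format, so a reference_docx entry should still be picked up.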

Code Chunks and Change

On your initial startup you’ll also see what’s known as a “code chunk,” which acts like a self-contained script. You can run this on its own by clicking the play button (the green triangle) at the top right of the chunk. This first one is the setup chunk, which sets the default knitr options applied to every later chunk (here, echo = TRUE so that code is shown alongside its output). I tend to also put any libraries necessary for my project here for convenience.

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(RPostgreSQL)

For the purposes of this tutorial I am choosing to load the tidyverse in its entirety, but many professionals will scoff at you for doing so. The tidyverse is a family of packages that are very commonly used in R but you may not use all of them on load and it takes up needless resources to load the whole thing. Here we specifically want access to ggplot2 and dplyr.
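If you would rather avoid loading the whole tidyverse, a leaner setup chunk might look like the sketch below. The message and warning options are my own additions rather than part of the original chunk; they are standard knitr chunk options that keep package start-up chatter out of the report.

```r
# Global chunk defaults: show code, but keep package start-up messages
# and warnings out of the rendered report
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

# Load only the two tidyverse packages this tutorial relies on
library(ggplot2)  # plotting
library(dplyr)    # data manipulation and the %>% pipe
```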

This next code chunk allows me to connect to the PostgreSQL database on my local computer where the MIMIC III data is stored. All code chunks start and end with three backticks (```), and the opening set is followed by a pair of curly brackets, {}, where the language and additional features are specified. That’s right, “language” meaning not just R! More on that soon…

If you’ve never used the source() command it’s high time you did. As your code develops and becomes more complex, it will be too burdensome to keep everything on a continuous markdown document that begins to resemble the world’s worst run on sentence I mean how bad would that be it would be terrible it would be nearly bad enough to make viewers stop reading this post- ok sorry, got carried away…

source() references another R Script and runs the whole thing, adding any new variables to your environment. It’s also useful for times like this where I don’t want to directly display my password.
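To make the mechanics concrete, here is a small self-contained sketch. It writes a stand-in credentials script to a temporary file (the real mimiciii_con.r is not shown in this post, so its contents here are an assumption) and sources it into a separate environment. Calling source() with its defaults, as this tutorial does, places the variables in your global environment instead.

```r
# Write a tiny stand-in credentials script to a temporary file.
# In a real project this file would already exist and stay out of version control.
cred_file <- tempfile(fileext = ".R")
writeLines('password <- "not-a-real-password"', cred_file)

# A fresh environment: nothing called `password` lives here yet
creds <- new.env()
exists("password", envir = creds, inherits = FALSE)

# source() runs every line of the script; local = creds keeps the new
# variables in their own environment instead of the global one
source(cred_file, local = creds)

# The value is now available without ever being typed into the main document
exists("password", envir = creds, inherits = FALSE)
```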

We are also going to use the RPostgreSQL library to connect to a specific database known as MIMIC III to make our data more interesting.

source("mimiciii_con.r")

# Load the PostgreSQL driver, then connect using the password defined in mimiciii_con.r
pg <- dbDriver("PostgreSQL")

con <- dbConnect(pg, user = "postgres", password = password,
                 host = "localhost", port = 5432, dbname = "mimic")

A Quick Word on MIMIC III

The MIMIC III database (Medical Information Mart for Intensive Care III) is a freely accessible, open source critical care database of “deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.” I am choosing to use this database for two reasons:

  • It is always more interesting to use data you are interested in. We could use classic mtcars or iris datasets, but they aren’t very exciting and they’re overused (I’ve even used them here, and likely will again!)
  • Using MIMIC III in combination with the RPostgreSQL package will allow me to show some additional RMarkdown functionality I find extremely helpful and interesting

If you have an interest in the database please explore it on Physionet’s site! There are a few prerequisites to complete before gaining access, but it is an extremely powerful and often cited resource.

Using Other Languages in RMarkdown: SQL

I was astounded when I learned that I could execute and compile different languages in the same document. Remember that part about specifying {r} when setting up your code chunk? Well here we can specify {sql} in order to incorporate and run SQL commands from the PostgreSQL database we established a link to earlier. If you are not familiar with SQL syntax, not to worry as there are ways around it to keep you in R syntax if that’s your preferred comfort zone (though for large SQL databases you will likely want to run queries from the SQL database manager of your choice).

Here I run a sample query from MIMIC III’s own tutorials page to pull all of the first patient admission times from the database so as not to grab multiple admissions for the same patient. I am also specifying the connection point and telling RStudio to output the contents of the code chunk to a variable called pt_admit_times_first. Due to the way markdown interacts with the output to this site, I am showing the arguments of my curly brackets in a comment within the code chunk.

/* ```{sql, connection = con, output.var = "pt_admit_times_first"} */
SELECT p.subject_id, p.dob, a.hadm_id, p.gender, p.expire_flag, a.admittime,
    MIN (a.admittime) OVER (PARTITION BY p.subject_id) AS first_admittime
FROM admissions a
INNER JOIN patients p
ON p.subject_id = a.subject_id
ORDER BY a.hadm_id, p.subject_id;
/*```*/
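For comparison, a {sql} chunk with output.var behaves much like calling DBI::dbGetQuery() from an ordinary R chunk and assigning the result. The sketch below uses an in-memory SQLite database as a stand-in for the PostgreSQL connection (this assumes the DBI and RSQLite packages are installed), and every row in it is made up.

```r
library(DBI)

# In-memory SQLite database as a stand-in for the PostgreSQL connection
# (assumes the RSQLite package is installed; all rows below are made up)
demo_con <- dbConnect(RSQLite::SQLite(), ":memory:")

dbWriteTable(demo_con, "patients", data.frame(
  subject_id = c(1, 2),
  gender     = c("F", "M")
))
dbWriteTable(demo_con, "admissions", data.frame(
  subject_id = c(1, 1, 2),
  hadm_id    = c(101, 102, 201),
  admittime  = c("2005-03-01 10:00:00", "2005-01-15 08:30:00",
                 "2010-07-04 12:00:00")
))

# Same window-function pattern as the chunk above, trimmed to a few columns
pt_first_demo <- dbGetQuery(demo_con, "
  SELECT p.subject_id, a.hadm_id, p.gender, a.admittime,
         MIN(a.admittime) OVER (PARTITION BY p.subject_id) AS first_admittime
  FROM admissions a
  INNER JOIN patients p
    ON p.subject_id = a.subject_id
  ORDER BY a.hadm_id, p.subject_id;
")

dbDisconnect(demo_con)

# Every admission row keeps its own admittime plus the patient's earliest one
print(pt_first_demo)
```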

dplyr, SQL, and the Happy Medium

If you choose to run your SQL queries through R I think this is the best way to do so. You could also always export your query results to an importable format for R to read into your environment. A third option to show as a proof of concept is using dplyr’s SQL-like syntax to get around both of these entirely, though as I mentioned earlier this comes with the caveat of relying too heavily on R to bear the brunt of the labor which it wasn’t intended to do.

For those unfamiliar with dplyr, it is an incredibly useful library that helps with data massaging and combination through the “grammar of data manipulation.” Check out the dplyr cheat sheet for a more in-depth review of what is available in it.

One of the most important features dplyr makes available is “piping,” written as %>% (the operator itself comes from the magrittr package, which dplyr re-exports). Make a note to yourself that the RStudio shortcut for this is CTRL + SHIFT + M; it’ll be your friend, trust me. Piping feeds data through a series of commands, which takes out the hassle of constantly repeating the same variable names. Compare the following two methods for displaying a simple plot:

data(mtcars)

# non-dplyr method

hp150_cyl4 <- subset(mtcars, mtcars$hp > 150 & mtcars$cyl > 4)
ggplot(hp150_cyl4, aes(x = wt, y = mpg, color = cyl)) + geom_point()

# dplyr method

library(dplyr)

mtcars %>% 
  filter(hp > 150 & cyl > 4) %>% 
  ggplot(aes(x = wt, y = mpg, color = cyl)) + geom_point()

Here we are looking to display the weight, mpg, and cylinders of all cars in the mtcars dataset, but only for cars with horsepower greater than 150 and more than 4 cylinders. One of the key differences between the two methods, other than the %>%, is that the pipe passes the result of each step along as the first argument of the next. You’ll notice we did not have to tell ggplot() where the data was coming from; we could skip straight to supplying the aes() mappings. We also did not have to create an entirely new variable through subset() and thereby clog up our environment; we were simply able to use dplyr’s filter() command. Additionally, while it may look like the dplyr variant takes up more lines, it is actually all a single expression that runs from filtering to plotting in one go.
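If the pipe still feels like magic, this short check may help. It shows that a piped call and the equivalent nested call return the same object, and that base R’s subset() picks out the same number of cars.

```r
library(dplyr)

data(mtcars)

# The pipe passes its left-hand side as the first argument of the next call,
# so these two expressions are the same operation written two ways
piped  <- mtcars %>% filter(hp > 150 & cyl > 4)
nested <- filter(mtcars, hp > 150 & cyl > 4)
identical(piped, nested)

# Base R subset() selects the same cars
base_rows <- subset(mtcars, hp > 150 & cyl > 4)
nrow(piped) == nrow(base_rows)
```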

Back to our task at hand, let’s compare a dplyr method to the previously used SQL code chunk technique on our MIMIC III dataset. Let’s take the output of our previously queried patient first admission times:

SQL example output

And compare it to a synonymous dplyr method similar to what we used on the mtcars dataset:

SQL example output
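Since the dplyr version appears above only as a screenshot, here is one way such a pipeline could be written. It is a sketch rather than the exact code used for this post: the two small data frames are made-up stand-ins for the patients and admissions tables, and the comments map each step back to the SQL clause it mirrors.

```r
library(dplyr)

# Toy stand-ins for the MIMIC III tables (all values are made up)
patients <- data.frame(
  subject_id  = c(1, 2),
  gender      = c("F", "M"),
  expire_flag = c(0, 1)
)
admissions <- data.frame(
  subject_id = c(1, 1, 2),
  hadm_id    = c(101, 102, 201),
  admittime  = as.POSIXct(c("2005-03-01 10:00:00", "2005-01-15 08:30:00",
                            "2010-07-04 12:00:00"), tz = "UTC")
)

pt_first_dplyr <- admissions %>%
  inner_join(patients, by = "subject_id") %>%     # INNER JOIN ... ON subject_id
  group_by(subject_id) %>%                        # PARTITION BY p.subject_id
  mutate(first_admittime = min(admittime)) %>%    # MIN(a.admittime) OVER (...)
  ungroup() %>%
  arrange(hadm_id, subject_id)                    # ORDER BY a.hadm_id, p.subject_id

print(pt_first_dplyr)
```

Using mutate() rather than summarise() is what keeps one row per admission, just as the window function does in SQL.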

And voila! We achieved precisely the same output, proving that there are multiple ways to achieve desired results in R (a lesson that will serve you well). With all of these methods in mind it will be up to you to determine what suits your comfort level while achieving optimal efficiency for your program.

Sample Plot and Aggregate with MIMIC III

Next let’s get acquainted with the ICD9 codes that MIMIC III uses in its diagnoses. The ICD9 codes are publicly available and searchable through various sites. Here we use the diagnoses_icd table to pull all distinct codes and aggregate them via COUNT(). We will also join the d_icd_diagnoses table using the ICD9 code to pull in the short titles associated with each diagnosis.

/* ```{sql, connection = con, output.var = 'icd9_counts'}  */
SET SEARCH_PATH TO mimiciii;

SELECT dcode.ICD9_Code, count(dcode.ICD9_CODE) as ICD9_Counts, dname.short_title
FROM diagnoses_icd as dcode
INNER JOIN d_icd_diagnoses as dname
ON dcode.icd9_code = dname.icd9_code
GROUP BY dcode.ICD9_CODE, dname.short_title
ORDER BY ICD9_Counts DESC;
/*```*/

We then take the output variable, icd9_counts, and use it to create a plot of the 10 most common diagnoses present in the database.

Make note of the use of “fig.width” in the curly brackets to control for the output graph size.

#```{r, fig.width=12}

common_icd9 <- head(icd9_counts, 10)

common_icd9$short_title <- factor(common_icd9$short_title, levels = unique(common_icd9$short_title))

levels(common_icd9$short_title) <- gsub(" ", "\n", levels(common_icd9$short_title))

common_icd9 %>% 
  ggplot(aes(x = short_title, y = icd9_counts)) + geom_bar(stat = 'identity', fill = "dodgerblue2") +
  geom_text(aes(x=short_title,y=icd9_counts,label=icd9_counts),vjust= -.5) +
  theme_bw() + ggtitle("Counts of Most Common ICD9 Codes") + xlab("ICD9 Code") +
  ylab("Count (n)") + ylim(c(0,25000))
#```
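One detail in the chunk above that is easy to miss is the factor() call. ggplot2 orders a character axis alphabetically, so fixing the levels in their existing order is what keeps the bars sorted by count. A small base R sketch with made-up labels shows the difference:

```r
# Made-up labels, already sorted from most to least common
titles <- c("Hypertension NOS", "CHF NOS", "Atrial fibrillation")

# Default factor levels are sorted alphabetically
levels(factor(titles))

# Passing levels = unique(...) keeps the order in which the values appear
ordered_titles <- factor(titles, levels = unique(titles))
levels(ordered_titles)

# Swapping spaces for newlines in the levels wraps long axis labels
levels(ordered_titles) <- gsub(" ", "\n", levels(ordered_titles))
levels(ordered_titles)
```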

ICD9 Count Output

Introducing Parameters

We’ve finally gotten to the point where we can begin talking about the params: portion of the YAML header we started this tutorial with. For a comprehensive introduction to parameters, I highly encourage reading through RStudio’s tutorial as well.

Parameters are extremely useful when using the same coding architecture but needing to make slight changes to the output. For example, with my work at the pediRES-Q collaborative I often have to write reports for the many participating institutions who contribute to our data. For all intents and purposes, the report is the exact same output for every hospital; all that changes is the focus of data related to that institution. In the simple example I will introduce, I will look into the individual diagnoses that we discovered to be most common in our initial ICD9 analysis.

In the YAML you see a few features created:

params: 
  ICD9:
    label: "Please Select the ICD9 Code You Wish To Analyze"
    value: "4019"
    input: select
    choices: ["4019", "4280", "42731", "41401", "5849", "25000", "2724", "51881", "5990", "53081"]

Here we first declare that we are supplying a parameter to make available in our code. We name it “ICD9” so that it can be accessed in the document as params$ICD9. label is the text that appears in the Shiny GUI app when we knit with parameters. value is the initial value, and input is where we define what type of input control this is (e.g. dropdown selection, text input, numeric input). choices is what brings this all together, listing the specific values we can pick from when we run our RMarkdown. Here you can see the ICD9 codes for each of the top 10 most common diagnoses.
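To get a feel for how a parameter is read, the sketch below mimics the params object in a plain R session. Inside a knitted document params is created for you from the YAML header and should not be reassigned, so the first line is only a stand-in for experimenting in the console. The membership check is my own defensive addition.

```r
# Inside a knitted document, `params` is created for you from the YAML header.
# This list only mimics it so the snippet can run in a plain R session.
params <- list(ICD9 = "4019")

# The same codes offered in the `choices` field of the header
icd9_choices <- c("4019", "4280", "42731", "41401", "5849",
                  "25000", "2724", "51881", "5990", "53081")

# Parameters are read like any other list element
selected_code <- params$ICD9

# A defensive check: stop early if the value is not one of the allowed codes
stopifnot(selected_code %in% icd9_choices)

# The value arrives as a character string, matching the quoted YAML entries
class(selected_code)
```

Quoting the codes in the YAML keeps them as strings, which is what lets a later comparison against a text column of codes behave as expected.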

Let’s say our new goal is to analyze the survival of patients, stratified by gender, for specific diseases. So essentially, if we wanted to know the counts of mortality of male vs female patients for non-specific hypertension (the most common of the ten diseases we queried), how would we go about it? First let’s make a new query searching for just patients and their ICD9 codes, joining the short titles for clarity and output this to a variable called pt_icd9:

/*```{sql, connection = con, output.var = "pt_icd9"}*/
SET SEARCH_PATH TO mimiciii;

SELECT dcode.subject_id, dcode.ICD9_Code, dname.short_title
FROM diagnoses_icd as dcode
INNER JOIN d_icd_diagnoses as dname
ON dcode.icd9_code = dname.icd9_code
GROUP BY dcode.ICD9_CODE, dname.short_title, dcode.subject_id;
/*```*/

Next, using R and dplyr syntax, let’s join the pt_admit_times_first data set with our new pt_icd9 query output. We will also use the group_by() and tally() commands from dplyr to group by gender and sum up their totals:

#```{r}
pt_common_codes <- left_join(x = pt_admit_times_first, y = pt_icd9, by = "subject_id")

pt_common_codes <- pt_common_codes %>% 
  filter(icd9_code %in% common_icd9$icd9_code)

pt_common_codes %>% 
  filter(icd9_code == "4019") %>% 
  group_by(gender) %>% 
  tally() %>% 
  ggplot(aes(x = gender, y = n, fill = gender)) + geom_bar(stat='identity') +
  geom_text(aes(y=n,label=n),vjust= -.5) +
  theme_bw() + ggtitle("Gender Spread with NOS Hypertension") + xlab("Gender (M/F)") +
  ylab("Count (n)") + ylim(c(0,15000))
#```

Hypo Gender Output

Here, we can see that we solved two of our deliverables: we stratified by gender and we input “4019” for our ICD9 code… albeit manually. Now, let’s add in the expiration flag to check for mortality and include this by expanding our ggplot() code portion to include the facet_wrap() command so we can see combined plot analysis of mortality. Finally, we parameterize this whole command by replacing “4019” with params$ICD9 based on the YAML we declared in our params header.

#```{r}
pt_common_codes %>% 
  filter(icd9_code == params$ICD9) %>% 
  group_by(gender, expire_flag) %>% 
  tally() %>% 
  ggplot(aes(x = gender, y = n, fill = gender)) + geom_bar(stat='identity') +
  geom_text(aes(y=n,label=n),vjust= -.5) +
  theme_bw() + ggtitle(paste0("Gender Spread with ", params$ICD9)) + xlab("Gender (M/F)") +
  ylab("Count (n)") + ylim(c(0,15000)) + facet_wrap(~expire_flag)
#```

Now when you knit (being sure to select “Knit with Parameters…”) you will be greeted with this Shiny input app:

param knit

Clicking the drop-down will show you all of the options you made available in the YAML. Select whichever option you want to analyze; here I will choose ICD9 code 5849. Knitting will output your new RMarkdown Word document with the final plot looking like:

param knit
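Picking one code at a time in the GUI is fine for a single report, but when you need one report per code (or per hospital) the same thing can be scripted. The sketch below passes each code through the params argument of rmarkdown::render(). The helper names, the output naming scheme, and the file name in the example call are all hypothetical, and rendering assumes rmarkdown and pandoc are available.

```r
# Hypothetical helper: one output file name per ICD9 code
report_name <- function(code) {
  paste0("ICD9_", code, "_report.docx")
}

# Hypothetical batch function: render the same .Rmd once per code,
# passing each code in through the `params` argument of render()
render_all <- function(input, codes, output_dir = tempdir()) {
  vapply(codes, function(code) {
    rmarkdown::render(
      input       = input,
      params      = list(ICD9 = code),
      output_file = report_name(code),
      output_dir  = output_dir,
      envir       = new.env(),  # fresh environment so runs do not share variables
      quiet       = TRUE
    )
  }, character(1))
}

report_name("5849")

# Example call (placeholder file name):
# render_all("RMarkdown_Tutorial.Rmd", c("4019", "5849"))
```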

Parting Thoughts and Conclusions

And there you have it! I hope by now you can see the value in parameterizing reports and how powerful it can be to take full advantage of the RStudio system for markdown documents and outputs. Keep in mind that if outputting to HTML formats you will have way more customization options with the ability to give your users interactive plotting, tabs, highlighting, etc.

For your use I have included my RMarkdown file and Word document output here so that you can see how a pre-defined Word template can be used to customize your output. The template is available as well. Just keep in mind you won’t be able to run this .Rmd as is since it is connected to my local database.

Thank you for visiting and happy coding!

I want to note that not all of these analyses are statistically valid. I cannot declare that men survive acute kidney failure more easily than women, as my graph above might suggest; there are many reasons why these plots are not the best ones for drawing conclusions. All analyses in this tutorial were designed for the purpose of walking through markdown techniques and do not reflect anything beyond that.