United States Gun Violence

Introduction

Disclaimer

Recognizing the contentious nature of the subject matter, any views or opinions expressed in this post are my own. The intention of this write-up is to present facts and provide one way of displaying them. Any conclusions or opinions drawn from it are left to the reader's own discretion.

Purpose

In the wake of numerous mass shooting incidents in the United States towards the end of 2019, I decided to take a look at how I could synthesize whatever data might exist and display it in a “living” dashboard tool. This tool would serve as a present-day snapshot of demographics, frequency, and geographic location of the incidents in a descriptive manner which could allow for future expansion.

My personal wish was to learn more about gun control and reporting in this country and to apply that interest to new R packages and techniques I wouldn't typically use otherwise. Building this dashboard allowed me to learn leaflet and the mapping functionality in highcharter, expand my knowledge of rvest, and touch on purrr and DT, all of which have great utility in this dashboard and beyond it. The current result seen here (as of 1/12/2020) is something I am very proud of and am looking forward to developing further as time goes on (and if one day I have nothing to report in these charts… well, I would be ok with that).

In addition to detailing United States gun violence, I also dedicated a section of this dashboard to my home city of Philadelphia. I chose this because of my love for this city but also because of the impressive amount of open source data the city has made available through the https://www.opendataphilly.org/ catalog which goes into greater detail than national archives.

Assumptions and Setup

A few assumptions are made for anyone wishing to recreate this dashboard:

  • You have a foundational understanding of R and RStudio
  • You have a foundational understanding of working in markdown
  • You have exposure to flexdashboard for dashboard creation

For anyone who would like to download this project, it is freely available on my GitHub page.

You can find the dashboard itself later on in this post, but also on my shinyapps.io site.

Data Sources and Rationale

The dashboard itself details some of this in the user guide but it’s worth mentioning here where the data for this dashboard comes from and why it was chosen.

Interestingly enough, as of this post there was no gun violence statistical data set available on the government Data.gov website. According to the Mass Shooting Tracker, one of the resources used for this dashboard, “at this point (December 2015) Congress has effectively blocked the CDC from researching the underlying causes of gun violence.” Due to this, two main sources were used:

  • The Gun Violence Archive (GVA)
  • The Mass Shooting Tracker (MST)

Things to consider when looking at various data sources are the definitions of “mass shooting” (the FBI actually has no definition for what constitutes one), the availability of the data, and the curators who maintain it. Gun Violence Archive describes their organization as such:

“Gun Violence Archive (GVA) is a not for profit corporation formed in 2013 to provide free online public access to accurate information about gun-related violence in the United States.”

And defines a mass shooting as:

“GVA uses a purely statistical threshold to define mass shooting based ONLY on the numeric value of 4 or more shot or killed, not including the shooter.”

GVA is one of the most reputable and reliable sources of gun violence statistics and record keeping I’ve found. I highly recommend checking them out for national descriptive statistics where gun violence is concerned.

On the other hand, the Mass Shooting Tracker (MST) followed a slightly more relaxed definition:

“We define a “mass shooting” as a single outburst of violence in which more than 3 people are shot. This is not the same as mass murder as defined by the FBI.”

It is worth mentioning a few things about the Mass Shooting Tracker. The original website changed ownership in late 2019, so the database was downloaded and frozen, encompassing only data from late 2013 to October of 2019. When viewing data in this dashboard, any national data beyond October 2019 relies solely on the GVA database. The new MST database is set up differently from its predecessor; I will be keeping an eye on it as it grows and will work on incorporating the new data into the dashboard. MST is also a crowd-sourced platform, meaning anyone can submit data to it. While this may initially raise concerns about faulty data, it proves to be a remarkably quick and effective way to gather data, and submissions are moderated by the owners, who require validated links to the incidents.

Other potential sources of data worth mentioning include Mother Jones and the Stanford Geospatial Center MSA database. Unfortunately, the MSA was discontinued, so I did not include it. Mother Jones contains data from before 2013, but that data was so scarce that I opted not to include it either, and anything beyond 2013 was already captured in GVA and MST.

Coding the Dashboard

Let’s start getting into how to construct the dashboard by walking through chunks of the code that will dictate our workflow.

Flexdashboard and Library Setup

If flexdashboard is unfamiliar, know that it is built as a template on top of rmarkdown, so those familiar with RMarkdown should find the YAML header below pretty easy to follow:

---
title: "Gun Violence Data Tracker"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    theme: readable
runtime: shiny
---

This dashboard also needs a number of libraries to be loaded. If any packages are missing, just call the install.packages() command, but don't forget to wrap the package name in quotation marks.
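For example, a couple of them can be installed in one call:

install.packages(c("flexdashboard", "highcharter")) # package names as quoted strings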

# Load Applicable Libraries ====================================================
library(shiny) # Allows for shiny runtime application in the dashboard
library(tidyverse) # Mainly used for `dplyr` grammar and piping but also some other functions
library(rvest) # Required for web scraping data
library(purrr) # Used for mapping functions
library(lubridate) # Used for date functions
library(RColorBrewer) # Allows for grabbing certain colors for plotting
library(highcharter) # The primary interactive data visualization package used
library(flexdashboard) # The framework for the dashboard
library(leaflet) # Used for one of the GIS mapping plots
library(DT) # Used to create one of the filterable tables

Much of this post will assume familiarity with the flexdashboard setup (i.e. how to arrange panels, apply themes, set headers, etc.) that can be found over at the flexdashboard website.
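As a rough illustration of that layout markup (a minimal sketch, not the actual layout of this dashboard), columns and panels are declared directly in the rmarkdown body:

Column {data-width=650}
-----------------------------------------------------------------------

### Monthly Incidents

(the R code chunk that renders this panel's chart goes here)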

Scraping Web Data

The first step in supplying a data stream to this dashboard is to scrape the data from our sources, namely the GVA website. Please consider checking out my post on Building Shiny Apps with Rvest for a more in-depth discussion of the methodology behind the rvest package in relation to the code below. The GVA website stores its reports at separate web addresses for each year, with clickable pages for scrolling through entries. For example, the year 2014 has the web address “https://www.gunviolencearchive.org/reports/mass-shootings/2014?page=1”, and scrolling through pages changes the page number at the end of the URL:

GVA Scroll Bar

Using rvest I can set up an extraction call to pull HTML nodes off the page and store their text values in a dataframe (the aforementioned post describes how to determine which nodes to select using the SelectorGadget extension in the web browser). The first key element to note in the code below is that the “page=” at the end was left trailing on purpose. Looking back at the GVA website for the year 2014, there are 11 pages of data for this year. Using lapply and paste0 I can apply the rvest extraction to each page by supplying the web addresses for pages 0 through 11 to the lapply function and saving the results to the df2014 list.

Calling up df2014 without any transformations displays what looks like a large, unwieldy list. unlist will transform this into a single vector of elements. The reason I want to do this is that there is a noticeable pattern in the entries in this form: every 8 elements belong to the same category. This lets me call the seq command to start categorizing and grouping the elements into specific variables in a dataframe, which will be much easier to work with.

# Year Frames===================================================================
# Extract all HTML nodes and assign to a dataframe for that year. To save space,
# column names are not labelled per year, so the following code MUST be run in
# order.

# Note that for Years 2016 on there is no incident ID column which must be
# dropped from the rest once combined

# 2014 Data ====================================================================
df2014 <- lapply(paste0("https://www.gunviolencearchive.org/reports/mass-shootings/2014?page=", 0:11),
                 function(url){
                   url %>% read_html() %>% 
                     html_nodes("tr td") %>% 
                     html_text()
                 })
df2014 <- unlist(df2014)
df2014 <- data.frame(
  incident_id = df2014[seq(from = 1, to = length(df2014), by = 8)],
  incident_date = df2014[seq(from = 2, to = length(df2014), by = 8)],
  state = df2014[seq(from = 3, to = length(df2014), by = 8)],
  city_county = df2014[seq(from = 4, to = length(df2014), by = 8)],
  address = df2014[seq(from = 5, to = length(df2014), by = 8)],
  no.killed = df2014[seq(from = 6, to = length(df2014), by = 8)],
  no.injured = df2014[seq(from = 7, to = length(df2014), by = 8)]
)

df2014[1] <- NULL

This step is repeated for each of the years available. One of the downsides to this is the need to know how many pages each year has assigned to it (going over the true number causes the dataframe to repeat itself). So, for the current year (2020 as of this write-up) I will have to come back periodically to update how many pages have been made available. If anyone has a better solution to this, please feel free to contact me; I'm all ears!
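In the meantime, one possible workaround is to read the pager links off the page itself rather than hard-coding the page count. The sketch below is only illustrative: the ".pager a" CSS selector is an assumption about the GVA page structure and would need to be verified with SelectorGadget.

# Hypothetical helper: find the last page number for a given year's report
get_last_page <- function(year) {
  paste0("https://www.gunviolencearchive.org/reports/mass-shootings/", year) %>%
    read_html() %>%
    html_nodes(".pager a") %>%          # assumed selector for the pager links
    html_attr("href") %>%
    str_extract("(?<=page=)\\d+") %>%   # pull the page number out of each href
    as.integer() %>%
    max(0L, na.rm = TRUE)               # fall back to 0 if no pager is found
}

# Example use:
# df2014 <- lapply(paste0("https://www.gunviolencearchive.org/reports/mass-shootings/2014?page=", 0:get_last_page(2014)),
#                  function(url) url %>% read_html() %>% html_nodes("tr td") %>% html_text())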

A similar technique was used to scrape data from the MST into a .csv file prior to the freeze.

Now that all of the years are in their respective dataframes, combining and cleaning them are the remaining steps before visualization. One of the key things in working with dates is proper format setup. In this case, dates appear as a character string in the format “Month DD, YYYY”, but in almost all cases I prefer to have them as “YYYY-MM-DD” (notice the “%B” in the date conversion code below, which handles full month names). Using a combination of commands like str_remove, strsplit, and paste0 I can remove the elements of the date format I don't want and convert it into the display I do want. Using state.abb[match(total_gv_df$state, state.name)] I can convert the state names to their abbreviations. Finally, I want to emphasize always checking the formats of the data, since scraped data will most likely come through as character text.

# Combined GV Dataset ==========================================================

# Place all year data frames in a list
all_years <- list(df2013, df2014, df2015, df2016, df2017, df2018, df2019, df2020)

# Apply the rbind command to all elements of the list using do.call()
total_gv_df <- do.call("rbind", all_years)

# The current format of the date is an integer so convert it to a character
# since this is a non-typical date setup
total_gv_df$incident_date <- as.character(total_gv_df$incident_date)

# Remove commas, split the string into separate Year, Day, Month columns,
# then recombine into one column called "new_incident_date" via paste0
# and define as a Date data type. Then change State names to abbreviations.
total_gv_df$incident_date <- str_remove(string = total_gv_df$incident_date, pattern = ",")
total_gv_df$month <- lapply(strsplit(as.character(total_gv_df$incident_date), split = " "),"[",1)
total_gv_df$day <- lapply(strsplit(as.character(total_gv_df$incident_date), split = " "),"[",2)
total_gv_df$year <- lapply(strsplit(as.character(total_gv_df$incident_date), split = " "),"[",3)
total_gv_df$date <- paste0(total_gv_df$year, "-", total_gv_df$month, "-", total_gv_df$day)
total_gv_df$date <- as.Date(total_gv_df$date, "%Y-%B-%d")
total_gv_df$state <- state.abb[match(total_gv_df$state,state.name)] 
total_gv_df$city_county <- as.character(total_gv_df$city_county)
total_gv_df$state <- as.character(total_gv_df$state)
total_gv_df$no.killed <- as.integer(total_gv_df$no.killed)
total_gv_df$no.injured <- as.integer(total_gv_df$no.injured)
# Remove everything with and inside of parentheses to preserve consistency when 
# joining with the MST database.
total_gv_df$city_county <- gsub("\\s*\\([^\\)]+\\)","",as.character(total_gv_df$city_county))

Introducing highcharter

For most of the visuals in this dashboard I chose the highcharter package, which admittedly has very scarce documentation for anyone with R as a primary language. The Highcharts website has plenty to skim as an introduction to what can be done with it, but little support for R users. The most prominent reference material is Joshua Kunst's highcharter showcase page, which is a decent starting point but soon leaves users with questions as they dive deeper. My personal preference for interactive, HTML-based graphics is the plotly package, thanks to Plotly's extensive R documentation and general ease of use. However, I tend to enjoy highcharter's cleaner visuals and smoother end output, and that is exclusively why I am using it here.

I recommend always starting the highchart command through dplyr piping chains. Call the cleaned dataframe and pass it through the initial hchart call, where users can specify the chart type (e.g. “column”) and aesthetic mappings similar to typing out aes() in ggplot2. Unlike ggplot2, where aesthetic additions are tacked on using +, highcharter uses piping (%>%). A variety of commands allow for the addition of zoom features, axis specifications, credits, and really awesome themes that I encourage trying out. hc_exporting is another excellent feature from Highcharts that lets viewers download images of the visual in a variety of formats, as well as a .csv of the data that populates the graph itself.

Below is an example of how to setup the first graphic in the dashboard, the monthly incidents over time.

GVA_Month <- total_gv_df %>%
  dplyr::select(date, no.killed) %>%
  group_by(Date = floor_date(x = date, unit = "month")) %>%
  tally(name = "Number of Deaths") %>%
  hchart("column", hcaes(x = Date, y = `Number of Deaths`), name = "n = ", colorByPoint = F, color = "white") %>% # Initial call and specs
  hc_chart(zoomType = "xy") %>% # Allow for zooming in and out of cartesian plane
  hc_add_theme(hc_theme_db()) %>% # Specify Color Theme
  hc_title(text = "GVA Deaths In Mass Shootings - Monthly", align = "left") %>% # Title specification
  hc_exporting(enabled = TRUE, filename = "Monthly_GV_Plot") %>% # Allow users to export data and visuals
  hc_yAxis(title = list(text = "Number of Deaths (n)", style = list(fontSize = "16px", color = "white")),
           labels = list(style = list(fontSize = "12px", color = "white"))) %>% 
  hc_xAxis(title = list(text = "Date", style = list(fontSize = "16px", color = "white")),
           labels = list(style = list(fontSize = "12px", color = "white"))) %>% 
  hc_credits(enabled = TRUE, text = "Source: https://www.gunviolencearchive.org/")

Quick Note: Shiny Filter Rendering

When working with a flexdashboard, it's common to want to provide users with interactive options for changing and toggling displays. This requires shiny, R's primary web application package, so that reactive displays and runtimes are possible. Building a full shiny application takes a bit of learning, and tutorials for making your own can be found on RStudio's website. For now, shiny can be used very simply here to update the graphs by implementing a sidebar column in the main flexdashboard page:

Column {.sidebar}
-----------------------------------------------------------------------
dateInput(inputId = "startdate", label = "Select Start Date", value = "2013-01-01",min = "2013-01-01", format = "yyyy-mm-dd")

dateInput(inputId = "enddate", label = "Select End Date", value = "2019-12-31",min = "2013-01-01", format = "yyyy-mm-dd")

radioButtons(inputId = "victimtype", label = "Select Victim Type", selected = "Died", choices = c("Died", "Injured"))

This creates a sidebar off to the left with input widgets (in this case two date inputs and one radio button) that can be called in future code chunks of the rmarkdown document. All of the options and appropriate formatting can be found at RStudio’s Shiny support page. From here on, I can use input$startdate to call my initial start date widget and let the reactive elements of my code know to reference it and change when a user updates it.

The only remaining thing to do is to create a code chunk in your rmarkdown file that calls a render command (such as renderHighchart({ })) and to populate its internals with your filtering operations, affected variables, and final output.
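As a minimal sketch of such a chunk (assuming the total_gv_df built earlier, and that "Died" maps to the no.killed column and "Injured" to no.injured), it might look something like this:

renderHighchart({
  # Pick the column that matches the radio button selection
  victim_col <- if (input$victimtype == "Died") "no.killed" else "no.injured"

  total_gv_df %>%
    filter(date >= input$startdate & date <= input$enddate) %>%
    group_by(Date = floor_date(date, unit = "month")) %>%
    summarise(Victims = sum(.data[[victim_col]], na.rm = TRUE)) %>%
    hchart("column", hcaes(x = Date, y = Victims)) %>%
    hc_title(text = paste("GVA Victims by Month -", input$victimtype))
})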

Using Highmaps

After applying similar cleaning techniques to the MST database and combining it with the GVA data using SQL-like joins from dplyr, I made my state_gv.df dataframe with three simple columns: the state abbreviation, the date of the incident, and a count of how many incidents occurred in that state on that day. I also made a state_pop dataframe of state populations as of 2018 by pulling a .csv file from the U.S. Census website and storing it in my R project.
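To give a sense of what that assembly might look like, here is a hedged sketch; the mst_df name, the column names, and the choice to de-duplicate overlapping incidents with pmax are all assumptions for illustration, not the exact cleaning steps used.

# Count incidents per state and day in each source (column names assumed)
state_gva <- total_gv_df %>% count(STATE_ABBR = state, date)
state_mst <- mst_df %>% count(STATE_ABBR = state, date)

# dplyr's SQL-like full join, keeping the larger count where the sources overlap
state_gv.df <- full_join(state_gva, state_mst,
                         by = c("STATE_ABBR", "date"),
                         suffix = c("_gva", "_mst")) %>%
  mutate(n = pmax(n_gva, n_mst, na.rm = TRUE)) %>%
  select(STATE_ABBR, date, n)

# 2018 state populations from a Census .csv stored in the project
# (the file name and column layout here are assumptions)
state_pop <- read.csv("data/state_population_2018.csv") # columns: STATE_ABBR, Population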

Pulling the state populations allows for per-capita normalization of each state. For example, in the final dashboard the state of California has an incredibly large number of mass shooting incidents compared to, say, New Jersey. There are many factors contributing to this, but one glaring reason is that the state is simply larger, with a greater number of residents. Normalizing the mapping visuals gives a rate of prevalence for the matter of interest and provides a much more accurate representation to the user.

Using hcmap, I can choose from a variety of maps available on Highcharts' website and pick “us-all” for all states in the U.S. Combining hcmap, the previously described highcharter-plus-dplyr technique, and shiny render elements lets me make my first map, displaying counts of mass shootings per state. Make note of the new arguments in the hcmap call, including joinBy and various style elements.


renderHighchart({

# Allow for Shiny filtering by start and end dates, group_by() state and tally() for count
  state_gv.df <- state_gv.df %>% 
    filter(date >= input$startdate & date <= input$enddate) %>% 
    group_by(STATE_ABBR) %>% 
    tally()
  
    hcmap("countries/us/us-all", value = "n", data = state_gv.df,
          joinBy = c("hc-a2", "STATE_ABBR"), name = "Mass Shooting Count:",
          dataLabels = list(enabled = TRUE, format = '{point.name}'),
          borderColor = "#FAFAFA", borderWidth = 0.1,
          tooltip = list(valueDecimals = 0, valuePrefix = "", valueSuffix = "")) %>% 
    hc_chart(zoomType = "xy") %>% 
    hc_chart(backgroundColor = "white") %>% 
    hc_legend(enabled = F) %>% 
    # hc_add_theme(hc_theme_db()) %>% 
    hc_title(text = paste0("Number of Gun Violence Incidents: ", substr(floor_date(input$startdate, "month"),1,7), " ",
             " to ", substr(floor_date(input$enddate, "month"),1,7)), align = "left", color = "white") %>% 
    hc_exporting(enabled = TRUE, filename = "GV_Map") %>% 
    hc_credits(enabled = TRUE, text = "Source: https://www.gunviolencearchive.org/, https://www.massshootingtracker.org/") %>% 
    hc_colorAxis(minColor = "white", maxColor = "firebrick", type = "logarithmic")
  
})

An extremely similar code chunk allows for the normalization variant of the map, with the only difference being some simple algebra using the Census data loaded earlier:

renderHighchart({
  state_gv.df <- state_gv.df %>% 
    filter(date >= input$startdate & date <= input$enddate) %>%
    group_by(STATE_ABBR) %>% 
    tally()
  
  state_gv_norm.df <- merge.data.frame(x = state_gv.df, y = state_pop, by = "STATE_ABBR")
  state_gv_norm.df$Normalization <- (state_gv_norm.df$n/state_gv_norm.df$Population) * 100000

  # The hcmap call follows the same pattern as the previous map,
  # swapping value "n" for the per-capita "Normalization" column
  hcmap("countries/us/us-all", value = "Normalization", data = state_gv_norm.df,
        joinBy = c("hc-a2", "STATE_ABBR"), name = "Incidents per 100,000:") %>% 
    hc_colorAxis(minColor = "white", maxColor = "firebrick")

})

GV Normalized Map

Incorporating Open Data Philly

Conveniently, Open Data Philly has made its shooting dataset very easy to extract, and there is no need to call on rvest to scrape anything. A simple read.csv() command will do here since the file is made available on their website. I chose to round the longitude and latitude values to two digits to create better displays on the leaflet map I'll discuss shortly.

# The following data comes from OpenDataPhilly.org

# Dataset load
philly.gv <- read.csv("https://phl.carto.com/api/v2/sql?q=SELECT+*,+ST_Y(the_geom)+AS+lat,+ST_X(the_geom)+AS+lng+FROM+shootings&filename=shootings&format=csv&skipfields=cartodb_id")
philly.gv$date_ <- ymd(philly.gv$date_) # Convert date to proper format

philly.gv$lng <- round(philly.gv$lng,2) # Round longitude and latitude to 2 digits
philly.gv$lat <- round(philly.gv$lat,2)


# Create a continuous palette function for future coloring
pal <- colorNumeric(
  palette = "magma",
  domain = philly.gv$n
)

Using leaflet

So why use another GIS-based package when hcmap worked fine in the last example? For my purposes, highcharter didn't have any of the focused maps I was looking for, and since this part of the dashboard concentrates on specific areas of a city, I wanted to call on leaflet's mapping options instead. There is a very nice vignette describing how to set up leaflet, and for this example I incorporate map tiles thanks to CartoDB.

As with the highcharter maps, the leaflet map uses elements of dplyr to pass data and aesthetic commands in a tidy, succinct manner. Unlike the previous mapping example, where I relied on state names, this leaflet example relies on longitude and latitude values to place data points appropriately. The addProviderTiles command lets me call in the CartoDB map (I chose the “Positron” style here). setView is important because I'm actually loading the entire world map provided by CartoDB but only want to focus on Philadelphia; setView takes specified coordinates to zoom in on the location of interest.

For the aesthetics of the datapoints themselves, I wanted to show prevalence of shooting incidents (recall that the Open Philly dataset is based on all shootings, not exclusively mass shooting incidents) in terms of color and size. addCircles does exactly that, but it took a lot of attempts before I was happy with the actual display. Tinkering with the radius and choosing a square root function provided the best differentiation between coordinates.


renderLeaflet({

# Tally the incident numbers for rounded long and lat values
  philly.gv_round <- philly.gv %>% 
    filter(date_ >= input$startdate2 & date_ <= input$enddate2) %>% 
    group_by(lat, lng) %>% 
    tally()
  
  
  leaflet(data = philly.gv_round) %>% 
    addProviderTiles(providers$CartoDB.Positron) %>% 
    setView(lng = -75.16,
            lat = 39.95,
            zoom = 13) %>% 
    addCircles(lng = ~lng, lat = ~lat, weight = 1,
               radius = ~sqrt(n)*50, 
               popup = ~paste(sep = "<br/>", "<b> Number of Incidents:</b>", as.character(n)), 
               fillColor = ~pal(n), stroke = NA, fillOpacity = 0.8
    ) %>% 
    addLegend("bottomright", pal = pal, values = ~n,
              title = "Shooting Incidents (n)",
              labFormat = labelFormat(prefix = ""),
              opacity = 1
    ) 

})

Philadelphia Map

Making use of datatables with DT

Up until now I have worked almost exclusively with elements of the tidyverse, and to most R users I would recommend sticking with that. There is a great deal of evidence supporting data.table functions for speed and efficiency, especially with large datasets, compared with tidyverse techniques, but that is beyond the scope of this project write-up. The real reason I'm calling in DT is that the package has an extremely convenient and easy-to-use filtering option embedded in its HTML table output. With very little effort I can provide users with a table that they can filter themselves for specific interests. Typically I would recommend users check out the kableExtra package, which has fantastic creative options for pretty HTML tables, but for my purposes the DT vignette was all I needed to get started.

Using the CSV file from Open Data Philly and dplyr syntax to feed the data to DT (I understand the irony here considering my juxtaposition of the two above) involves only a few commands. Making a datatable with filtering elements is really as simple as just calling datatable on your dataframe. The additional elements I’ve included in the call below allow for creation of export buttons, captions, and row shading aesthetic elements.


philly.gv %>% 
  select("Code" = code, "Date" = date_, "Race" = race, "Sex" = sex, "Age" = age,
         "Wound Location" = wound, "Office Involved" = officer_involved, 
         "Offender Deceased" = offender_deceased, 
         "Offender Injured" = offender_injured, "Location" = location,
         "Latino" = latino, "Inside" = inside, "Outside" = outside, 
         "Fatal" = fatal) %>% 
  DT::datatable(class = 'cell-border compact hover stripe', 
                caption = htmltools::tags$caption(
                  style = 'caption-side: bottom; text-align: center;',
                  'Table 1: ', htmltools::em('Shooting incidents in Philadelphia, PA.')
                ),
                extensions = 'Buttons', options = list(
                  dom = 'Bfrtip',
                  buttons = c('csv')
                )
  )

The Final Product

There are a few more elements I included in the final dashboard including demographic charts, frequency plots, and a helpful user guide for anyone who interacts with the dashboard. All of these take on elements and techniques described previously and are available in the GitHub repository for anyone’s reference and use.

Now for the grand reveal (warning: it may take a minute to load and you will want to zoom out in your browser):

Final Words and Conclusions

Thank you for taking the time to read this post; it is the culmination of a lot of work and effort to pick up new skills and techniques and to hone some that I use regularly. I've found a unique passion for looking into data as it relates to social justice, epidemiology, and public information, as these are topic areas of profound and increasing importance today. As stated at the beginning of this post, what is presented here is merely the facts and the data. What users and viewers choose to take from this is up to their own individual discretion.

One of the most interesting parts of creating this dashboard was truly realizing the relative newness of the data science field in its current form. Guns existed well before 2013, but only since the advent of the digital era have we begun to record and display data related to them in this way. On one hand it is disappointing to realize that detailed information is either too massive in scope for one person to collect (i.e. it would be a massive undertaking to get “Open Data ___” from every U.S. city), may have yet to be digitized accessibly, or may simply be lost to time. On the other hand, the effects of social change show how even the definition of “mass shooting” is constantly in flux and may very well force this dashboard to change in the future.

Constructing this dashboard also incentivized me to think more about data as a way to tell a story. Whatever your views or reservations are for or against gun control and policy, the dashboard still shows objective conclusions:

  • U.S. mass shootings saw a steady increase from 2013 to 2016
  • Summer months consistently see an increase in the number of mass shootings in the U.S.
  • Northern Philadelphia sees more shooting incidents than any other part of the city

If I were to take a stance on anything in this write-up, it would be advocacy for data literacy. To the best of my ability, I will always work to keep the data I display objective, pure, and fair. I hope that everyone who uses this product, or any that I create, does the due diligence to educate themselves on what is presented and also alerts me to anything I might have gotten wrong. As always, feel free to contact me with comments, thoughts, or suggestions.

Richard Hanna
Biomedical Engineer and Data Scientist