Importing XLSX and CSV into R

Dataframe Manipulation Using ‘rio’, ‘data.table’, and ‘plyr’ 📇


An Homage to the Spreadsheet

“If I have seen further it is by standing on the shoulders of giants.”

There is a considerable amount of debate over the origins of this quote (which happens to be one of my all-time favorites), but for the sake of simplicity let’s agree it was popularized by Isaac Newton. The significance of this phrase lies in recognizing that none of the advancements we enjoy today would be possible without the monumental achievements of those who came before us. But what does this have to do with spreadsheets?

In Defense of the Spreadsheet

If you talk to most people in the data science community and drop the word “spreadsheet,” you’ll be met with mixed feelings ranging from nostalgia to borderline animosity. Why such polarizing feelings? I recently listened to DataCamp’s DataFramed podcast in which the host of the show, Hugo Bowne-Anderson, and Jenny Bryan discuss the pluses and minuses of the spreadsheet and share thoughts on its place in the past, present, and future of data science. They did an incredible job and I highly recommend checking out both the episode and the podcast in general if you haven’t already.

To echo some of their conclusions, the spreadsheet has proven to be an incredibly accessible, maneuverable, and friendly instrument with a low barrier to entry for individuals of various backgrounds. The fact that most computers, especially Windows machines, come readily equipped with the Microsoft suite has made spreadsheet programs like Excel a fixture in most modern-day workflows. And if you want evidence of the spreadsheet’s continued momentum, you need only look to Google’s free Google Sheets platform, a comprehensive cloud-based system that many individuals, especially students, use regularly. If I had to highlight the two most significant strengths that spreadsheets offer, they would be:

  • The ability to see your data and manipulate it simply by using a mouse
  • The ability to store multiple elements, tables, or even pictures on a single plane

In Opposition of the Spreadsheet

And what about the pitfalls? In my field of work, I collaborate with individuals of many different backgrounds and find that someone with a STEM background can often be in the minority. If I had a dollar for every time I heard “we need a log for this” and the log of choice was a spreadsheet…I’d still have a lot of student loans to pay off, but it might take a chunk out of them. I’m notoriously begrudging of spreadsheets on my team for a few main reasons:

  • They are not reproducible
  • They are extremely prone to human error
  • Version control can turn into a nightmare
  • Conditional formatting becomes a crutch and cannot translate into other mediums
  • (Solely my opinion, but) by the time you learn the advanced VBA tools for different techniques, you should probably be asking yourself why you’re not just using Python, R, etc.

And yet, with all that being said, I want to stress maybe the most important point of all:

SPREADSHEETS ARE NOT GOING ANYWHERE

As much as I would love to wake up in a world where Excel didn’t matter, the truth is that I’m not even sure I could fully go without relying on it to some degree. And the vast amount of data that currently exists in spreadsheets coupled with the ability of people from various backgrounds to navigate them means that instead of fighting them… we have to learn to work with them.

An XLS, a CSV, and an R Script Walk into a Bar

I want to start this off by saying that there may be better ways of executing what I am about to walk through, and if you know of any please let me know in the comments section! I am fully capable of admitting when I am wrong or when there is a better way of completing a task.

By the end of this post our goal is to:

  • Read in a list of .xlsx files
  • Convert them to .csv format
  • Convert the data into both combined and separated R dataframes

So let’s get started! The first thing to do is install the following packages (if you don’t have them already) and load them:

install.packages(c("rio", "data.table", "plyr"))

library("rio")
library("data.table") 
library("plyr")

You may receive error messages or prompts about package versions or formats you need to install, especially for rio; just follow the prompts accordingly. rio bills itself as a “Swiss Army knife” for streamlined data import and export.
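
To give a quick taste before we scale up, rio’s import and export functions infer the file format from the file extension, so a single call covers .xlsx, .csv, and many other formats. The file names below are hypothetical, purely for illustration:

# rio infers the format from the extension ("example.xlsx" and
# "example.csv" are hypothetical file names for illustration)
df <- import("example.xlsx")  # read an Excel sheet into a dataframe
export(df, "example.csv")     # write it back out as a csv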

To show you the files I will be working with, I am using de-identified chest-compression CPR data in a folder on my desktop that comes pre-formatted with identical column headers.

[Screenshot: the .xlsx files in the folder]

[Screenshot: the identical column headers in each .xlsx file]

Next I’m going to establish a variable designated as “folder”, assign the folder’s file path to it, and set it as the working directory so that bare file names resolve correctly. Then I’m going to have R detect files matching a “.xlsx” pattern using the dir function and assign the result to a variable called “xlsx”. The mapply function is an extremely powerful tool for applying a function across pairs of elements, and we use it here to take advantage of rio’s convert function, which reads each identified .xlsx file and writes out a .csv counterpart.

folder <- "D:/Coding/R/Blogdown/CSV_Post/2018_July/"
setwd(folder) # work from inside the data folder so bare file names resolve

xlsx <- dir(pattern = "\\.xlsx$")
created <- mapply(convert, xlsx, gsub("\\.xlsx$", ".csv", xlsx))

This next line of code is to be executed only if you are totally OK with deleting the .xlsx files. Otherwise, I recommend saving a copy of them elsewhere first, as sketched just below.

unlink(xlsx) # delete xlsx files
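
If you’d rather keep the originals, a minimal sketch like the following copies them to a backup folder before deleting (the “xlsx_backup” folder name is just my placeholder):

# Copy the originals to a hypothetical backup folder before deleting
dir.create("xlsx_backup", showWarnings = FALSE)
file.copy(xlsx, file.path("xlsx_backup", xlsx))
unlink(xlsx) # now safe to delete the working copies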

Next let’s create a list of all .csv files in the folder:

file_list <- list.files(path = folder, pattern = "\\.csv$")

Now I’m going to show you two ways of constructing dataframes. The first is the loop below, which assigns each individual .csv to its own dataframe. In this for loop we are telling R to step through the files in the folder and, for each file, read the .csv data and assign it to a dataframe named after the file.

for (i in seq_along(file_list)){
  # Read each csv and assign it to a dataframe named after the file
  assign(file_list[i],
         read.csv(paste0(folder, file_list[i])))
}

Result:

[Screenshot: each .csv loaded as its own dataframe in the environment]
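
One wrinkle with this approach is that the object names keep the “.csv” extension. If you’d rather drop it, a small variant of the same loop strips the extension before assigning:

# Variant: strip ".csv" so the dataframe names are cleaner
for (i in seq_along(file_list)){
  assign(sub("\\.csv$", "", file_list[i]),
         read.csv(paste0(folder, file_list[i])))
}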

However, what if we don’t want a hundred dataframes in our environment? What if instead we want to capture all of this data under a single dataframe? To accomplish this I’m going to make a few choices to help us keep track of what data belongs to which file despite being nested under the same dataframe roof.

First we’re going to define a function called csv_file_read, which reads in a .csv file and appends a “Source” column recording the name of the file each row came from. Then we’ll use ldply from the plyr package, much like our earlier use of mapply, to apply the function across the file list and bind the results into a single dataframe called “CPR_Metrics”. This leaves us with a “Source” column whose values still carry the “.csv” file extension, so we’ll strip that portion off using sub from base R.

# Read a csv file and create an ID column, "Source", based on the file name
csv_file_read <- function(file_name){
  ret <- read.csv(file_name) # bare name resolves because we setwd() into the folder
  ret$Source <- file_name
  ret
}

# Combine all csv files in the folder into one dataframe containing the "Source" column
CPR_Metrics <- ldply(file_list, csv_file_read)

# Remove the ".csv" portion of the file name identifiers
CPR_Metrics$Source <- sub("\\.csv$", "", CPR_Metrics$Source)

Here is what the data now looks like with the properly formatted “Source” column in place, showing which dataset each row belongs to:

[Screenshot: CPR_Metrics with the “Source” column identifying each row’s file of origin]
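
For completeness: we loaded data.table at the start but haven’t touched it yet, so here is a minimal alternative sketch of the same combine step using fread and rbindlist. The idcol argument fills in the “Source” labels from the list names, and CPR_Metrics_dt is just my placeholder name:

# Alternative with data.table: fread() reads each csv, rbindlist() stacks
# them, and idcol = "Source" labels each chunk by its list name
csv_list <- lapply(file_list, fread)
names(csv_list) <- sub("\\.csv$", "", file_list)
CPR_Metrics_dt <- rbindlist(csv_list, idcol = "Source")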

As a final sanity check, let’s make sure the right elements exist in the “Source” column by using the unique function from base R to see what unique values exist in CPR_Metrics$Source:

# Vector of all unique IDs in the CPR_Metrics$Source column
Source_IDs <- unique(CPR_Metrics$Source)

[Screenshot: the unique Source IDs]
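
For one more layer of confidence, base R’s table function counts how many rows each source file contributed:

# Row counts per source file, as an extra sanity check
table(CPR_Metrics$Source)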

Conclusion

I hope you found this post informative and helpful! Reading, writing, and working with spreadsheets can be a time-consuming and daunting process, but with some tinkering in R you can streamline the workflow and house everything to your comfort level. Feel free to leave thoughts, suggestions, or comments below, and thank you for reading!
