Introduction - Why We’re Doing This
I’m at a bit of a middle ground when it comes to my tech. I feel like there are tiers of “tech-ascension” that range from “you still have a flip phone” to “half the items you’re wearing have a digital output.” This isn’t to speak highly or ill of anyone who falls within this range or at either extreme; it just sets up part of the reason this post exists. I place myself at the point of “I have a relatively new iPhone but have not vaulted over to an Apple Watch,” so the data streams I’m going to talk about in this post revolve around what the Apple Health app (in conjunction with a few supporting players) can give you. An Apple Watch, Fitbit, or other health tracker will provide richer and more robust data, but since I already have the phone and it’s already recording data, this will be a fine place to start!
For the past few months I’ve been tapping into the Apple Health app to see what I could track about myself and my habits, and what information I could glean from it.
Apple Health tracks a few things on its own by tapping into the sensors built into your iPhone. Components like the internal accelerometer and gyroscope help track steps and flights of stairs climbed. For other data like calorie tracking and fitness records, the app integrates well with many third-party applications. I use Lifesum to track my meals, caloric intake, and nutritional content, and Runkeeper to track my workouts and the associated kcals burned.
You can find a few different ways of exporting Apple Health data, the most apparent being the “Export Health Data” feature under the user profile. This will bundle up the data and give you a ZIP archive containing an XML file. However, I found this a bit cumbersome to use and wanted something friendlier to folks like me who are more apt to use R. I highly recommend visiting Ryan Praski’s blog post on this process, as I learned quite a bit from it. But to avoid processing an XML file, I turned to an app called QS Access (“Quantified Self”), which I could NOT recommend more. It’s nothing flashy, but it does exactly what we need it to do: ship a conveniently labeled, easy-to-read CSV file containing all of the data from our three source streams (Lifesum, Runkeeper, and the iPhone itself). You can export either hourly or daily data; in this post we’ll use both, with two things to note:
- The hourly data is a much larger file for the months we’re analyzing (November 2018 – February 2019)
- Lifesum data does not come with an accurate time stamp. Instead “Breakfast,” “Lunch,” “Dinner,” and “Snack” categories all come with a pre-designated time assignment. Therefore it will not be possible to analyze trends of caloric intake on the basis of time of day.
If there’s one thing I’ve come to realize in my brief, budding 1.5-year career in data science, it’s that a large majority of what you’ll end up doing is cleaning data: fewer of the appealing graphics that earn “ooohs” and “aaahs,” and a lot more manipulation, conversion, and pre-processing.
Throughout this process we’ll take advantage of three main R libraries, which I recommend becoming familiar with right away. They make your life easier and are extremely user friendly:
- ggplot2
- lubridate
- dplyr
First let’s take a quick look at what this CSV holds and how it’s structured:
Columns are luckily named with straightforward IDs and even include units. Since I find the column names useful and informative, if a little clunky, I’m going to opt to keep them as is for now. Some minor cleaning will involve date-time standardizing and creating additional columns for usability.
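As a small aside on where those clunky names come from: read.csv() runs headers through make.names() by default, so spaces and parentheses turn into dots. The sketch below assumes the raw CSV header reads something like "Dietary Calories (cal)"; that raw spelling and the two sample rows are my own stand-ins for illustration, not copied from the real export.

```r
# Hypothetical two-row stand-in for the QS Access daily export.
# Header spellings and values are assumptions for illustration only.
csv_text <- 'Start,Finish,Dietary Calories (cal),Steps (count)
2018-11-05,2018-11-06,2150000,8342
2018-11-06,2018-11-07,1980000,10211'

# check.names = TRUE (the default) rewrites headers into syntactic R names
sample_day <- read.csv(text = csv_text)

names(sample_day)  # column names after conversion
str(sample_day)    # column types at a glance
head(sample_day)   # first few rows
```

Passing check.names = FALSE would keep the original headers, at the cost of needing backticks around every column name later on.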
lubridate has been a godsend in allowing for easy date-time conversions and readability. We’re going to apply the ymd_hms() family of functions to convert our QS Access CSV export dates to readable, standard formats. Which member of the family you use depends on the order in which the date/time components appear in your data (ymd(), mdy(), dmy_hms(), and so on).
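To make the "family" idea concrete, here is a minimal sketch of a few of those parsers side by side. The timestamps are made up for the example rather than taken from the export.

```r
library(lubridate)

# Made-up timestamps in three different layouts (not taken from the export)
iso_date   <- ymd("2018-11-05")               # year-month-day
us_date    <- mdy("11/05/2018")               # month-day-year
full_stamp <- ymd_hms("2018-11-05 08:30:00")  # date plus time, UTC by default

# All three describe the same calendar day
iso_date == us_date
as_date(full_stamp) == iso_date

# Once parsed, components are easy to pull out
wday(iso_date, label = TRUE, abbr = FALSE)  # day of week as an ordered factor
hour(full_stamp)                            # hour of day as a number
```

The function name simply spells out the order of the components, so picking the right one is mostly a matter of looking at a few rows of the raw file.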
# QS Access Daily Health Data --------------------------------------------------
library(lubridate)  # date-time parsing
library(dplyr)      # data manipulation and %>%
library(ggplot2)    # plotting

## Data Cleaning ---------------------------------------------------------------
qsd <- read.csv("Health Data Day.csv")
qsd$Start  <- ymd(qsd$Start)   # convert to readable date
qsd$Finish <- ymd(qsd$Finish)  # convert to readable date

# Extract appropriate portions of the date to create individualized columns in
# the dataframe for easier queries later.
qsd$Month     <- format(qsd$Start, "%m")
qsd$Year      <- format(qsd$Start, "%Y")
qsd$Date      <- format(qsd$Start, "%Y-%m-%d")
qsd$DayOfWeek <- wday(qsd$Start, label = TRUE, abbr = FALSE)  # Keep labels, but don't abbreviate

# Keep only rows after 2018-11-04, the start of the usable data
qsd <- subset(qsd, Start > as.Date("2018-11-04"))
qsd$Dietary.Calories..cal. <- qsd$Dietary.Calories..cal. / 1000  # Proper caloric units (cal -> kcal)
And just like that, lubridate has solved an otherwise time-intensive task of converting the date data on our own. One thing you may note here is that I am keeping only dates after November 4, 2018. This is due to a bug in an iOS update where the health data reported the same step count every hour for a two-month period. Luckily I’m not alone, but sadly this data is no longer usable, since I’m positive I’m not a sleepwalker.
Next it’s a simple matter of combining dplyr and the “grammar of graphics,” ggplot2, to yield some initial plots. I’ve spoken in previous posts about ggplot2, and it’s such a ubiquitous library at this point that I won’t get too into it. dplyr is worth mentioning here though, since if you spend any amount of time in R you will run into the piping syntax, %>%. The pipe operator originates in the magrittr package, but dplyr re-exports it and is where most people first meet it; dplyr is another branch of the tidyverse along with ggplot2. If ggplot2 is the grammar of graphics, then think of dplyr as the means by which we send data through the pipeline (the dplyr library is billed as the “grammar of data manipulation”). Piping with %>% chains operations together, condensing code into something that is easier to follow: f(x, y) becomes x %>% f(y). This also allows you to avoid a ton of nested ()s in your code. As an aside, if you begin using dplyr in RStudio, give yourself the gift of CTRL+SHIFT+M as a nice little hotkey for inserting the pipe.
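To see what that buys you, here is a minimal sketch of the same calculation written both ways. The toy data frame is invented for the example and has nothing to do with my actual export.

```r
library(dplyr)

# Toy data, invented for this example
toy <- data.frame(
  day   = c("Mon", "Mon", "Tue", "Tue"),
  steps = c(4000, 6000, 3000, 0)
)

# Nested calls: read from the inside out
nested <- summarise(group_by(filter(toy, steps != 0), day), avg = mean(steps))

# Piped calls: read top to bottom, one step per line
piped <- toy %>%
  filter(steps != 0) %>%
  group_by(day) %>%
  summarise(avg = mean(steps))

# Both spellings produce the same table
all.equal(nested, piped)
```

The piped version does exactly the same work; the gain is purely in how easy it is to read and edit.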
That’s Cool, But I’m Just Here for the Plots
Well fine, let’s get into them then, geez. To see ggplot2 in action, we’re going to create our first two plots showing caloric intake and calories burned through exercise:
# Average Bar Plots ------------------------------------------------------------
# The plot below is a bar plot investigating my average caloric intake by day
# of week, eliminating all days where either I forgot to put in an entry or
# Lifesum didn't connect properly.
avg.cal <- qsd %>%
  select(DayOfWeek, Dietary.Calories..cal.) %>%
  filter(Dietary.Calories..cal. != 0) %>%  # drop days with no logged intake
  group_by(DayOfWeek) %>%
  summarise(avgdcal = mean(Dietary.Calories..cal.)) %>%
  ggplot(aes(x = DayOfWeek, y = avgdcal, fill = DayOfWeek)) +
  geom_col() +
  ggtitle("Average Caloric Intake", subtitle = "By Day of Week") +
  xlab("Day of Week") +
  ylab("Calories consumed (kcal)") +  # values were divided by 1000 during cleaning
  guides(fill = guide_legend(title = "Day of Week")) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Similar to the dietary caloric intake, but this time investigating average
# kcal burned for days I recorded exercise activity.
avg.kcal <- qsd %>%
  select(DayOfWeek, Active.Calories..kcal.) %>%
  filter(Active.Calories..kcal. != 0) %>%  # drop days with no recorded workout
  group_by(DayOfWeek) %>%
  summarise(avgkcal = mean(Active.Calories..kcal.)) %>%
  ggplot(aes(x = DayOfWeek, y = avgkcal, fill = DayOfWeek)) +
  geom_col() +
  ggtitle("Average kCals Burned Through Exercise", subtitle = "By Day of Week") +
  xlab("Day of Week") +
  ylab("Calories Burned (kcal)") +
  guides(fill = guide_legend(title = "Day of Week")) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Here we can see dplyr in action, taking our “quantified self dataset” (qsd) and applying R-ified, SQL-type commands. Functions like filter (or base R’s subset, which does the same job) are very powerful for extracting and subsetting the specific data we want from the larger qsd dataframe. One thing to note: in order to not drag down the average values from these analyses, I dropped all data points equal to zero, where I either did not exercise or did not log my caloric intake. You could consider this an inaccurate display of the data, but I found it much more inaccurate to have all of the zero values weigh down the mean so heavily. Now let’s see those plots!
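To put a number on that choice before looking at the plots, here is a tiny sketch using a made-up week of workout data (these are not my real values):

```r
# Made-up week of logged active calories; 0 marks a day with no entry
active_kcal <- c(450, 0, 520, 0, 0, 480, 0)

mean(active_kcal)                    # zeros included
mean(active_kcal[active_kcal != 0])  # zeros dropped, as in the plots here
```

The first number answers "kcal per calendar day," the second "kcal per logged workout." The plots in this post show the second.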
From here we can begin to extract some meaning and preliminary information. I ingest the most calories on average on Thursday and Friday; this makes sense since my diet tends to be a bit more relaxed towards the end of the week when lunches are catered and dinners are a little less bland. I likely also snack more. On the flipside, if we look at the kcals plot, you will find a large average caloric burn on Sundays…typically when I try and rectify the damage I’ve done over the weekend! You’ll also notice something strange… Saturday is missing. That’s simply because I’ve never gone to the gym on a Saturday since that’s normally my dedicated day off. Whether or not these are contextually meaningful pieces of information is up for debate, but for me personally this is interesting to see and could easily be determined by any individual for their personal information.
Last, before we move on to the hourly data, I think it would be interesting to take a look at the Apple Health step counter. I tend to walk to work a lot; the walk is a little over a mile, one way, from my home. Using the same techniques we can track average steps taken by day of the week, taking note of the drastic change in count between weekends and weekdays:
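A sketch of what "the same techniques" could look like for steps is below. It runs on an invented two-week stand-in for the daily export, borrowing the Steps..count. column name from the hourly code later in this post; treat the numbers and the exact plot styling as assumptions.

```r
library(dplyr)
library(ggplot2)
library(lubridate)

# Invented stand-in for the daily export: two weeks of step counts
toy_days <- data.frame(
  Start = seq(as.Date("2018-11-05"), by = "day", length.out = 14),
  Steps..count. = c(9000, 8500, 9200, 8800, 9100, 3000, 2500,
                    9400, 8700,    0, 9000, 8900, 3500, 2800)
)
toy_days$DayOfWeek <- wday(toy_days$Start, label = TRUE, abbr = FALSE)

avg_steps <- toy_days %>%
  filter(Steps..count. != 0) %>%  # drop days with no recorded steps
  group_by(DayOfWeek) %>%
  summarise(avgsteps = mean(Steps..count.))

step_plot <- ggplot(avg_steps, aes(x = DayOfWeek, y = avgsteps, fill = DayOfWeek)) +
  geom_col() +
  ggtitle("Average Steps Taken", subtitle = "By Day of Week") +
  xlab("Day of Week") +
  ylab("Steps (count)") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# print(step_plot) draws the chart
```

Swapping toy_days for qsd should be all it takes to run this on the real export, assuming the daily file carries the same step column.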
Now, using similar data filtering techniques from before, we can begin to answer questions using the hourly data file produced by QS Access. To visualize this, a “heat map” lets us read three components at once (day of week, time of day, and average number of steps):
# QS Access Hourly Data --------------------------------------------------------
qsh <- read.csv("Health Data Hour.csv")
head(qsh)

qsh$StartDay   <- ymd(qsh$StartDay)    # convert to readable date
qsh$FinishDay  <- ymd(qsh$FinishDay)   # convert to readable date
qsh$StartHour  <- hm(qsh$StartHour)    # hours:minutes -> lubridate Period
qsh$FinishHour <- hm(qsh$FinishHour)

qsh$Month     <- format(qsh$FinishDay, "%m")
qsh$Year      <- format(qsh$FinishDay, "%Y")
qsh$Date      <- format(qsh$FinishDay, "%Y-%m-%d")
qsh$DayOfWeek <- wday(qsh$FinishDay, label = TRUE, abbr = FALSE)

# A Period formats as text such as "8H 0M 0S"; that text serves as the hour label
qsh$Hour <- format(qsh$FinishHour)
qsh$Hour <- factor(qsh$Hour, levels = unique(qsh$Hour))  # Maintain order of time; 'unique' avoids a duplicated-levels error

qsh <- subset(qsh, StartDay > as.Date("2018-11-04"))  # Start of good data
qsh$Dietary.Calories..cal. <- qsh$Dietary.Calories..cal. / 1000  # Proper caloric units (cal -> kcal)

# Since there is no data for the hours between 1AM and 8AM it makes sense to
# remove these rows so geom_tile doesn't display blank spaces (so unsightly!)
night_hours <- paste0(1:7, "H 0M 0S")
qshDay <- qsh[!(qsh$Hour %in% night_hours), ]

step.hm <- qshDay %>%
  select(DayOfWeek, Hour, Steps..count.) %>%
  filter(Steps..count. != 0) %>%  # drop hours with no recorded steps
  group_by(DayOfWeek, Hour) %>%
  summarise(avgsteps = mean(Steps..count.)) %>%
  ggplot(aes(x = DayOfWeek, y = Hour, fill = avgsteps)) +
  geom_tile() +
  scale_fill_continuous(limits = c(0, 3000), breaks = seq(0, 3000, by = 500), type = "viridis") +
  ggtitle("Average Steps Taken", subtitle = "By Day of Week & Time of Day") +
  xlab("Day of Week") +
  ylab("Hour of Day") +
  guides(fill = guide_legend(title = "Average Steps")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Excellent! This also makes a ton of sense. I try to get to work by 8:30 a.m., so it stands to reason that my concentration of steps would be in the light green rectangle you see. My leaving times vary and occasionally I catch a bus, so the afternoon is a bit more spread out.
Data Viz %>% Statistics
I want to touch lightly on some more advanced analytics. Let’s say we want to check on the correlation between caloric intake and saturated fat content. R has built-in stats commands that make it easy to check the correlations and statistical values of different variables. lm, cor.test, and a simple summary are all capable of yielding insights on R, R^2, and p-value statistics. Definitely check out their documentation with the ? command in your console for a more in-depth explanation. Combining these with ggplot2, we can check the correlation of saturated fat intake as a function of caloric intake.
# Linear fit of saturated fat on caloric intake, plus a Pearson correlation test
sat_cal.lm <- lm(Saturated.Fat..g. ~ Dietary.Calories..cal., data = qshDay)
Sat_Cal.Stats <- cor.test(qshDay$Saturated.Fat..g., qshDay$Dietary.Calories..cal., method = "pearson")
Sat_Cal.Stats_Full <- summary(sat_cal.lm)

Sat_Cal.Scatter <- qshDay %>%
  select(Dietary.Calories..cal., Saturated.Fat..g.) %>%
  ggplot(aes(x = Dietary.Calories..cal., y = Saturated.Fat..g.)) +
  geom_point(color = "firebrick1") +
  stat_smooth(method = "lm", se = TRUE) +
  theme_bw() +
  ggtitle("Saturated Fat ~ Caloric Intake", subtitle = "How well correlated are they?") +
  xlab("Dietary Caloric Intake (kcal)") +  # x is calories, matching aes()
  ylab("Saturated Fat Intake (g)") +       # y is saturated fat, matching aes()
  # annotate() draws each label once; geom_text() would redraw it for every row
  annotate("text", x = 500, y = 28, label = paste0("R = ", round(Sat_Cal.Stats$estimate, 3))) +
  annotate("text", x = 510, y = 26, label = paste0("R^2 = ", round(Sat_Cal.Stats_Full$r.squared, 3))) +
  annotate("text", x = 510, y = 24, label = "p < 2.2e-16")
Here we can see a strong positive correlation between the two variables, with a statistically significant p-value of < 2.2e-16. An R^2 of 0.766 means that caloric intake explains roughly 77% of the variance in saturated fat intake. We can now apply this technique to any of the variables we have an interest in! For example, saturated fat intake as a function of protein:
# Response is saturated fat, matching the plot's y-axis
prot_sat.lm <- lm(Saturated.Fat..g. ~ Protein..g., data = qsh)
Sat_Prot.Stats <- cor.test(qsh$Protein..g., qsh$Saturated.Fat..g., method = "pearson")
Sat_Prot.Stats_Full <- summary(prot_sat.lm)

Sat_Prot.Scatter <- qsh %>%
  select(Protein..g., Saturated.Fat..g.) %>%
  ggplot(aes(x = Protein..g., y = Saturated.Fat..g.)) +
  geom_point(color = "firebrick1") +
  stat_smooth(method = "lm", se = TRUE) +
  theme_bw() +
  ggtitle("Saturated Fat Intake ~ Protein", subtitle = "How well correlated are they?") +
  xlab("Protein Intake (g)") +
  ylab("Saturated Fat Intake (g)") +
  # annotate() draws each label once; geom_text() would redraw it for every row
  annotate("text", x = 10, y = 27, label = paste0("R = ", round(Sat_Prot.Stats$estimate, 3))) +
  annotate("text", x = 11, y = 25, label = paste0("R^2 = ", round(Sat_Prot.Stats_Full$r.squared, 3))) +
  annotate("text", x = 11, y = 23, label = "p < 2.2e-16")
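Both scatter plots type the p-value label in by hand. As a sketch of an alternative, the three numbers can be pulled straight out of the objects that lm, summary, and cor.test return. The two vectors below are invented for illustration and are not my logged data.

```r
# Made-up paired measurements, for illustration only
protein_g <- c(5, 12, 20, 8, 30, 15, 25, 10)
sat_fat_g <- c(2, 5, 9, 3, 14, 6, 11, 5)

fit     <- lm(sat_fat_g ~ protein_g)
fit_sum <- summary(fit)
ct      <- cor.test(protein_g, sat_fat_g, method = "pearson")

r_value   <- unname(ct$estimate)  # Pearson's r
r_squared <- fit_sum$r.squared    # share of variance explained by the fit
p_value   <- ct$p.value           # p-value for the test that r is zero

# In a one-predictor regression, R^2 is simply r squared
all.equal(r_value^2, r_squared)

# format.pval() turns the p-value into display text instead of hard-coding it
p_label <- paste("p:", format.pval(p_value, digits = 3))
```

A label built this way can be handed to the plot in place of the typed-in string, so it stays correct if the underlying data changes.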
Wrapping Things Up
In summary, this post walked through the following:
- Extracting data from Apple Health and supporting applications
- ggplot2 to produce beautiful graphics in a simple, easy manner
- lubridate to make date-time data points easy to process
- dplyr for data synthesizing and manipulation
- Basic statistical methods for analyzing correlations between data points
I hope you enjoyed following this post, and I welcome any criticism, suggestions, or feedback! If there’s another thing I’ve learned about R, it’s that there are plenty of ways to do the same thing, often with each new method a little better than the last. I plan to expand on this topic in the future and begin exploring other datasets with greater application to more advanced methods. Thanks for taking the time to read this; I hope you learned a thing or two!