Introduction - Why We’re Doing This
Iâm at a bit of a middle ground when it comes to my tech. I feel like there are tiers of âtech-ascensionâ and they range from âyou still have a flip phoneâ to âhalf the items youâre wearing have a digital output.â This isnât to speak highly or ill of anyone who falls within this range or at either side of the extremes, it just sets up part of the reason for why this post exists. I place myself at the point of âI have a relatively new iPhone but have not vaulted over to an Apple Watch and so the data streams Iâm going to talk about in this post revolve around the data that the Apple Health app (in conjunction with a few supporting players) can give you. An Apple Watch, Fitbit, or other health tracker will provide richer and more robust data, but since I already have the phone and itâs already recording data this will be a fine place to start!
For the past few months I decided to tap into the Apple Health app and see what I could track about myself, my habits, and glean some information.
Additional Software
Apple Health tracks a few things on its own by tapping into various mechanical features in your iPhone. Components like the internal accelerometer and gyroscope help track steps and flights of stairs climbed. For other external features like calorie trackers and fitness records the app has great integration with many other third party applications. For these, I use Lifesum to track my meals and caloric intake, as well as content, and Runkeeper to track my workouts and associated kcals burned.
You can find a few different ways of exporting Apple Health data, the most apparent one being the âExport Health Dataâ feature under the user profile. This will bundle up the data and send you a ZIP HTML file. However, I found this a bit cumbersome to use and wanted something even more user-friendly to folks like me more apt to use R. I highly recommend visiting Ryan Praski’s blogpost on this process as I learned quite a bit from it. But to mitigate the need for processing an HTML file, I turned to an app called QS Access (âQuantified Selfâ) which I could NOT recommend more. Itâs nothing flashy, but it does exactly what we need it to do: ship a conveniently labeled, easy to read CSV file containing all of the data from our three source streams (Lifesum, Runkeeper, and the iPhone itself). You can also export to hourly or daily data, in this post weâll use both with noting two things:
- The hourly data is a much larger file for the months weâre analyzing (November 2018 â February 2019)
- Lifesum data does not come with an accurate time stamp. Instead âBreakfast,â âLunch,â âDinner,â and âSnackâ categories all come with a pre-designated time assignment. Therefore it will not be possible to analyze trends of caloric intake on the basis of time of day.
Data Cleaning
If thereâs one thing Iâve come to realize in my brief 1.5 year budding career in data science, itâs that a large majority of what youâll end up doing is cleaning data. Not as much of the more appealing graphics warranting âooohsâ and âaaahsâ but generally a lot more manipulation, conversion, and pre-processing.
Throughout this process weâll take advantage of three main R-libraries which I recommend becoming familiar with right away, they make your life easier and are extremely user friendly:
dplyr
ggplot2
lubridate
First let’s take a quick look at what this CSV holds and how it’s structured:
Columns are luckily named with straightforward IDs and even include units. Since I find the column names useful and informative, if a little clunky, I’m going to opt to keep them as is for now. Some minor cleaning will involve date-time standardizing and creating additional columns for usability.
lubridate
has been a godsend in allowing for easy date-time conversions and readability. Weâre going to apply lubridate
âs ymd_hms()
command family to convert our QS Access csv export dates to readable, standard formats. This family of commands can change based on whatever order you encounter your date/time data.
# QS Access Daily Health Data --------------------------------------------------
## Data Cleaning ---------------------------------------------------------------
qsd <- read.csv("Health Data Day.csv")
qsd$Start <- ymd(qsd$Start) # convert to readable date
qsd$Finish <- ymd(qsd$Finish) # convert to readable date
# Extract appopriate portions of the date to create individualized columns in the
# dataframe for easier queries later.
qsd$Month<-format(qsd$Start,"%m")
qsd$Year<-format(qsd$Start,"%Y")
qsd$Date<-format(qsd$Start,"%Y-%m-%d")
qsd$DayOfWeek <-wday(qsd$Start, label=TRUE, abbr=FALSE) # Keep labels, but don't abbreviate
qsd <- subset.data.frame(qsd, qsd$Start > '2018-11-04')
qsd$Dietary.Calories..cal. <- qsd$Dietary.Calories..cal./1000 # Proper caloric units
And just like that, lubridate
has solved an otherwise time-intensive task of converting the date data on our own. One thing you may or may not note here is that I am specifying dates only beyond November of 2018. This actually is due to a bug in an iOS update where the health data reported the same step count every hour for a two month period. Luckily I’m not alone but sadly this data is no longer usable since I’m positive I’m not a sleepwalker.
Next itâs a simple matter of combining dplyr
and the âgrammar of graphics,â ggplot2
, to yield some initial plots. Iâve spoken in previous posts about ggplot2
and itâs such a universally ubiquitous library at this point that I wonât get too into it. dplyr
is worth mentioning here though, since if you spend any amount of time in R you will run into the unique piping syntax, %>%
.
âPipingâ is unique to dplyr
, the library being another branch of the tidyverse
along with ggplot2
. If ggplot2
is the grammar of graphics, then think of dplyr
as the means by which we send data through the pipeline (the dplyr
library is coined as the âgrammar of data manipulationâ. Piping and the %>%
syntax allows for a chaining of data to condense code that is easier to follow and saves computation: f(x,y) becomes x %>% f(y). This also allows you to avoid a ton of nested ()
s in your code. As an aside, if you begin using dplyr
give yourself the gift of CTRL+SHIFT+M
as a nice little hotkey for making the piping syntax.
That’s Cool, But I’m Just Here for the Plots
Well fine, let’s get into them then, geez. To see dplyr
and ggplot2
in action, weâre going to create our first two plots showing caloric intake and calories burned through exercise:
# Average Bar Plots ------------------------------------------------------------
# The plot below is a bar plot investigating the average caloric intake I have
# By day of week on average, eliminating all days where either I forgot to put in
# an entry or Lifesum didn't connect properly.
avg.cal <- qsd %>%
select(DayOfWeek, Dietary.Calories..cal.) %>%
group_by(DayOfWeek) %>%
subset(Dietary.Calories..cal. != 0) %>%
summarise(avgdcal = mean(Dietary.Calories..cal.)) %>%
ggplot(aes(x=DayOfWeek, y = avgdcal, fill = DayOfWeek)) +
geom_bar(stat = 'identity') + ggtitle("Average Caloric Intake", subtitle = "By Day of Week") +
xlab("Day of Week") + ylab("Calories consumed (cal)") +
guides(fill=guide_legend(title="Day of Week")) + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Similar to the dietary coloric intake, but this time investigating average kcal
# burned for days I recorded excercise activity.
avg.kcal <- qsd %>%
select(DayOfWeek, Active.Calories..kcal.) %>%
group_by(DayOfWeek) %>%
subset(Active.Calories..kcal. != 0) %>%
summarise(avgkcal = mean(Active.Calories..kcal.)) %>%
ggplot(aes(x=DayOfWeek, y = avgkcal, fill = DayOfWeek)) +
geom_bar(stat = 'identity') +
ggtitle("Average kCals Burned Through Exercise" , subtitle = "By Day of Week") +
xlab("Day of Week") + ylab("Calories Burned (kcal)") +
guides(fill=guide_legend(title="Day of Week")) + theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Here we can see dplyr
in action, taking our “quantified self dataset” (qsd
) and applying R-ified, SQL-type commands. Functions like select
, group_by
, and filter
are very powerful for extracting and subsetting the specific data we want from the larger qsd
dataframe. One thing to note, in order to not drag down the average values from these analyses, I dropped all data points equal to zero where I either did not exercise or did not log in my caloric intake data. You could consider this an inaccurate display of the data, but I found it much more inaccurate to have all of the zero values weigh down the mean so heavily. Now letâs see those plots!
From here we can begin to extract some meaning and preliminary information. I ingest the most calories on average on Thursday and Friday; this makes sense since my diet tends to be a bit more relaxed towards the end of the week when lunches are catered and dinners are a little less bland. I likely also snack more. On the flipside, if we look at the kcals plot, you will find a large average caloric burn on SundaysâŚtypically when I try and rectify the damage Iâve done over the weekend! Youâll also notice something strange⌠Saturday is missing. Thatâs simply because Iâve never gone to the gym on a Saturday since thatâs normally my dedicated day off. Whether or not these are contextually meaningful pieces of information is up for debate, but for me personally this is interesting to see and could easily be determined by any individual for their personal information.
Last, before we move on to the hourly data, I think it would be interesting to take a look at the Apple Health step counter. I tend to walk to work a lot, the average walk is a little over a mile, one way, from my home. Using the same techniques we can track average steps taken by day of the week, taking note of the drastic change in count between weekends and weekdays:
Now, using similar data filtering techniques from before, we can begin to answer questions using the hourly data file produced by QS Access. To visualize this, a âheat mapâ can give us interpretability of three components: day of week, time of day, and number of average steps:
# QS Access Hourly Data --------------------------------------------------------
qsh <- read.csv("Health Data Hour.csv")
head(qsh)
qsh$StartDay <- ymd(qsh$StartDay) # convert to readable date
qsh$FinishDay <- ymd(qsh$FinishDay) # convert to readable date
qsh$StartHour <- hm(qsh$StartHour)
qsh$FinishHour <- hm(qsh$FinishHour)
qsh$Month<-format(qsh$FinishDay,"%m")
qsh$Year<-format(qsh$FinishDay,"%Y")
qsh$Date<-format(qsh$FinishDay,"%Y-%m-%d")
qsh$DayOfWeek <-wday(qsh$FinishDay, label=TRUE, abbr=FALSE)
qsh$Hour <-format(qsh$FinishHour,"%H")
qsh$Hour <- factor(qsh$Hour, levels = unique(qsh$Hour)) # Maintain order of time, 'unique' is important to avoid duplication error
qsh <- subset.data.frame(qsh, qsh$StartDay > '2018-11-04') # Start of good data
qsh$Dietary.Calories..cal. <- qsh$Dietary.Calories..cal./1000 # Proper caloric units
# Since there is no data for the hours between 1AM and 8AM it makes sense to
# remove these rows so geom_tile doesn't display blank spaces (so unsightly!)
qshDay <- qsh
qshDay <- qshDay[!(qshDay$Hour == "1H 0M 0S" |
qshDay$Hour == "2H 0M 0S" | qshDay$Hour == "3H 0M 0S" |
qshDay$Hour == "4H 0M 0S" | qshDay$Hour == "5H 0M 0S" |
qshDay$Hour == "6H 0M 0S" | qshDay$Hour == "7H 0M 0S"),]
step.hm <- qshDay %>%
select(DayOfWeek, Hour, Steps..count.) %>%
group_by(DayOfWeek, Hour) %>%
subset(Steps..count. != 0) %>%
summarise(avgsteps = mean(Steps..count.)) %>%
ggplot(aes(x=DayOfWeek, y = Hour, fill = avgsteps)) + geom_tile() +
scale_fill_continuous(limits=c(0, 3000), breaks=seq(0,3000,by=500), type = "viridis") +
ggtitle("Average Steps Taken" , subtitle = "By Day of Week & Time of Day") +
xlab("Day of Week") + ylab("Hour of Day") +
guides(fill=guide_legend(title="Average Steps")) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Excellent! This also makes a ton of sense. I try and get to work by 08:30a.m. so it would stand to reason that my concentration of steps would be in the light green rectangle you see. My leaving times vary and occasionally I catch a bus, so the afternoon is a bit more spread out.
Data Viz %>% Statistics
I want to touch lightly on some more advanced analytics. Letâs say we want to check on the correlation between caloric intake and saturated fat content. R has built-in stats commands that make it easy to check the correlations and statistical values of different variables. lm
, cor.test
, and a simple summary
are all capable of yielding insights on R, R^2, and p-value statistics. Definitely check out their documentation with the ?
command in your console for a more in-depth explanation.
Then, using ggplot2
, we can check the correlation of saturated fat intake as a function of caloric intake.
sat_cal.lm <-lm(qshDay$Saturated.Fat..g. ~ qshDay$Dietary.Calories..cal.)
Sat_Cal.Stats <- cor.test(qshDay$Saturated.Fat..g., qshDay$Dietary.Calories..cal., method = 'pearson')
Sat_Cal.Stats_Full <- summary(lm(qshDay$Saturated.Fat..g. ~ qshDay$Dietary.Calories..cal.))
Sat_Cal.Scatter <- qshDay %>%
select(Dietary.Calories..cal., Saturated.Fat..g.) %>%
ggplot(aes(x = Dietary.Calories..cal., y = Saturated.Fat..g.)) + geom_point(color = 'firebrick1') +
stat_smooth(method="lm", se=TRUE) + theme_bw() +
ggtitle("Caloric Intake ~ Saturated Fat" , subtitle = "How well correlated are they?") +
xlab("Saturated Fat Intake (g)") + ylab("Dietary Caloric Intake (kcal)") +
geom_text(x = 500, y = 28, label = paste0("R = ", round(Sat_Cal.Stats$estimate,3))) +
geom_text(x = 510, y = 26, label = paste0("R^2 = ", round(Sat_Cal.Stats_Full$r.squared,3))) +
geom_text(x = 510, y = 24, label = paste0("p < 2.2e-16"))
Here we can see a high correlation between the two variables, with a statistically significant p-value of <2.2e-16. An R^2 of 0.766 also indicates a positive correlation. We can now apply this technique to any of the variables we have an interest in! For example, saturated fat intake as a function of protein:
prot_sat.lm <-lm(qsh$Protein..g. ~ qsh$Saturated.Fat..g.)
Sat_Prot.Stats <- cor.test(qsh$Protein..g., qsh$Saturated.Fat..g., method = 'pearson')
Sat_Prot.Stats_Full <- summary(lm(qsh$Protein..g. ~ qsh$Saturated.Fat..g.))
Sat_Prot.Scatter <- qsh %>%
select(Protein..g., Saturated.Fat..g.) %>%
ggplot(aes(x = Protein..g., y = Saturated.Fat..g.)) + geom_point(color = 'firebrick1') +
stat_smooth(method="lm", se=TRUE) + theme_bw() +
ggtitle("Saturated Fat Intake ~ Protein" , subtitle = "How well correlated are they?") +
xlab("Protein Intake (g)") + ylab("Saturated Fat Intake (g)") +
geom_text(x = 10, y = 25, label = paste0("R = ", round(Sat_Prot.Stats$estimate,3))) +
geom_text(x = 11, y = 27, label = paste0("R^2 = ", round(Sat_Prot.Stats_Full$r.squared,3))) +
geom_text(x = 11, y = 23, label = paste0("p < 2.2e-16"))
Wrapping Things Up
In summary, this post walked through the following:
- Extracting data from Apple Health and supporting applications
- Using
ggplot2
to produce beautiful graphics in a simple, easy manner - Applying
lubridate
to make date-time data points easy to process - Incorporating
dplyr
for data synthesizing and manipulation - Basic statistical methods for analyzing correlations between data points
I hope you enjoyed following this post, and I welcome any criticism, suggestions, or feedback! If thereâs another thing Iâve learned about R itâs that thereâs plenty of ways to do the same thingâŚoften times with progressive methods typically better than the last. I plan to expand on this topic in the future and begin exploring other datasets with greater application to more advanced methods. Thanks for taking the time to read this, hope you learned a thing or two!