When looking at time series data, it can be useful to consolidate high-frequency data into lower-frequency increments. At first this sounds contradictory to common practice. After all, isn’t more data always better? It turns out this isn’t always the case, and this is where a rolling average (also known as a moving average) can be helpful.
Note that this tutorial only covers use cases where time series values are of interest.
Consider the following use cases:
- Data is received at minute-by-minute intervals, but only daily values are needed in the end result
- There are millions of data points taken at millisecond frequency, and the dataset must be reduced before it can be plotted
- Observations span years, but interest is only in monthly occurrences
These examples are more common than expected in the real world. Why plot millions of datapoints in a time series when your eyes can only process a fraction of them in the final output? Not to mention the sheer graphics overload placed as an unnecessary burden on the computer itself when rendering in packages like
A rolling average is a useful way to address this, and there are many resources describing how to compute one. This tutorial is only one interpretation. The following libraries will be used:
library(dplyr)     # To utilize tidy grammar and piping, i.e. %>%
library(lubridate) # To make use of the floor_date() function
Create Example Time Series
Other areas of this website go over the use of ggplot2 and how lubridate can be a godsend for working with dates in R. The main focus here is using these in combination with the floor_date function. floor_date (which has a logical counterpart in ceiling_date) takes date/time inputs and rounds them down to a specified unit. Suppose a vector of timestamps is set up in 1-second increments from 0 to 120 seconds, and floor_date is applied to it.
example <- seq(from = ymd_hms("2020-01-01 00:00:00"),
               to   = ymd_hms("2020-01-01 00:02:00"),
               by   = "sec")
example
Now apply the floor_date command in conjunction with dplyr’s piping operator and observe how the data changes:
example %>% floor_date(unit = "min")
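All of the data occurring before the first minute mark gets rounded down to 00:00:00, and all of the data from the first minute mark up to (but not including) the second minute mark gets rounded down to 00:01:00. The final value, 00:02:00, already sits on a minute boundary and is left unchanged.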
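As a quick illustration of the rounding behaviour, the sketch below counts how many of the 121 timestamps land in each floored minute. The names floored and bucket_counts are introduced here purely for illustration. It also shows ceiling_date rounding a single timestamp up rather than down.

```r
library(lubridate)

# Same 121 timestamps as the example: 00:00:00 through 00:02:00 in 1-second steps
example <- seq(from = ymd_hms("2020-01-01 00:00:00"),
               to   = ymd_hms("2020-01-01 00:02:00"),
               by   = "sec")

# Floor every timestamp to the minute, then count how many fall in each bucket
floored <- floor_date(example, unit = "min")
bucket_counts <- as.integer(table(floored))  # illustrative helper name
bucket_counts
# 60 60 1

# ceiling_date rounds up to the next minute instead
ceiling_date(ymd_hms("2020-01-01 00:00:30"), unit = "min")
```

The first two buckets each hold 60 seconds of data, while the last holds only the single 00:02:00 timestamp.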
Apply Rolling Average
This isn’t very useful yet. After all, the only thing that has happened is that any useful differentiation in the example vector has been eliminated. This time, pair the time series with random draws from a normal distribution, extend the time interval to an hour while keeping the 1-second increment, and assign the result to a data frame:
date.time <- seq(from = ymd_hms("2020-01-01 00:00:00"),
                 to   = ymd_hms("2020-01-01 01:00:00"),
                 by   = "sec")
normaldist <- rnorm(n = length(date.time), mean = 100, sd = 5)
df <- data.frame(date.time, normaldist)
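Because rnorm draws new random values on every run, the numbers will differ from one session to the next. If reproducible output is wanted, one option is to fix the random seed first. The sketch below assumes an arbitrary seed of 42, which is not part of the original example, and also confirms the size of the resulting data frame.

```r
library(lubridate)

set.seed(42)  # assumed, arbitrary seed; any fixed integer gives repeatable draws

date.time <- seq(from = ymd_hms("2020-01-01 00:00:00"),
                 to   = ymd_hms("2020-01-01 01:00:00"),
                 by   = "sec")
normaldist <- rnorm(n = length(date.time), mean = 100, sd = 5)
df <- data.frame(date.time, normaldist)

nrow(df)     # 3601 rows: one per second from 00:00:00 to 01:00:00 inclusive
head(df, 3)  # peek at the first few observations
```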
With dplyr’s litany of tidy functions, averaging can be specified in a very intuitive manner. Below, observe how the date.time data stream is consolidated and the normaldist data stream is redefined as the mean of its original values within each minute:
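rolling_avg.df <- df %>%
  # Round each timestamp down to its minute and group on the result
  group_by(date.time = floor_date(x = date.time, unit = "min")) %>%
  # Collapse each minute to the mean of its observations
  summarise(normaldist = mean(normaldist, na.rm = TRUE))
rolling_avg.df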
Double Checking the Output
Still not convinced? Perhaps it seems a little too good to be true? Feel free to run the sanity checks below to confirm the accuracy of this method for yourself:
# Sanity Check
mean(df$normaldist[1:60])    # Avg of dataset for first minute
rolling_avg.df[1,2]
mean(df$normaldist[61:120])  # Avg of dataset for second minute
rolling_avg.df[2,2]
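The same check can be extended to every minute at once. The sketch below rebuilds the data, then recomputes the per-minute means independently in base R with tapply and compares them against the dplyr result. The names minute_index and manual_means are introduced here for illustration only.

```r
library(dplyr)
library(lubridate)

date.time <- seq(from = ymd_hms("2020-01-01 00:00:00"),
                 to   = ymd_hms("2020-01-01 01:00:00"),
                 by   = "sec")
normaldist <- rnorm(n = length(date.time), mean = 100, sd = 5)
df <- data.frame(date.time, normaldist)

rolling_avg.df <- df %>%
  group_by(date.time = floor_date(x = date.time, unit = "min")) %>%
  summarise(normaldist = mean(normaldist, na.rm = TRUE))

# Independent calculation: label each row with the minute it belongs to (0, 1, 2, ...)
elapsed_secs <- as.numeric(difftime(df$date.time, df$date.time[1], units = "secs"))
minute_index <- elapsed_secs %/% 60
manual_means <- as.numeric(tapply(df$normaldist, minute_index, mean))

# TRUE when every per-minute mean matches (within floating-point tolerance)
all.equal(rolling_avg.df$normaldist, manual_means)
```

Note that the final group contains only the single 01:00:00 observation, so its "mean" is just that one value.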
Another nice thing about this method is not being tied to only using mean; median is also perfectly acceptable if it’s better for the needs at hand.
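As a minimal sketch of swapping in a different summary statistic, the block below repeats the same grouping and replaces mean with median. The name rolling_median.df is hypothetical and chosen here only to mirror the earlier example.

```r
library(dplyr)
library(lubridate)

date.time <- seq(from = ymd_hms("2020-01-01 00:00:00"),
                 to   = ymd_hms("2020-01-01 01:00:00"),
                 by   = "sec")
normaldist <- rnorm(n = length(date.time), mean = 100, sd = 5)
df <- data.frame(date.time, normaldist)

# Same grouping as before; only the summary function changes
rolling_median.df <- df %>%
  group_by(date.time = floor_date(x = date.time, unit = "min")) %>%
  summarise(normaldist = median(normaldist, na.rm = TRUE))
rolling_median.df
```

Any function that reduces a vector to a single value can be used inside summarise in the same way.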
While this tutorial does not cover alternate methods of applying rolling average functions, it is worth noting their existence such as the
rollmean function in the
ma from the