Handling Time Series Data with Rolling Averages

Introduction

When working with time series data, it can be useful to consolidate high-frequency data into lower-frequency increments. At first this sounds contrary to common practice; after all, isn’t more data always better? That isn’t always the case, and this is where a rolling average (also known as a moving average) can help.

Note that this tutorial only covers use cases where time series values are of interest.

Consider the following use cases:

  • Data is received in minute-by-minute intervals, but only daily values are needed in the end result
  • There are millions of data points taken at millisecond frequency, but the dataset is so large that it must be reduced before it can be plotted
  • Observations span years, but only monthly values are of interest

These examples are more common than expected in the real world. Why plot millions of data points in a time series when your eyes can only process a fraction of them in the final output? Rendering that many points also places an unnecessary burden on the computer in packages like ggplot2.

To address this, a rolling average is very useful, and there are many resources describing how to compute one. This tutorial presents only one approach. The following libraries will be used:

library(dplyr) # To utilize tidy grammar and piping i.e. %>%
library(lubridate) # To make use of the floor_date() function

Create Example Time Series

Other areas of this website go over the use of dplyr and ggplot2, and how lubridate can be a godsend for working with dates in R. The main focus here is using these packages in combination with the floor_date() function. floor_date takes date-time inputs and rounds them down to the start of a specified unit; its logical partners, round_date and ceiling_date, round to the nearest unit and up to the next unit, respectively. To see floor_date in action, set up a vector of date-times that increments by 1 second from 0 to 120 seconds:

example <- seq(from = ymd_hms("2020-01-01 00:00:00"), to = ymd_hms("2020-01-01 00:02:00"), by = "sec")

example

Example Time Series Data

Now apply the floor_date command in conjunction with dplyr’s piping operator and observe how the data changes:

example %>% 
  floor_date(unit = "min")

Example Floor Date

All of the timestamps before the first minute mark get rounded down to 00:00:00, and those from 00:01:00 through 00:01:59 get rounded down to 00:01:00. The final timestamp, 00:02:00, already sits on a minute boundary, so it stays at 00:02:00.
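To make the difference between floor_date and its partners concrete, here is a minimal sketch that applies all three functions to the same two timestamps. The timestamps and the variable names (x, floored, rounded, ceilinged) are illustrative assumptions, not part of the original example.

```r
library(lubridate)

# Two illustrative timestamps: 20 and 40 seconds past 00:01:00
x <- ymd_hms(c("2020-01-01 00:01:20", "2020-01-01 00:01:40"))

floored   <- floor_date(x, unit = "min")    # down to the start of the minute
rounded   <- round_date(x, unit = "min")    # to the nearest minute
ceilinged <- ceiling_date(x, unit = "min")  # up to the next minute

# Show the three results side by side as clock times
data.frame(
  original = format(x, "%H:%M:%S"),
  floor    = format(floored, "%H:%M:%S"),
  round    = format(rounded, "%H:%M:%S"),
  ceiling  = format(ceilinged, "%H:%M:%S")
)
```

Only floor_date guarantees that every observation is assigned to the minute in which it occurred, which is why it is the one used for grouping in the rest of this tutorial.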

Apply Rolling Average

This isn’t quite useful yet; after all, the only thing that has happened is that any useful differentiation in the example vector has been eliminated. This time, pair the time series with values drawn from a random normal distribution, extend the time interval to an hour while keeping the 1-second increment, and assign the result to a data frame, df:

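# One timestamp per second across a full hour (3601 values, both endpoints included)
date.time <- seq(from = ymd_hms("2020-01-01 00:00:00"), to = ymd_hms("2020-01-01 01:00:00"), by = "sec")
# One random normal value per timestamp
normaldist <- rnorm(n = length(date.time), mean = 100, sd = 5)

df <- data.frame(date.time, normaldist)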

By combining floor_date with group_by and summarise from dplyr’s collection of tidy functions, averaging can be specified in a very intuitive manner. Each minute forms its own non-overlapping window. Below, observe how the date.time column is consolidated to one row per minute and the normaldist column is redefined as the mean of its original values within that minute:

rolling_avg.df <- df %>%
  group_by(date.time = floor_date(x = date.time, unit = "min")) %>%
  summarise(normaldist = mean(normaldist, na.rm = TRUE))

rolling_avg.df

Final Rolling Average
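It can also help to see how many observations feed each average. The sketch below rebuilds the same kind of data frame and adds a per-minute count with dplyr’s n(). The seed value and the names binned.df and n_obs are assumptions made for illustration only.

```r
library(dplyr)
library(lubridate)

set.seed(42)  # arbitrary seed, used only to make this sketch reproducible

date.time <- seq(from = ymd_hms("2020-01-01 00:00:00"),
                 to   = ymd_hms("2020-01-01 01:00:00"),
                 by   = "sec")
df <- data.frame(date.time,
                 normaldist = rnorm(n = length(date.time), mean = 100, sd = 5))

binned.df <- df %>%
  group_by(date.time = floor_date(date.time, unit = "min")) %>%
  summarise(normaldist = mean(normaldist, na.rm = TRUE),
            n_obs = n())  # n_obs: hypothetical name for the count per minute

head(binned.df, 3)  # full minutes contain 60 observations each
tail(binned.df, 1)  # the 01:00:00 row contains only the single closing timestamp
```

Because the sequence includes both endpoints, the last row is an average of one value. Whether to keep or drop such a partial interval depends on the needs at hand.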

Double Checking the Output

Still not convinced? Perhaps it seems a little too good to be true? Feel free to run the sanity checks below and confirm the accuracy of this method for yourself:

# Sanity Check
mean(df$normaldist[1:60]) # Avg of dataset for first minute
rolling_avg.df[1,2]

mean(df$normaldist[61:120]) # Avg of dataset for second minute
rolling_avg.df[2,2]

Another nice thing about this method is that it is not tied to the mean; the median is also perfectly acceptable if it better suits the needs at hand.
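As a minimal sketch of swapping in a different summary, the following computes both the mean and the median per minute. The seed and the names summary.df, mean_val, and median_val are illustrative assumptions.

```r
library(dplyr)
library(lubridate)

set.seed(1)  # arbitrary seed, used only to make this sketch reproducible

date.time <- seq(from = ymd_hms("2020-01-01 00:00:00"),
                 to   = ymd_hms("2020-01-01 01:00:00"),
                 by   = "sec")
df <- data.frame(date.time,
                 normaldist = rnorm(n = length(date.time), mean = 100, sd = 5))

summary.df <- df %>%
  group_by(date.time = floor_date(date.time, unit = "min")) %>%
  summarise(mean_val   = mean(normaldist, na.rm = TRUE),    # per-minute mean
            median_val = median(normaldist, na.rm = TRUE))  # per-minute median

head(summary.df)
```

Any function that reduces a vector to a single value can be used inside summarise in the same way.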

Other Resources

While this tutorial does not cover alternate methods of applying rolling average functions, it is worth noting that they exist, such as the rollmean function in the zoo package, the moving average functions (for example SMA) in TTR, and ma from the forecast package.
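For contrast with the grouped approach used in this tutorial, here is a rough sketch of a sliding-window average in which the windows overlap and the output keeps one value per input. It uses base R’s stats::filter as a stand-in and does not call any of the packages named above. The window width k, the seed, and the variable names are assumptions.

```r
set.seed(7)  # arbitrary seed, used only to make this sketch reproducible

x <- rnorm(n = 10, mean = 100, sd = 5)
k <- 3  # illustrative window width

# Trailing moving average: each value is the mean of the current
# point and the k - 1 points before it. stats:: is spelled out
# because dplyr also exports a function named filter.
rolled <- as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))

rolled  # the first k - 1 entries are NA because their window is incomplete
```

The result has the same length as the input, whereas the floor_date approach returns one row per time interval.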
